Hi. While looking at the datasets in detail, we found multiple rows that share the same clinical trial ID and differ only in indication and outcome. Since we are trying to predict the probability of success of a particular trial, shouldn't every clinical trial ID be unique? As it stands, we seem to be predicting the success of an indication/trial pair rather than of the trial itself, which changes the problem considerably.
Also, 506 clinical trial IDs appear in both the train and test datasets, which might cause a data leakage problem.
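
For anyone who wants to reproduce the two checks, here is a rough sketch of how we counted. It uses only the standard library, and the column name `trial_id` and the toy rows are made up for illustration; the real files use their own column names and would need to be loaded first.

```python
from collections import Counter

def duplicated_ids(rows, key="trial_id"):
    """Return {id: row_count} for IDs that appear on more than one row."""
    counts = Counter(row[key] for row in rows)
    return {tid: n for tid, n in counts.items() if n > 1}

def shared_ids(train_rows, test_rows, key="trial_id"):
    """Return the set of IDs present in both splits."""
    return {r[key] for r in train_rows} & {r[key] for r in test_rows}

# Toy rows, invented for illustration only (not taken from the real data).
# "trial_id", "indication" and "outcome" are hypothetical column names.
train = [
    {"trial_id": "T1", "indication": "A", "outcome": 1},
    {"trial_id": "T1", "indication": "B", "outcome": 0},
    {"trial_id": "T2", "indication": "A", "outcome": 1},
]
test = [
    {"trial_id": "T1", "indication": "C"},
    {"trial_id": "T3", "indication": "A"},
]

dups = duplicated_ids(train)      # IDs repeated within the train split
overlap = shared_ids(train, test) # IDs that occur in both splits

print("repeated in train:", dups)
print("shared between train and test:", overlap)
```

If the overlap is intentional, one option on our side would be to keep all rows of a given trial ID in the same fold when validating locally, so that validation scores are not inflated by the same trial appearing on both sides.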