Duplicates of intClinicalTrialID

michal-pikusa · December 3, 2019, 2:07pm

Hi. While investigating the datasets in detail, we found that multiple rows share the same clinical trial ID and differ only in indication and outcome. Since we are trying to predict the probability of success of a particular trial, shouldn’t all clinical trial IDs be unique? As it stands, it seems we are predicting the success of a particular indication/trial pair rather than the trial itself, which changes a lot.

Also, 506 clinical trial IDs appear in both the train and test datasets, which might cause a data leakage problem.
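
For anyone who wants to reproduce both checks, here is a minimal sketch in plain Python. It runs on toy rows, not the competition data. Only `intClinicalTrialID` is a name taken from this thread; the `indication` and `outcome` fields and the helper names `duplicated_ids` and `shared_ids` are assumptions made for illustration.

```python
from collections import Counter

# Toy rows standing in for the real train and test files.
# Field names other than intClinicalTrialID are hypothetical.
train = [
    {"intClinicalTrialID": 101, "indication": "asthma", "outcome": 1},
    {"intClinicalTrialID": 101, "indication": "COPD", "outcome": 0},
    {"intClinicalTrialID": 102, "indication": "asthma", "outcome": 1},
]
test = [
    {"intClinicalTrialID": 101, "indication": "rhinitis"},
    {"intClinicalTrialID": 203, "indication": "asthma"},
]


def duplicated_ids(rows, key="intClinicalTrialID"):
    """Return {id: row_count} for every ID that occurs on more than one row."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


def shared_ids(rows_a, rows_b, key="intClinicalTrialID"):
    """Return the set of IDs that appear in both collections of rows."""
    return {row[key] for row in rows_a} & {row[key] for row in rows_b}


dup_train = duplicated_ids(train)  # trial IDs repeated within train
overlap = shared_ids(train, test)  # trial IDs present in train and in test
print(dup_train)
print(overlap)
```

On the real files, `len(overlap)` would be the number to compare against the 506 reported above.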

kelleni2 · December 3, 2019, 3:44pm

Hi - yes, many columns, including trial ID, are not unique.

A trial can officially have multiple indications.

A clinical “program”, which is what we are trying to predict, is the approval of a drug-indication pair, and that pair should always have the same label in the data.
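
The claim that a drug-indication pair always carries the same label can be verified directly. The following is only a sketch on toy rows: the field names `drug`, `indication`, and `label` and the helper `inconsistent_pairs` are hypothetical, and the last toy row is deliberately inconsistent so the check has something to report.

```python
from collections import defaultdict

# Toy rows; drug, indication and label are hypothetical field names.
rows = [
    {"drug": "A", "indication": "asthma", "intClinicalTrialID": 101, "label": 1},
    {"drug": "A", "indication": "asthma", "intClinicalTrialID": 102, "label": 1},
    {"drug": "A", "indication": "COPD", "intClinicalTrialID": 101, "label": 0},
    {"drug": "B", "indication": "asthma", "intClinicalTrialID": 103, "label": 0},
    # Deliberately conflicting label for the (B, asthma) pair:
    {"drug": "B", "indication": "asthma", "intClinicalTrialID": 104, "label": 1},
]


def inconsistent_pairs(rows):
    """Return {(drug, indication): labels_seen} for pairs with more than one label."""
    labels = defaultdict(set)
    for row in rows:
        labels[(row["drug"], row["indication"])].add(row["label"])
    return {pair: seen for pair, seen in labels.items() if len(seen) > 1}


conflicts = inconsistent_pairs(rows)
print(conflicts)
```

An empty result on the real data would confirm that the label is a property of the drug-indication pair and not of the individual row.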

maruthi0506 · December 4, 2019, 6:28am

Isn’t the drug-indication-trialid combination unique? I ask because we see different trials for the same drug-indication pair.
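
One way to test this is to check whether a given set of columns identifies each row uniquely. This is a sketch on toy rows; `drug` and `indication` are hypothetical field names and `is_unique_key` is a helper made up for illustration.

```python
from collections import Counter

# Toy rows; drug and indication are hypothetical field names.
rows = [
    {"drug": "A", "indication": "asthma", "intClinicalTrialID": 101},
    {"drug": "A", "indication": "asthma", "intClinicalTrialID": 102},
    {"drug": "A", "indication": "COPD", "intClinicalTrialID": 101},
]


def is_unique_key(rows, columns):
    """Return True if no two rows share the same values in all given columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(n == 1 for n in counts.values())


trial_only = is_unique_key(rows, ["intClinicalTrialID"])
pair_only = is_unique_key(rows, ["drug", "indication"])
triple = is_unique_key(rows, ["drug", "indication", "intClinicalTrialID"])
print(trial_only, pair_only, triple)
```

In this toy example neither the trial ID nor the drug-indication pair is unique on its own, but the three columns together are. Whether that also holds for the competition data is something the check would have to show.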