EndphaseYear in train and test set

Q1: The rule says:

  • Training data 2015 and earlier
  • Test data 2016 and later

The current data conflicts with these rules. There are no training records for 2015, and there are test records for 2014 and 2015.
Can we assume that no test records fall in the training years? Such records could change the data distribution used for training the model.

Q2: The question we care more about is whether all the test records are REAL records, with no randomized decoy records. We plan to create new features that rely on variable distributions, which means computing the same features for the test set. If the test set contains randomized records, it will defeat our feature engineering effort.

Thanks

Hi,

Q1: The split was done on OutcomeYear, which was removed from the training data set, as this would not be available directly following a phase 2.

EndphaseYear would be available, and was left in to carry the temporal signal, as the base approval percentage has been declining.
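
As a rough illustration of why EndphaseYear values can overlap between the two sets, here is a minimal sketch. The records and years are invented toy data, not the competition data, and the helper name split_by_outcome_year is hypothetical. It assumes the cutoff from the rule quoted above (2015 and earlier for training, 2016 and later for test) applied to OutcomeYear.

```python
# Hypothetical toy records; all years are invented for illustration only.
trials = [
    {"trial": "A", "EndphaseYear": 2012, "OutcomeYear": 2014},
    {"trial": "B", "EndphaseYear": 2014, "OutcomeYear": 2015},
    {"trial": "C", "EndphaseYear": 2014, "OutcomeYear": 2016},
    {"trial": "D", "EndphaseYear": 2015, "OutcomeYear": 2017},
    {"trial": "E", "EndphaseYear": 2016, "OutcomeYear": 2018},
]

CUTOFF = 2015  # assumed from the rule: train <= 2015, test >= 2016


def split_by_outcome_year(records, cutoff=CUTOFF):
    """Split on OutcomeYear, not on EndphaseYear."""
    train = [r for r in records if r["OutcomeYear"] <= cutoff]
    test = [r for r in records if r["OutcomeYear"] > cutoff]
    return train, test


train, test = split_by_outcome_year(trials)

# EndphaseYear values seen on each side of the split.
train_end_years = sorted({r["EndphaseYear"] for r in train})
test_end_years = sorted({r["EndphaseYear"] for r in test})

print(train_end_years)  # [2012, 2014]
print(test_end_years)   # [2014, 2015, 2016]
```

In this toy case a trial whose phase ended in 2014 or 2015 lands in the test set because its outcome came in 2016 or later, so the EndphaseYear ranges of the two sets overlap even though the OutcomeYear split is clean.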

Q2: Great question. All records are real trials. The trial IDs are also as given in the raw data and were not randomized.