Q1: The rules say:
- Training data 2015 and earlier
- Test data 2016 and later
The current data conflicts with these rules: there are no training records for 2015, but there are test records for 2014 and 2015.
Can we assume that no test records will appear in the training years? Such records could change the data distribution used for training the model.
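A minimal sketch of how this mismatch can be checked, assuming the year of each record has already been extracted into a plain list. The function name `check_year_split` and the toy year values are hypothetical and are not taken from the actual data:

```python
from collections import Counter

CUTOFF_YEAR = 2015  # from the stated rule: training <= 2015, test >= 2016


def check_year_split(train_years, test_years, cutoff=CUTOFF_YEAR):
    """Report where the train/test years disagree with the stated rule."""
    train_counts = Counter(train_years)
    test_counts = Counter(test_years)
    return {
        # Training records dated after the cutoff (should be empty).
        "train_after_cutoff": {
            y: n for y, n in sorted(train_counts.items()) if y > cutoff
        },
        # Test records dated at or before the cutoff (should be empty).
        "test_at_or_before_cutoff": {
            y: n for y, n in sorted(test_counts.items()) if y <= cutoff
        },
        # Whether the cutoff year itself has any training records.
        "cutoff_year_in_train": cutoff in train_counts,
        # Years that occur in both sets.
        "overlapping_years": sorted(set(train_counts) & set(test_counts)),
    }


# Toy values for illustration only, not the real data.
report = check_year_split([2012, 2013, 2014, 2014], [2014, 2015, 2016, 2017])
print(report)
```

On the toy values, the report flags test records in 2014 and 2015 and no training records for 2015, which is the same pattern described above.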
Q2: The question we care more about is whether all the test records are REAL records, with no randomized decoy records mixed in. We plan to create new features that rely on variable distributions, which means we also have to compute those features for the test set. If the test set contains randomized records, that will defeat our feature engineering effort.
Thanks