There is no way the score of 294 could have been achieved without the target leaking into the test set.
Could they have exploited the way the testing is done (year 1 data being used to test year 1, year 1-2 data to test year 2, etc.)? Are there records in the test set that could be identified if you were to save long lists of policy IDs and years and match them manually?
It seems very very strange to me too. In any case, the leaderboard that really matters is the profit one…
Indeed, this value is not correct. We've already discussed it with the participant (who raised the issue themselves!), and the submission will be removed shortly.
To explain what is going on: when lots of nan values were in the prices, those prices were still passed on to the evaluator and promptly ignored.
This will be fixed quite soon by ensuring that the prices do not contain nan values.
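For anyone curious why ignored nan prices can produce an implausibly good score, here is a minimal sketch. It is not the actual evaluator code: it assumes an RMSE-style metric, and the names `rmse_ignoring_nan` and `validate_prices` are hypothetical, made up for illustration.

```python
import numpy as np


def rmse_ignoring_nan(prices, claims):
    """Illustration only (assumed metric): RMSE that silently skips nan prices."""
    prices = np.asarray(prices, dtype=float)
    claims = np.asarray(claims, dtype=float)
    # nanmean drops every row whose squared error is nan,
    # so rows with a nan price never count against the score.
    return float(np.sqrt(np.nanmean((prices - claims) ** 2)))


def validate_prices(prices):
    """Hypothetical check: reject a submission whose prices contain nan."""
    prices = np.asarray(prices, dtype=float)
    n_bad = int(np.isnan(prices).sum())
    if n_bad > 0:
        raise ValueError(f"{n_bad} of {prices.size} prices are nan")
    return prices


# Toy data: one large claim, and the price for that policy is nan.
claims = [0.0, 0.0, 1000.0, 0.0]
prices = [10.0, 10.0, float("nan"), 10.0]

# The large-claim row is skipped, so the score looks far better than it should.
print(rmse_ignoring_nan(prices, claims))

# With the check in place, the same submission is rejected instead of scored.
try:
    validate_prices(prices)
except ValueError as err:
    print("rejected:", err)
```

In this toy case the only hard-to-predict row is the one that gets dropped, which is how a score can look too good without any target leakage.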
To be clear, there is no target leakage.
Sorry for the brief confusion!
whew thanks, this is my favorite competition in a long time!
@alfarzan sorry, just to clarify: are you referring to NaN values in the features or in the prices? Some features of the new dataset can stay NaN, as they would in reality … I just wanted to understand whether the final/target dataset will contain some NaNs like the historical one…
Hi @sandrodjay1
I was referring to the prices.
The final test data will have exactly the same level of cleaning as the training data. So yes, it will likely contain some NaNs.
Perfect, thanks!