Target leakage?

MakePredict · January 8, 2021, 7:46pm

There is no way the score of 294 could have been achieved without the target leaking into the test set.

MakePredict · January 8, 2021, 7:51pm

Could they have abused the way the testing is done, with year 1 data being used to test year 1, year 1-2 data to test year 2, etc.? Are there records in the test set that could be identified if you were to save long lists of policy IDs + years and match them manually?

MakePredict · January 8, 2021, 7:56pm

Tell me this doesn’t look sketchy.

RHG · January 8, 2021, 7:59pm

It seems very very strange to me too. In any case, the leaderboard that really matters is the profit one…

alfarzan · January 8, 2021, 8:02pm

Indeed, this value is not correct. We’ve already discussed it with the participant (who raised the issue themselves!) and the submission will be removed shortly.

To explain what is going on: when lots of NaN values were in the prices, those prices were still passed on to the evaluator and promptly ignored.

This will be fixed quite soon by ensuring that the prices do not contain NaN values.
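For anyone curious how ignored NaN prices can distort a score, here is a minimal sketch. It assumes an RMSE-style metric and uses made-up numbers; `rmse_ignoring_nan` and `validate_prices` are hypothetical names for illustration, not the competition's actual evaluator code.

```python
import math

def rmse_ignoring_nan(prices, claims):
    """RMSE over only the rows where the price is not NaN (the distorting behaviour)."""
    pairs = [(p, c) for p, c in zip(prices, claims) if not math.isnan(p)]
    return math.sqrt(sum((p - c) ** 2 for p, c in pairs) / len(pairs))

def validate_prices(prices):
    """Hypothetical guard: reject a submission containing NaN or infinite prices."""
    bad = [i for i, p in enumerate(prices) if not math.isfinite(p)]
    if bad:
        raise ValueError(f"{len(bad)} non-finite price(s), first at index {bad[0]}")
    return list(prices)

# Made-up data: one large claim among several zero claims.
claims = [0.0, 0.0, 1000.0, 0.0]
all_finite = [100.0, 100.0, 100.0, 100.0]
with_nan = [0.0, 0.0, float("nan"), 0.0]  # NaN happens to fall on the large claim

score_finite = rmse_ignoring_nan(all_finite, claims)
score_nan = rmse_ignoring_nan(with_nan, claims)  # the hard row is silently dropped
```

In this toy case the row with the large claim drops out of the average, so the second score looks far better than it should. Running something like `validate_prices` before scoring turns that silent drop into an explicit error.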

To be clear, there is no target leakage.
Sorry for the brief confusion!

MakePredict · January 8, 2021, 8:04pm

Whew, thanks, this is my favorite competition in a long time!

sandro_djay · January 9, 2021, 11:39am

@alfarzan sorry, just to clarify: were you referring to NaNs in the features or in the prices? Some features of the new dataset can stay NaN, as they would in reality … I just wanted to understand whether the final/target dataset will contain some NaNs like the historical one…

alfarzan · January 9, 2021, 11:42am

Hi @sandrodjay1

I was referring to the prices.

The final test data will have exactly the same level of cleaning as the training data. So yes, it will likely contain some NaNs.
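In practice this means a model may receive NaN features but still has to return finite prices. One simple way to cope, sketched here as an assumption rather than a recommended approach, is to learn a fill value from the training data and reuse it at prediction time. The helper names and the example column are hypothetical.

```python
import math
from statistics import median

def fit_median(train_column):
    """Median of the observed (non-NaN) values in a training column."""
    observed = [v for v in train_column if not math.isnan(v)]
    return median(observed)

def fill_nan(column, fill):
    """Replace NaN entries with a fill value learned from the training data."""
    return [fill if math.isnan(v) else v for v in column]

# Made-up numeric feature with a missing value in both train and test.
train_feature = [1.0, float("nan"), 3.0]
test_feature = [float("nan"), 5.0]

fill = fit_median(train_feature)
clean_test = fill_nan(test_feature, fill)
```

Learning the fill value on the training data only, then applying it to the test rows, keeps the preprocessing consistent between the two.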

sandro_djay · January 9, 2021, 11:46am

Perfect, thanks!