But what about the data leak?

Most people probably know by now that there is a data leak in the dataset (including the test set). I'm not going to describe it here, but is it going to be rectified, or is that how we roll until the end?

It isn't guaranteed that a model trained and tested on such a dataset will do well in other real-world scenarios. It also promotes non-learning-based, hardcoded approaches that would certainly fail on any well-prepared data.

Hey,

We know about the problem and are looking into possible solutions. We will update you shortly on the issue.

Regards
Ayush


Any update on this? 🙂

Hey,

The final announcement regarding it will be made on Saturday: either a new dataset will be provided or the problem's weightage will be set to zero.

Regards
Ayush

I hope you will consider that many teams did not exploit the data leak and spent a lot of time trying out different models; an implementation that scores 99.9% is certainly not relying on the leak. I hope you will add additional data and re-check, since even Kaggle precedent dictates that when a data leak is found, the question stands and the teams that found it do get an unfair advantage. Testing with more data seems the best option; cutting the question entirely seems highly unfair.


Hey,

We will not be updating the data or cutting the question. We will stick to the current dataset, since many teams have spent their time solving the problem without taking the leak into account. But to ensure the leak is not exploited, the code from the top teams will be manually validated. Any team found to be using the leak will be disqualified.

Regards
Ayush
