Calling on the organizer team to ban using external data in online code submission

@mohanty @shivam

In a recent post, one of the competition organizers stated that “you can add more information about the products in your submission”. We believe that this is not a fair rule. Allowing crawled external data (which could be images, product descriptions, and more) in online code submissions would introduce a serious data leakage problem, as participants may match the product information of the online private test set against the crawled external data and thus obtain direct predictions. There is ALWAYS a chance that the crawled data contains some of the product information of the private test set.

As the organizer said in another post, the core spirit of this competition is “to collectively better understand the potential solutions to the tasks posed in this competition”. We believe that this rule goes against that spirit. Participants should provide solutions that actually help solve the real-world problem of product search. Taking advantage of potential data leakage in self-crawled external data does not help with that problem; it only improves the score.

We propose banning the use of external data in online code submissions. If that is unacceptable, changing the product catalog CSV of the private test set would be an alternative way to alleviate (although not completely prevent) data leakage, as the current product catalog CSV is the same as the publicly available one. The changed CSV should include new product entries that are not present in any publicly available product catalog CSV.
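To make the overlap concern concrete, here is a minimal sketch of how one might measure how much of a private catalog is already covered by a public one. The function name `catalog_overlap` and the example ids are hypothetical, and the sketch assumes products can be compared by id alone; it does not reflect the organizers' actual data or tooling.

```python
def catalog_overlap(private_ids, public_ids):
    """Fraction of private-catalog product ids that also appear in a public catalog.

    Both arguments are iterables of product id strings (an assumption for
    this sketch; real catalogs may need normalization before comparing).
    """
    private = set(private_ids)
    if not private:
        return 0.0
    shared = private & set(public_ids)
    return len(shared) / len(private)


# Hypothetical ids, for illustration only.
public_ids = ["B001", "B002", "B003"]
private_ids = ["B002", "B003", "B004", "B005"]
overlap = catalog_overlap(private_ids, public_ids)
```

An overlap close to 1.0 would mean almost every private product could in principle be matched against publicly crawlable information; adding new entries to the private catalog lowers this fraction.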


Hi @TransiEnt ,

We communicated the response from the Amazon team on this query about the use of external data.

We acknowledge the points you raised, and are communicating them back to the Amazon team for further review.

Thanks,
Mohanty


Although our team does not use any external data, we do not support changing the rules any further. Repeatedly changing the rules makes this competition feel like a joke and wears all of us out!

Please do not change any rule again; I believe the host already promised this in the deadline extension post. @mohanty

Note that, unlike task 1, which @TransiEnt focuses on, task 2 required a lot of effort to make the code more efficient. So we pre-processed and tokenized all products in advance. Besides, the product id itself is also a feature that we feed into the transformers. If the products are changed, all of my models would need to be retrained, and I do not have enough computing resources. So it is impossible to ban a product id here.

The only way to ban external data is to inspect the code afterwards, which would be extremely hard for the host to do.

Best
Fanyou

@wufanyou : Please note that before any of the prizes are awarded, there is a due diligence process in place, which would, among other things, involve a manual review and inspection of the code to ensure that the award-winning submissions are not engaging in any malpractice. The review will be done on the code that was actually executed to generate the associated scores (even if participants delete the said code from gitlab.aicrowd.com, the evaluation servers retain a snapshot of the code used for evaluation).

Best,
Mohanty

If a manual code review can be done, then I support banning external data, as our team might benefit from the ban. But my position is still based on the rules themselves. It is really not wise to change anything at this stage.

@wufanyou : There is no intent to change any Rules at all. This is a clarifying question from a participating team that we are trying to address together with the Amazon team. And we appreciate all your inputs.

Best,
Mohanty

Hi @mohanty : If you don’t change the product_catalogue, the best approach now is to manually review the code and data to prevent the use of external data during inference; otherwise it will be unfair to the many contestants who don’t use external data. I hope you can provide a fair plan as soon as possible.

Another way to ensure fairness is to require all teams to publish their external data.

@TransiEnt : The organizers of this competition reviewed the points you raised and still hold the stance that it is okay to use External Datasets in the preparation of your submissions (during the training phase) and also during inference. Your submissions are still expected not to fail on unseen product_ids.

Additionally, the organizers do not believe that the use of External Datasets provides an unfair advantage to any specific team.
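As an illustration of the "should not fail on unseen product_ids" expectation, here is a minimal sketch of one common way to handle it when the product id is used as a model feature: reserve an index for unknown ids and fall back to it at inference time. The names `UNK_INDEX`, `build_id_vocab`, and `encode_product_id` are hypothetical and are not part of the competition starter kit or any participant's solution.

```python
UNK_INDEX = 0  # reserved index for product ids never seen during training


def build_id_vocab(product_ids):
    """Map each distinct training product id to an integer index starting at 1."""
    vocab = {}
    for pid in product_ids:
        if pid not in vocab:
            vocab[pid] = len(vocab) + 1
    return vocab


def encode_product_id(pid, vocab):
    """Return the index for a product id, falling back to UNK_INDEX if unseen."""
    return vocab.get(pid, UNK_INDEX)


# Hypothetical ids, for illustration only.
vocab = build_id_vocab(["B001", "B002", "B001"])
seen = encode_product_id("B002", vocab)
unseen = encode_product_id("B999", vocab)  # not in the training catalog
```

With this kind of fallback, a changed or extended catalog degrades the id feature for new products instead of raising a lookup error during evaluation.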

OK, I got it. Thank you.