Task 1 has query/product pairs with empty documents in both the training and test datasets

Hi all, I found some problems with the datasets for Task 1.

1st: The training data has annotated query/product pairs whose products are empty in all fields.


2nd: The same is observed in the test set, which is the major problem, as we have no basis on which to rank those products.


3rd: Back on the training set, there are some Korean, Chinese and Arabic queries marked as English.

Let me know if you’re not able to reproduce and I’ll provide the code.
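For anyone who wants to check in the meantime, here is a minimal sketch of the first and third checks using only the standard library. The column names `query`, `product_id` and everything in `TEXT_FIELDS` are assumptions about the CSV headers, not confirmed names, so adjust them to the real files. The script check only looks for the three scripts mentioned above, and "CJK" also matches Japanese kanji, so it flags a query without identifying its language.

```python
import unicodedata

# Assumed text columns of the product catalogue; adjust to the real header.
TEXT_FIELDS = ["product_title", "product_description", "product_bullet_point",
               "product_brand", "product_color_name"]

# Scripts to flag in queries that are marked as English.
FLAGGED_SCRIPTS = ("HANGUL", "CJK", "ARABIC")


def is_empty_product(product):
    """True if every assumed text field is missing or whitespace-only."""
    return all(not (product.get(field) or "").strip() for field in TEXT_FIELDS)


def pairs_with_empty_products(pairs, catalogue):
    """Return (query, product_id) for pairs whose product has no text at all.

    `pairs` is an iterable of dict rows; `catalogue` maps product_id -> dict row.
    A product_id that is missing from the catalogue is also treated as empty.
    """
    return [(row["query"], row["product_id"]) for row in pairs
            if is_empty_product(catalogue.get(row["product_id"], {}))]


def flagged_scripts(text):
    """Return which of FLAGGED_SCRIPTS occur among the letters of `text`."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script, e.g. "HANGUL SYLLABLE HAN".
        name = unicodedata.name(ch, "")
        for script in FLAGGED_SCRIPTS:
            if name.startswith(script):
                found.add(script)
    return found
```

Rows can be loaded with `csv.DictReader`, building `catalogue` as a dict keyed by `product_id`. Any English-marked query for which `flagged_scripts(query)` is non-empty is worth a manual look.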


I've found the same issue as well.


Has anybody from the AIcrowd Team had a chance to take a look at the issue?

I also found that in both the Task 1 and Task 2 training sets there are duplicate annotated query/product pairs with different “esci_label” values, including pairs that are annotated with both “exact” and “irrelevant” labels.

Although they possess different "product_id"s, when concatenating all fields from product_catalogue-v0.2.csv, they end up with the exact same description.
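A rough sketch of how this check could be reproduced, again with the standard library only. The `query` column and the contents of `fields` are assumptions about the file headers; `esci_label` and `product_id` are the names quoted above. It groups annotated pairs by query and concatenated product text, then keeps the groups that carry more than one label.

```python
from collections import defaultdict


def concat_fields(product, fields):
    """Join the given product fields into one string, treating missing as empty."""
    return " ".join((product.get(field) or "").strip() for field in fields)


def conflicting_duplicates(pairs, catalogue, fields):
    """Map (query, concatenated product text) -> labels, where labels disagree.

    `pairs` is an iterable of dict rows with "query", "product_id" and
    "esci_label"; `catalogue` maps product_id -> dict row of product fields.
    """
    labels = defaultdict(set)
    for row in pairs:
        product = catalogue.get(row["product_id"], {})
        key = (row["query"], concat_fields(product, fields))
        labels[key].add(row["esci_label"])
    # Keep only groups whose identical text received different labels.
    return {key: seen for key, seen in labels.items() if len(seen) > 1}
```

Each key in the result is a query together with a product text that appears under several "product_id"s, and the value is the set of labels those duplicates received.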

From task1:

From task2:
