😱 Why is there a `product_id` column in the Task 1 test set? 😱

The v0.1 Task-1 test set only had the following columns: query_id, query, query_locale.

In the v0.2 release, we introduced a separate column, product_id, and there has been some confusion as to why!

We understand, and let us explain!

The reason the v0.2 test set of Task 1 includes pairs of query_id and product_id is that we only have esci_labels for those specific query_id and product_id pairs, and only those pairs can be considered when computing the nDCG score. Any query_id and product_id pair that does not appear in the test set will not be considered when computing the nDCG score, since we do not have a corresponding esci_label.

This helps participants obtain a more meaningful nDCG score.

To help clarify this, let's imagine a situation where we only want the top three ranked products for the query included in the test set:

Version 0.1 dataset

Test set (query_id, query, query_locale):

query_1, "some query", us

Example submission (query_id, product_id):

query_1, product_11
query_1, product_7
query_1, product_9

If we do not have the esci_label for one of those three pairs (say the query_1, product_7 pair), then we will omit that pair and compute the nDCG using only the query_1, product_11 and query_1, product_9 pairs.

Version 0.2 dataset:

Test set (query_id, product_id, query_locale):

query_1, product_1, locale_us
query_1, product_3, locale_us
query_1, product_10, locale_us

Example submission (query_id, product_id):

query_1, product_10
query_1, product_1
query_1, product_3
query_1, product_39

In this output, we have labels for all the pairs except the query_1, product_39 pair.

But since participants already know which query-product pairs have esci_labels available, they should not have included the query_1, product_39 pair to begin with, as it is not included in the test set.

So in this case, we will compute the nDCG score using the three ranked products that are included in the test set, ignoring the query_1, product_39 pair, which was not.

In other words, the question to answer is how best to "order" the test set of Task-1 so as to optimize your nDCG score.
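To make the scoring concrete, here is a minimal sketch of how the v0.2 example above would be evaluated. The gain values, the `labels` dictionary, and the variable names are all illustrative, not the official evaluation code; the key point is that unlabeled pairs such as query_1, product_39 are dropped before the nDCG is computed.

```python
import math

# Hypothetical labeled test set: only these (query_id, product_id)
# pairs have an esci_label, here already mapped to an example gain.
labels = {
    ("query_1", "product_1"): 1.0,
    ("query_1", "product_3"): 0.1,
    ("query_1", "product_10"): 0.01,
}

# A participant's ranked submission for query_1, including an extra
# pair (product_39) that is not in the test set.
submission = ["product_10", "product_1", "product_3", "product_39"]

# Keep only pairs that appear in the test set; unlabeled pairs are
# simply omitted before scoring.
ranked = [p for p in submission if ("query_1", p) in labels]

def dcg(gains):
    # Standard log2 position discount: gain_i / log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

gains = [labels[("query_1", p)] for p in ranked]
ideal = sorted(labels.values(), reverse=True)
ndcg = dcg(gains) / dcg(ideal)
```

With this sketch, placing product_1 (the highest-gain product) first instead of product_10 would raise the score, which is exactly the ordering problem described above.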

Hope that clarifies the confusion.

If you have any more queries, please do not hesitate to reach out to us.




This is still unclear (and uncommon for retrieval tasks).
Do you mean that you already provide the product recall set for each query and participants are required to rerank that list only?

hi g_t. Yes, this is not a typical retrieval task. What you see in the test set is a product recall set for that query, and the ask is to reorder these products in the best way possible. This is typical in several retrieval applications where there's a retrieval step (matching) followed by a reordering step (ranking).

When computing the NDCG, the ground truth only contains labels for the pairs in this set.

thanks for the quick answer.
I still have a question about the graded relevance for NDCG calculation…
How do you map the various labels to grades?
i.e., is it something like irrelevant → 0, exact → 2, substitute → ?, complement → ?
Thanks again for quick response!


In our case we have four degrees of relevance (rel) for each query-product pair: Exact, Substitute, Complement and Irrelevant, for which we set gains of 1.0, 0.1, 0.01 and 0.0, respectively.

So you can treat this as a kind of weighted NDCG.
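Using the gains stated above, a per-query NDCG could be sketched like this (the lowercase label strings and the function name are illustrative, not the official evaluation code):

```python
import math

# Gains for the four ESCI relevance labels, as stated above.
ESCI_GAIN = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}

def ndcg(labels_in_rank_order):
    """NDCG over one query's ranked list of ESCI label strings."""
    gains = [ESCI_GAIN[label] for label in labels_in_rank_order]
    # DCG with the usual log2 position discount.
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    # Ideal DCG: the same gains sorted from highest to lowest.
    idcg = sum(g / math.log2(i + 2)
               for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, a ranking that already lists Exact before Substitute before Complement before Irrelevant scores 1.0, while pushing the Exact product down the list lowers the score.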

Hi, nikhil_rao. I have a question about Task 1's test set.

How can we validate our model's performance locally when the test set has no 'esci_label' column?

I mean, how do we calculate the rel_i for NDCG when we want to verify our model's performance on the test set?

Thanks for the response!