Very high scores for task 1. Maybe make it a full retrieval task again?

Dear ESCI organizers,

I’m participating in task 1 of the ESCI competition. The task is now a reranking task, in which only ~20 products should be reordered for a given query.

I believe this change made the task too “easy,” with the best teams already achieving an NDCG close to 0.90. This is probably close to the inter-annotator agreement. That is, if a different set of human experts annotated the test data, the ranking of the top-performing teams would probably change quite a bit.

In summary, I’m afraid that if you use the current evaluation method in the private leaderboard, you will not be able to accurately select the best reranking method.

Have you considered making it a full retrieval task again? That would bring it closer to a real-world problem and also make it harder, which would help differentiate the best algorithms.

Regarding the problem of (query_id, product_id) pairs that weren’t annotated but were retrieved by some submissions, annotating only the missing pairs of the top-10 submissions shouldn’t take long if you annotate 50 or 200 queries. Since this annotation strategy is costly, it should be used only for the private leaderboard. For the public leaderboard, you can use NDCG’ (“NDCG prime”), in which only the annotated pairs are used to compute the NDCG score. This metric correlates highly with NDCG computed on “full” annotations.
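
To make the NDCG’ idea concrete, here is a rough sketch of how it could be computed for one query. The function names, product ids, and gain values are illustrative assumptions on my part, not the official scorer: unannotated products are simply dropped from the submitted ranking before the usual NDCG is computed.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the common log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_prime(ranked_product_ids, judgments):
    """NDCG' for one query: ignore products that have no annotation.

    ranked_product_ids: product ids in the order returned by a submission.
    judgments: dict mapping annotated product id -> gain (hypothetical values).
    """
    # Keep only the annotated products, preserving the submitted order.
    condensed = [judgments[p] for p in ranked_product_ids if p in judgments]
    # The ideal ranking places all annotated products in decreasing gain order.
    ideal = dcg(sorted(judgments.values(), reverse=True))
    if ideal == 0:
        return 0.0
    return dcg(condensed) / ideal


# Hypothetical annotations for a single query.
judgments = {"p1": 1.0, "p2": 0.1, "p3": 0.0}
# "p9" and "p8" were retrieved but never annotated, so NDCG' skips them.
ranking = ["p9", "p1", "p8", "p2", "p3"]
score = ndcg_prime(ranking, judgments)
```

In this toy case the annotated products appear in ideal relative order, so the submission is not penalized for also retrieving unannotated products, whereas counting them as irrelevant would lower its score.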

Rodrigo Nogueira


Sure, the baseline performance is very close to the current highest score if we only use NDCG for evaluation.