Amazon KDD Cup 2022 Experience Report

Hi All,

I had a wonderful time participating in the Amazon KDD Cup challenge, and I have written up an experience report on my participation. Here is the link to the blog post: AICrowd Amazon KDD Search Relevance Hackathon 2022 Experience Report | fastpages

Also, inlining it here for everyone to view.

AICrowd Amazon KDD Search Relevance Hackathon 2022 Experience Report

Sharing experience of my participation in Amazon KDD Cup Challenge

  • Goal of Experience Report
  • About Amazon KDD Cup 2022
    • Task 1 - Query Product Ranking
    • Task 2 - Multiclass Product Classification
    • Task 3 - Product Substitute Identification
  • Phase 1: Reproduce the Baseline
    • Step 1: Submit the AICrowd template/dummy version
    • Step 2: Training the ESCI Baseline Model and making the first real submission
  • Phase 2: Improving the Baseline
    • Exploratory Data Analysis
    • Iteration 1: Including more fields in model training
    • Iteration 2: Unifying the Model
    • Iteration 3: Fixing Training Data
    • Iteration 4: Deeper Training
  • Submissions Pile-Up
  • Final Results
  • Conclusion
  • Links

Goal of Experience Report

This is my first NLP Hackathon.

I have been self-studying NLP over the past year and have thoroughly enjoyed this interesting field, which is full of potential. The NLP community has also been very open and helpful. Every day some team or other publishes reproducible work, helping practitioners, learners and enthusiasts keep up to date with the latest research findings.

My goal in sharing this experience report is to give back to this community, in the hope that someone finds it useful when getting started with their first ML/NLP hackathon.

About Amazon KDD Cup 2022

The Amazon KDD Cup Challenge is a hackathon hosted on the AICrowd platform. Its aim is to improve customer experience and engagement by significantly improving search relevance, using cutting-edge research in the fields of Search, NLP, Deep Learning and Vector Embeddings.

The dataset includes 130k+ queries, 1M+ catalog products and 2.6M+ judgements, with data distributed across the us, es and jp locales. It is one of the largest multilingual search relevance datasets I have seen in hackathons.

The hackathon kicked off on March 15th 2022, with the final submission on July 20th 2022.

The hackathon involves 3 separate tasks of improving the identification of products as per ESCI (Exact, Substitute, Complement, Irrelevant) categories.

Task 1 - Query Product Ranking

Given a search query, rank potential product matches based on their relevance

Task 2 - Multiclass Product Classification

Given a search query and product pair, classify the pair into one of the ESCI categories

Task 3 - Product Substitute Identification

Given a search query and product pair, predict if the product partially fulfills the search criteria and can be used as a functional substitute

Apart from the cash prizes, the winners of the KDD Cup challenge will also get a chance to showcase their winning approach as a paper at the SIGKDD Conference 2022.

I only attempted task 1, as I joined the competition very late and wanted to focus on getting a better rank in one task rather than a baseline score in all 3. I will share my experience solving task 1 here.

Phase 1: Reproduce the Baseline

The Amazon KDD Challenge also provided baseline code and an accompanying paper detailing the baseline approach.

For task 1 - query product ranking, the baseline approach used MS MARCO Cross Encoders for the US locale, and multilingual MPNet for the ES and JP locales.

As part of Phase 1, I tried to reproduce this baseline in AICrowd.

Step 1: Submit the AICrowd template/dummy version

AICrowd provided a template repo that accepted the solution based on their submission format. Note, this repo structure was different from the esci-baseline on github above.

To trigger the evaluation, you have to push a tag with a name like “submission-*”. Only such a tag triggers the evaluation and gives you the results.

The template repo came with a dummy implementation for prediction. Just to test that everything worked end to end, I checked in the dummy implementation and pushed the changes. To trigger the CI and evaluation step, I created the git tag submission-initial-version and pushed the tag to remote.

As expected, the tag was picked by the CI process, and after a bit of wait, the server gave me my first result:

πŸ† Scores
NDCG Score : 0.7486

With this, I had a spot on the leaderboard as well :relaxed:.
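For readers new to the metric, here is a minimal sketch of how NDCG can be computed for a single query. The gain assigned to each ESCI label (1.0, 0.1, 0.01, 0.0) is an assumption on my part, and the names ESCI_GAIN, dcg and ndcg are made up for illustration; the official evaluator may differ in its details.

```python
import math

# Assumed gain per ESCI label; an illustration, not the official
# evaluator's configuration.
ESCI_GAIN = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg(ranked_labels):
    """NDCG for one query, given the ESCI labels in predicted rank order."""
    gains = [ESCI_GAIN[label] for label in ranked_labels]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# A perfectly ordered list scores 1.0; swapping the top two items lowers it.
perfect = ndcg(["exact", "substitute", "complement", "irrelevant"])
swapped = ndcg(["substitute", "exact", "complement", "irrelevant"])
print(perfect, round(swapped, 4))
```

A leaderboard score would then be an average of this per-query value over all test queries, again assuming the usual definition of the metric.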

Step 2: Training the ESCI Baseline Model and making the first real submission

The esci-baseline solution on GitHub and the baseline repo on AICrowd were very different in their structure. So to get started, one had to really understand the esci-baseline, copy over the training and prediction components, and plug them into the forked AICrowd GitLab baseline.

IMHO, much time could have been saved if there was consistency between both of these repos.

As a golden rule of any Machine Learning project, my first goal was to get a baseline submission through, and then iteratively tweak my approach for better results.

Once I had refactored and merged the 2 baseline repos into a working state on my machine, the next step was to get my model trained.

I used Lambda Labs as my cloud GPU provider, as its pay-as-you-use pricing and range of GPU options satisfied my requirements.

I pushed my changes to AICrowd GitLab, and cloned the repo back onto the lambdalabs persistent storage disk. I also set up a virtual env on the persistent disk so that I did not have to set it up every time I spun up my VM.

Basic profiling showed me that the baseline approach was not optimized for multiple GPUs and did not utilize them, so I stuck to low-cost GPUs (A6000, $0.80/hr) for training the model.

Once my model was trained, I tested a dummy evaluation cycle on the VM itself, and it produced the results.csv. This gave me confidence that all the pieces of code were working as expected. Then I checked in my model files using git-lfs and pushed the code to the AICrowd repo. To trigger the CI step, I created the git tag submission-initial-version and pushed it to remote.

This tag failed because the products-catalogue file was missing in the CI setup. This took some time to debug, as even the git logs were not available. I raised the issue on the discussion board and Discord. I was advised to check in the catalogue file, so I did.

After some attempts, the tag went through and produced a result based on the model:

πŸ† Scores
NDCG Score : 0.8505

An increase of ~0.1 from the dummy random prediction version. A good start, but not as much as I was hoping for. With this, my ranking on the leaderboard improved significantly, to under 20.

Phase 2: Improving the Baseline

Exploratory Data Analysis

I started off with basic EDA on the data. A few of the findings from data exploration that I later incorporated in my model:

  1. There were other fields in the data with rich information. The baseline used only the title field, so there was an avenue to improve prediction by including the other fields as well
  2. There were some missing fields, but overall, there were only a few rows where all the fields were missing
  3. The non-alphanumeric characters/symbols in the data carried significant context, as they represented domain information. E.g. # was extensively used to denote the size or model of objects like lipstick, pens etc. So in general, clean up of the data was limited to removing unnecessary spaces and lowercasing for consistency
  4. The data appeared to be deliberately but incompletely shuffled. There were rows with the same query_id but different query values, as well as rows with different query_id but the same query value.
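The mismatch described in the last finding can be checked with a few lines of code. This is a minimal sketch in plain Python on made-up toy rows, assuming the two columns are named query_id and query; it is not the notebook I actually used.

```python
from collections import defaultdict

# Toy rows standing in for the training file (hypothetical values);
# the real data has many more columns.
rows = [
    {"query_id": 1, "query": "red lipstick #5"},
    {"query_id": 1, "query": "blue pen"},         # same id, different text
    {"query_id": 2, "query": "red lipstick #5"},  # same text, different id
    {"query_id": 3, "query": "usb cable"},
]

texts_per_id = defaultdict(set)
ids_per_text = defaultdict(set)
for row in rows:
    texts_per_id[row["query_id"]].add(row["query"])
    ids_per_text[row["query"]].add(row["query_id"])

# Ids that map to more than one query text, and texts that map to more than one id.
ambiguous_ids = sorted(qid for qid, texts in texts_per_id.items() if len(texts) > 1)
ambiguous_texts = sorted(text for text, ids in ids_per_text.items() if len(ids) > 1)

print(ambiguous_ids)    # [1]
print(ambiguous_texts)  # ['red lipstick #5']
```

If both lists come back empty, query_id and query are in one-to-one correspondence and no repair is needed.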

Iteration 1: Including more fields in model training

As the next step, I included all the fields, after clean up, in the model training. The code looked like:

# cleaning title: lowercase, replace unwanted characters, collapse whitespace
df_p[col_title_clean] = (
    df_p[col_product_title]
    .str.lower()
    .str.replace(replace_pattern, " ", regex=True)
    .str.replace(r"\s{2,}", " ", regex=True)
)

# ...

# merging all fields under a single field for training;
# each field is wrapped in a marker such as <title> ... <title>
df_p[col_text_all] = df_p.apply(
    lambda row: '<id> ' + row[col_product_id] + ' <id>' +
                ' <title> ' + row[col_title_clean] + ' <title>' +
                ' <brand> ' + row[col_product_brand] + ' <brand>' +
                ' <color> ' + row[col_color_clean] + ' <color>' +
                ' <desc> ' + row[col_description_clean] + ' <desc>' +
                ' <bullet> ' + row[col_bullet_clean] + ' <bullet>', axis=1)

I was hoping this would give a significant boost to my result, but these changes produced the result below:

πŸ† Scores
NDCG Score : 0.8404

This was a ~0.01 drop from my previous result, where I was only including the title field. There was something wrong with my approach that I needed to debug.

Iteration 2: Unifying the Model

The baseline used different models and approaches depending on locale. In all, it had 3 models, one for each of the 3 locales - US, ES and JP. This made it hard to optimize, as every change had to be repeated per locale. I decided to test whether a single multilingual model with a uniform approach would give a comparable result, and then take steps to optimize from there.

I chose cross-encoder/mmarco-mMiniLMv2-L12-H384-v1 as it was the same kind of cross-encoder model, but with multilingual support. Luckily, it also had support for the ES and JP locales.

So I ditched the esci-baseline approach of 3 different models, trained all of my data on this single model, and got the result:

πŸ† Scores
NDCG Score : 0.8813

Awesome! Not only did I get a better result from this approach, but the boost was a significant ~0.03 over my previous best result, and ~0.04 over my approach of including more fields for training.
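The single-model setup described above could be sketched roughly as follows. This is not my exact training script. It assumes the sentence-transformers library (CrossEncoder, InputExample and fit, as I understand that API) and assumes each ESCI label is mapped to a numeric gain used as a regression target. The helper names, gain values, batch size and max_length are illustrative.

```python
# Assumed gain per ESCI label, used here as the regression target.
ESCI_GAIN = {"exact": 1.0, "substitute": 0.1, "complement": 0.01, "irrelevant": 0.0}

MODEL_NAME = "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"


def build_pairs(rows):
    """Turn (query, product_text, esci_label) rows from any locale into training pairs."""
    return [((query, text), ESCI_GAIN[label]) for query, text, label in rows]


def train_single_model(pairs, epochs=1, batch_size=32):
    """Fine-tune one multilingual cross-encoder on all locales together."""
    # Imported here so the data preparation above runs without these packages.
    from sentence_transformers import CrossEncoder, InputExample
    from torch.utils.data import DataLoader

    samples = [InputExample(texts=list(texts), label=gain) for texts, gain in pairs]
    loader = DataLoader(samples, shuffle=True, batch_size=batch_size)
    model = CrossEncoder(MODEL_NAME, num_labels=1, max_length=512)
    model.fit(train_dataloader=loader, epochs=epochs)
    return model


# Hypothetical rows from the three locales, all fed to the same model.
rows = [
    ("usb cable", "usb-c charging cable 1m", "exact"),
    ("cable usb", "cargador de pared", "complement"),
    ("usb ケーブル", "hdmi ケーブル", "substitute"),
]
pairs = build_pairs(rows)
print(pairs[0])
```

Calling train_single_model(pairs) would download the model and start fine-tuning, so it is left out of the runnable part. The point of the design is that there is no per-locale branching anywhere: every row goes through the same data preparation and the same model.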

Iteration 3: Fixing Training Data

During the EDA, I saw a pattern in the training data provided. There were rows where the query_id was the same but the query value differed, and rows where the query value was the same but the query_id differed. My hunch was that the data had been tampered with to create randomness.

I decided to fix this issue and see if the model performance improved. I grouped the rows by query_text, and then reset the query_id to match them. This was because the ESCI label matched more with query_text than with query_id.
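The regrouping step can be sketched as below. This is a minimal plain-Python illustration on made-up rows, assuming the query text lives in a column named query; the function name reassign_query_ids is hypothetical and the new ids are simply assigned in order of first appearance.

```python
def reassign_query_ids(rows):
    """Give every distinct query text exactly one id, in order of first appearance."""
    id_for_text = {}
    fixed = []
    for row in rows:
        text = row["query"]
        if text not in id_for_text:
            id_for_text[text] = len(id_for_text)
        # Copy the row so the original data is left untouched.
        fixed.append({**row, "query_id": id_for_text[text]})
    return fixed


# Toy rows (hypothetical values) showing both kinds of mismatch.
rows = [
    {"query_id": 1, "query": "usb cable", "esci_label": "exact"},
    {"query_id": 1, "query": "blue pen", "esci_label": "irrelevant"},
    {"query_id": 2, "query": "usb cable", "esci_label": "substitute"},
]
fixed = reassign_query_ids(rows)
print([row["query_id"] for row in fixed])  # [0, 1, 0]
```

After this step, all rows sharing a query text also share a query_id, so grouping by id and grouping by text give the same groups.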

I was hoping for some improvement in model performance, but the result remained the same:

πŸ† Scores
NDCG Score : 0.8813

No worries, on to the next iteration, keeping my fingers crossed for an improvement.

Iteration 4: Deeper Training

After exploring multiple avenues and not finding any promising leads for improving the performance, the last arrow in my quiver was to train the model deeper. Earlier I was training the model for a single epoch; I increased it to 3 to see if that improved the performance. The result I got was:

πŸ† Scores
NDCG Score : 0.8847

An improvement of ~0.003. This helped with ranking, but it also took three times as long to train and cost three times as much.

Submissions Pile-Up

As the hackathon approached the final date, there was a rush of submissions. This resulted in wait times of as much as 18 hours.

This was a bit demotivating in the last stretch. After a long wait for my submission result, I decided to stop as well, since anxiously waiting for the results of each change was very unproductive and spilled over into other tasks at hand.

Final Results

The evaluation phase had 2 sets of data: a public test set and a private test set. This is done so as to reward solutions that also perform well on unseen test data, rather than those that overfit the public test data.

On the public test, I was hovering around rank ~23. Once the competition closed, the solutions were run against the private test data and the results were shared. There my ranking dropped by 2, to a final ranking of 25.

From a quick glance at the top teams, most of them had some academic connection and were pursuing ML/NLP subjects at research level. Also, most of them were located in the US, China or Japan.

I was very happy to find that I was probably the topmost submission from India.

Conclusion

In the end, I was happy with the hard work I put into this hackathon and very proud of the results, as this was my first hackathon. I learned a lot about NLP, Word Embeddings, Large Language Models, Multilingual models, EDA and Data Cleaning. I was able to apply my learning to a very close-to-real-life business problem. I interacted with the organizers and participants, and made some interesting connections on Discord and LinkedIn.

I want to thank the AICrowd team for organizing such a massive event. They did their best given the challenge and demand of a task of this magnitude. I also want to thank the participants, who were ever so helpful whenever someone was stuck and needed help.

Next, I am looking forward to more interesting and engaging hackathons in the near future.



I was also taking a cohort-based course, Search with ML. I shared the news of completing that course, as well as of my rank in this hackathon, on LinkedIn, garnering ~2850 impressions and 30 likes.
