[ETS-Lab] Our solution

Thanks the AIcrowd team and the Amazon search team to organize this extensive competition. Finally, this game is ended. Our team learned a lot here and we believe this memorable period will help a lot in our future. Here we generally introduce our solution for this competition.

General solution

  • We trained 3 cross encoder models (DebertaV3, CocoLM, and Bigbird) for each language which differs in the pertained models, training method (e.g., knowledge distillation), and data splitting. In total, six identical models (2 folds x 3 models) for each language are used to produce the initial prediction (4 class probability) of the query-product pair. Use those models only, the public set score for task 2 is around 0.816.

  • For Task 1, we used the output 4 class probability with some simple features to train a lightgbm model, calculate the expected gain (P_e*1 + P_s*0.1 + P_c*0.01), and sort the query-product list by this gain. This is method is slightly better than using LambdaRank directly in LightGBM.

  • For task 2 and Task 3, we used lightgbm to fuse those predictions with some important features. Most important features are designed based on the potential data leakage from task 1 and the behavior of the query-product group:

    • The stats (min, medium, and max) of the cross encoder output probability grouped by query_id (0.007+ in Task 2 Public Leaderboard)
    • The percentage of product_id in Task 1 product list grouped by query_id (0.006+ in Task 2 Public Leaderboard)

Small modification towards Cross Encoder architecture

  • As the product context has multiple fields (title, brand, and so on), we use neither the cls token nor mean (max) pooling to get the latent vector of the query-product pair. Instead, we concatenate the hidden states of a predefined token (query, title, brand color, etc.). The format is:
    [CLS] [QUERY] <query_content> [SEP] [TITLE] <title_content> [SEP] [BRAND] <brand_content> [SEP] ...
    where [TEXT] is the special token and <text_content> is the text contents.

Code submission speed up

  1. Pre-process product token and save it as an HDF5 file.
  2. Transfer all models to ONNX with FP16 precision.
  3. Pre-sort the product id to reduce the side impact of batch zero padding.
  4. Use a relatively small mini-batch size when inference (batch size = 4).

You can find our training code here and code submission here.


Currently, I am seeking either a machine learning engineer or a research scientist job in the US. Collaborated with my friend Yang @Yang_Liu_CTH, I won some champions and runner-ups in many competitions including the champion of the KDD CUP 2020 reinforcement learning track. You can email me directly or go to my personal website for more details.

Dr. Wu, Fanyou
Postdoc @ Purdue University


Thanks a lot for your solution write-up!
I am very curious to see what these ‘magical’ features will give on the private set.
I will take a closer look at transforming the models into ONNX ones, it is because of the space and time restictions that I could only make predictions with 2 models.

1 Like

That task one feature will probably work on the private dataset as well that’s why use used it. If you check the product list in task two and task one. You will find a special pattern of the product order. In general, the product list is sorted as a training set, private set, and public set. Another reason why this feature will work is that the product-to-example ratio is close to 1 which means most products are used once.

There is another way to construct this leak feature that checks whether the query-product pair is in task 1 public dataset. This one will definitely fail in the private set as we cannot access that information.

Note that the evaluation service used V100 equipped with the tensor core. Transfer it to onnx fp16 help a lot for the speed. For example, our one unoptimized debertaV3-base model takes about 90 mins to do the inference with a single 2080Ti GPU locally but only 35-40 mins to do the inference online for 2 debertaV3 models (2 folds).