Basic Solution
All my models are based on the InfoXLM large model. I concatenated the training sets of Task1 and Task2 and removed duplicates to form a new training set. All three tasks then use the same model trained on this new training set. Finally, I used 8 models for Task1 and 4 models for Task2 and Task3.
The output of the model can be submitted to different tasks after different processing:
- Task1: order the products by \hat P_{exact} + \hat P_{substitute} * 0.1 + \hat P_{complement} * 0.01. The class weights are the gains of the four labels.
- Task2: take the label with the highest predicted probability as the prediction.
- Task3: predict "substitute" when \hat P_{substitute} is greater than 0.5.
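The three processing rules above can be sketched as follows. This is a minimal illustration, not the actual competition code: the label order (exact, substitute, complement, irrelevant), the function names, and the `(product_id, probs)` input format are all assumptions made for the example.

```python
# Assumed label order of the 4-class model output (hypothetical, for illustration).
LABELS = ("exact", "substitute", "complement", "irrelevant")
# Weights from the Task1 rule above; the irrelevant class contributes nothing.
GAINS = (1.0, 0.1, 0.01, 0.0)


def task1_score(probs):
    """Expected gain used to order products for one query."""
    return sum(g * p for g, p in zip(GAINS, probs))


def task1_rank(products):
    """products: list of (product_id, probs). Returns ids, best first."""
    ranked = sorted(products, key=lambda item: task1_score(item[1]), reverse=True)
    return [product_id for product_id, _ in ranked]


def task2_label(probs):
    """Label with the highest predicted probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best]


def task3_is_substitute(probs, threshold=0.5):
    """Binary decision on the substitute probability."""
    return probs[LABELS.index("substitute")] > threshold
```

The same probability vector feeds all three functions, which is why one trained model can serve every task.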
Keywords of Query
Queries are short texts, which makes their meaning hard to capture. Therefore, I treat the titles of all products corresponding to a query as one document and use TF-IDF to extract keywords from it. In the same way, I get keywords from product_bullet_point and product_description for each query. The extracted keywords serve as features of the query and are added to the input text.
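A minimal sketch of the keyword extraction, assuming a plain-Python TF-IDF with smoothed IDF and a simple word tokenizer. The function names, the `top_k` parameter, and the exact weighting are illustrative assumptions; the TF-IDF variant actually used is not specified here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; any tokenizer could be used)."""
    return re.findall(r"\w+", text.lower())


def query_keywords(titles_by_query, top_k=5):
    """titles_by_query: dict mapping a query to the titles of its products.

    All titles of one query are merged into a single document, then the
    top_k terms by TF-IDF are returned for each query.
    """
    docs = {
        query: Counter(tok for title in titles for tok in tokenize(title))
        for query, titles in titles_by_query.items()
    }
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    keywords = {}
    for query, counts in docs.items():
        total = sum(counts.values())
        scores = {
            tok: (c / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, c in counts.items()
        }
        # Sort by score (descending), breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        keywords[query] = [tok for tok, _ in ranked[:top_k]]
    return keywords
```

The same function could be applied to bullet points or descriptions by passing those texts instead of titles.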
In addition, I also add the brand and color names of all products with the same query to the model input. (Intuitively, if a word in the query represents a brand but we don’t clearly point it out, it will hurt the predictions for products that are not of this brand.)
This idea brings a large gain for the Task2 and Task3 models: I get an improvement of more than 0.01 from it on Task2. With this idea, my single-model Task2 score on the public leaderboard is 0.821 (without post-processing).
But I don’t get much gain on Task1. I think this is because Task1 focuses on ordering different products for the same query, so the features of the query matter less than the features of the products.
Self Distillation
I get the predicted probabilities for the whole training set through 10-fold cross-validation on the training data, take the mean of the predicted probability and the one-hot true label as a soft label, and then use this soft label for model training. For example, suppose the predicted probability of one sample is (0.4, 0.3, 0.2, 0.1) and the true label is 0; then the soft label is (0.7, 0.15, 0.1, 0.05).
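The soft-label construction can be sketched like this, assuming the out-of-fold probabilities from cross-validation are already available. The function names are hypothetical, and the cross-entropy against a soft target is one common way to train on such labels, shown here as an assumption rather than the exact loss used.

```python
import math


def soft_label(oof_probs, true_label):
    """Average the out-of-fold prediction with the one-hot true label."""
    return [
        (p + (1.0 if i == true_label else 0.0)) / 2
        for i, p in enumerate(oof_probs)
    ]


def soft_cross_entropy(pred_probs, target):
    """Cross-entropy of a prediction against a soft target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, pred_probs) if t > 0)


# The example from the text: prediction (0.4, 0.3, 0.2, 0.1), true label 0.
example = soft_label((0.4, 0.3, 0.2, 0.1), true_label=0)
# example is approximately (0.7, 0.15, 0.1, 0.05)
```

Because the soft label still sums to 1, it can replace the one-hot target directly in a standard classification loss.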
Such an approach can significantly enhance the robustness of the model and reduce the impact of noisy data. With this approach, my single-model Task2 score on the public leaderboard is 0.824 (without post-processing).
However, this approach reduces the benefit of model ensembling: using four models only improves the result to 0.826. If we use many models, this method does not seem to bring a significant gain.
Post Processing
In the last several days of the competition, I found that the threshold has a large impact on Task3, and further found that the Task2 score can also be significantly improved by increasing the probability of specific labels. After exploring, I think the labeled data may consist of two parts: one part is the Task1 data, and all of the data is used as Task2 and Task3 data. In this way, after the leak is removed from the Task2 test set, the distribution of the data set changes significantly, so we can improve the score through post-processing. After discovering this, I improved my Task2 score to 0.830 through simple post-processing rules.
Later, I replaced the hand-designed post-processing rules with a LightGBM model and added two features: the sample index (the data is not shuffled, which is a small leak) and whether the sample appeared in the Task1 public test set. This improved my score to 0.832.
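As an illustration only, one simple form that "increasing the probability of a label" could take is a per-class multiplier applied before the argmax. The function name and the multiplier values below are hypothetical; the rules actually used in the competition are not given in this write-up.

```python
def adjust_and_predict(probs, multipliers):
    """Scale each class probability, renormalize, and take the argmax.

    multipliers are hypothetical per-class factors; a value above 1
    makes the corresponding label more likely to be chosen.
    """
    adjusted = [p * m for p, m in zip(probs, multipliers)]
    total = sum(adjusted)
    adjusted = [a / total for a in adjusted]
    label = max(range(len(adjusted)), key=lambda i: adjusted[i])
    return label, adjusted


# A borderline sample flips from class 0 to class 1 once class 1 is boosted.
label_before, _ = adjust_and_predict((0.45, 0.40, 0.10, 0.05), (1.0, 1.0, 1.0, 1.0))
label_after, _ = adjust_and_predict((0.45, 0.40, 0.10, 0.05), (1.0, 1.3, 1.0, 1.0))
```

Such multipliers would normally be tuned on held-out data whose label distribution matches the test set.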
External data
I crawled the titles and comments of the products in English, Spanish, Japanese and Chinese, as well as the product images, from Amazon. But maybe because I used them in the wrong way, I only got a gain of 0.001 from the crawled titles. I may publish these data later. You are welcome to explore how comment data and image data can help improve search ranking.
Model acceleration
- PyTorch AMP (automatic mixed precision).
- I read 1024 samples from the dataloader at a time and sort them by the number of non-padding tokens. The 1024 samples are then split into 16 pieces. In this way, shorter texts take less prediction time.
- For model ensembling, not all models need to predict every sample. For example, suppose we have four models, and the mean predicted probability of the first three models is (0.7, 0.1, 0.1, 0.1). Then the fourth model does not need to predict this sample, because its prediction cannot change the final result. Even if the fourth model predicts (0.0, 1.0, 0.0, 0.0) for this sample, the mean predicted probability of the four models is still (0.525, 0.325, 0.075, 0.075), and the final prediction is still the first label. Based on this idea, we can skip many unnecessary predictions in the third and fourth models.
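The last two points can be sketched in plain Python. This is a minimal illustration under stated assumptions: every model is a hypothetical callable that maps a list of samples to a list of probability vectors, all models have equal weight, and a sample is skipped only when the remaining models cannot change the argmax even in the worst case.

```python
def length_sorted_chunks(token_lengths, n_chunks=16):
    """Sort sample indices by token count and cut them into n_chunks pieces."""
    order = sorted(range(len(token_lengths)), key=lambda i: token_lengths[i])
    size = max(1, -(-len(order) // n_chunks))  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]


def can_skip_remaining(prob_sum, n_remaining):
    """True if the top class stays on top whatever the remaining models say.

    Each remaining model can add at most 1.0 to the runner-up and 0.0
    to the current leader.
    """
    ranked = sorted(prob_sum, reverse=True)
    return ranked[0] > ranked[1] + n_remaining


def ensemble_predict(models, samples):
    """Average-probability ensemble that skips already decided samples.

    Returns the predicted labels and the number of single-sample
    model calls that were actually made.
    """
    sums = [None] * len(samples)
    active = list(range(len(samples)))  # indices that are still undecided
    calls = 0
    for k, model in enumerate(models):
        if not active:
            break
        probs = model([samples[i] for i in active])
        calls += len(active)
        remaining = len(models) - k - 1
        still_active = []
        for i, p in zip(active, probs):
            if sums[i] is None:
                sums[i] = list(p)
            else:
                sums[i] = [a + b for a, b in zip(sums[i], p)]
            if remaining and not can_skip_remaining(sums[i], remaining):
                still_active.append(i)
        active = still_active
    labels = [max(range(len(s)), key=lambda c: s[c]) for s in sums]
    return labels, calls
```

Under this strict worst-case bound, a sample can only be skipped after more than half of the models have predicted it; skipping earlier would need a looser, heuristic margin, which trades a little accuracy for speed.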