I’ve seen several posts suggesting changes to the purchase-phase training pipeline. This post clarifies some details about the end-of-competition evaluations and explains some of their implications.
First of all, thanks a lot to everyone who has provided feedback on the training pipeline. It has been very valuable for the challenge design.
TL;DR - Please submit your best purchase strategies; you’ll need to select them for the end-of-competition evaluations, which will run 5 post-purchase training pipelines.
Many have pointed out that the training pipeline does not produce good scores. I’ll break these concerns down into a few categories.
Concerns about the best purchases not being incentivised correctly
- Model is too weak and cannot learn hard examples - This is a legitimate concern, though I do not currently know how widespread it is. We’re investigating.
- Not using GaussianBlur in testing changes the optimal labels - Currently I have not found any strong evidence of this, but would love to discuss further.
- Model does not converge due to the low epoch count, hence scores are too stochastic - I did not find the stochasticity in scores to be excessive; it seems within expected practical limits. However, I’ll address this further later in this post.
These are the most important concerns we’re looking to resolve. Some of them are also difficult to measure, so the discussion becomes qualitative and opinionated. We want to resolve them in the most quantitative and fair way possible.
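As a quantitative check of the stochasticity concern, the spread of scores across seeds can be measured directly. A minimal sketch, with made-up illustrative numbers (not real pipeline results):

```python
# Illustrative only: these scores are invented numbers, not actual
# pipeline results from the challenge.
from statistics import mean, stdev

seed_scores = [0.712, 0.705, 0.719, 0.708, 0.715]  # one pipeline, five seeds

# A small standard deviation relative to the gaps between leaderboard
# entries would suggest the stochasticity is within practical limits.
print(f"mean={mean(seed_scores):.3f}, stdev={stdev(seed_scores):.3f}")
```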
Concerns that the score is too low
- The feature layers are frozen
- The model is too small and plateaus
- GaussianBlur is not used in test
For these, I’d like to reiterate that the absolute score is not important; rather, maximising the score through the best purchase strategy is the goal of the competition.
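To make the GaussianBlur asymmetry concrete, here is a minimal sketch of a train/test transform split in the style of torchvision. The kernel size and sigma range are illustrative assumptions, not the challenge’s actual parameters:

```python
# Illustrative only: kernel_size and sigma below are assumptions,
# not the official pipeline's values.
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # blur in training
    transforms.ToTensor(),
])

test_tf = transforms.Compose([
    # No GaussianBlur at test time: this is the train/test asymmetry
    # participants have raised.
    transforms.ToTensor(),
])
```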
End of competition Evaluations
In addition to changing the dataset for the end-of-competition evaluations, we will run selected submissions through multiple training pipelines in the post-purchase phase.
The detailed steps are given below:
- Eligible teams will select two of their submissions to evaluate - Eligibility criteria will be announced soon and will be based on the Round 2 leaderboard.
- Each submission will run through the pre-train and the purchase phase on the end of competition dataset.
- The same purchased labels will be put through 5 training pipelines - Details to be released soon.
- Each training pipeline will be run for 2 seeds and scores averaged, to address any stochasticity in scores.
- To avoid issues due to differences in average scores across the training pipelines, a Borda ranking system will be used.
We hope this will incentivize participants to select the best purchase strategies rather than optimize for the current training pipeline. We’re unable to run this setup during the live round due to the prohibitive compute cost of each run.
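For concreteness, a Borda-style aggregation over per-pipeline rankings can be sketched as follows. This is a minimal illustration; the function name, point scheme, and scores are my assumptions, not the official implementation:

```python
# Hypothetical sketch of Borda aggregation: names and numbers are
# illustrative, not the challenge's actual scoring code.

def borda_points(scores_by_pipeline):
    """scores_by_pipeline: list of {team: seed-averaged score}, one dict
    per training pipeline. Returns total Borda points per team (higher
    is better); ranking within each pipeline removes the effect of
    pipelines having different average score levels."""
    points = {}
    for scores in scores_by_pipeline:
        # Rank teams within this pipeline; the best team in a pipeline
        # of n teams earns n-1 points, the worst earns 0.
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for rank, team in enumerate(ranked):
            points[team] = points.get(team, 0) + (n - 1 - rank)
    return points

pipelines = [
    {"A": 0.91, "B": 0.88, "C": 0.85},  # pipeline 1 (seed-averaged)
    {"A": 0.70, "B": 0.74, "C": 0.69},  # pipeline 2 (seed-averaged)
]
print(borda_points(pipelines))  # A: 2+1=3, B: 1+2=3, C: 0+0=0
```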
Training pipeline survey
Please vote here for your favoured schemes. Note that this is not a vote to select the training pipelines, just a survey of participants’ preferences, but we’ll take the results very seriously.
- Unfreeze feature layers in base model
- Use Gaussian Blur During Test
- Train for more epochs
- Use a bigger model - EfficientNet-B7
- Train more epochs + Unfreeze feature layers in base model
- Train more epochs + Use Gaussian Blur During Test
- Remove Gaussian Blur from training
- Unfreeze feature layers in base model + Remove Gaussian Blur from training
Please feel free to suggest any other changes if I have missed them.