The new round now includes 6 classes instead of 4. Two new classes, stray_particle and discoloration, have been added.
There are a few updates to the training dataset. The new dataset for Round 2 has 1,000 images in the training set, 10,000 images in the unlabelled set, and 3,000 images in the test set. Read details on these updates over here.
What are some key changes to the evaluation and metrics?
Evaluation is decoupled from the Pre-Training Phase. In the new round, the purchased labels are evaluated independently of the models trained by participants in the pre-training phase. This removes the focus from the pre-training phase and shifts attention to the purchasing policies. You can find the complete flow over here.
Multiple Purchasing Budgets & Compute Budgets. For the new round, your submissions have to perform well across multiple Purchasing Budget & Compute Budget pairs; submissions will be evaluated on 5 purchasing-compute budget pairs. Read more details about the purchasing and compute budgets over here.
New primary metric. The new round will use the macro-weighted F1 Score.
Round 2 will run from March 3rd to April 7th, 2022.
A $20,000 prize pool is up for grabs.
The top 9 community contributors to the challenge will also receive very exciting gadgets!
@snehananavati: Many thanks for your update. As far as I understood, we don't actually need to provide code for the training phase and the prediction phase. The system will use the same training pipeline - B4 with 10 epochs - and then generate the predictions and scores by itself.
In other words, we only need to focus on the purchase phase. Do I understand correctly?
@moto: You will still have to train your models from scratch, then use your trained models to make your purchasing decisions. We then take the labels that you purchased and use them along with the training set in our training pipeline to compute the final scores.
The training pipeline we introduce is just an elaborate evaluation function to “assess the quality of the purchased labels”.
Regarding the prediction_phase interface, you will still have to submit it as before, since the evaluators run a series of integration tests to ensure that the prediction_phase interface works as expected. While we do not use this function in the current evaluation pipeline, it will allow us to do a more elaborate analysis of the submissions at a later point.
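For concreteness, here is a rough sketch of what the participant-side purchase phase looks like in this flow. Only unlabelled_dataset.purchase_label() and the dict return value come from the starter kit; the other names and signatures are placeholders for illustration:

```python
# Rough sketch of the participant-side flow described above. Only
# `purchase_label` and the dict return value come from the starter kit;
# the other names here are assumptions for illustration.

def purchase_phase(trained_model, unlabelled_dataset, purchase_budget):
    # Use the model trained in the pre-training phase to decide
    # which labels are worth buying.
    scores = score_informativeness(trained_model, unlabelled_dataset)  # hypothetical helper
    ranked = sorted(range(len(unlabelled_dataset)), key=lambda i: scores[i], reverse=True)

    purchased = {}
    for idx in ranked[:purchase_budget]:
        # purchase_label() is the starter-kit call that spends one unit of budget
        purchased[idx] = unlabelled_dataset.purchase_label(idx)
    return purchased  # dict: unlabelled index -> purchased label
```

The purchased labels are then plugged into the fixed evaluation-side training pipeline together with the training set to produce the score.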
Interesting rule changes! I like that you don't have to worry about tuning the model, data augmentation, self-supervised learning, and many other methods.
I noticed that you train only for 10 epochs with a fixed learning rate (yes, there's reduce-LR-on-plateau, but it will not kick in). So you have a kind of early stopping, i.e. you don't train the model to its full potential.
I wonder if this could lead participants to optimize for getting good results fast, which may not be what is intended. I.e. the labels you need to add to get a good score after 10 epochs could be quite different from the ones you would need if you trained the model to convergence.
In "Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search" the authors show evidence that when stopping training early and evaluating the model on validation data, the relative performance ranking may not correlate well with the performance ranking after full training.
Whether this applies here, I'm not sure. Just to be safe, I'd propose improving the training: train for longer, say 30-40 epochs, with a decreasing learning rate (e.g. cosine-shaped), and don't use reduce-LR-on-plateau, because it's a bit unpredictable.
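As a concrete illustration of that suggestion, a minimal PyTorch sketch (the model, optimizer and epoch count here are placeholders, not the evaluator's actual settings):

```python
import torch

# Placeholder model/optimizer just to show the schedule; not the real pipeline.
model = torch.nn.Linear(128, 6)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

epochs = 40
# Cosine-shaped decay from the base LR towards zero over the whole run,
# instead of ReduceLROnPlateau, which may or may not trigger.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one training epoch over the aggregated dataset goes here ...
    optimizer.step()   # stand-in for the real optimization steps
    scheduler.step()
```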
It should be aggregated_dataset instead (rather than training_dataset in the post-purchase training), otherwise none of the purchased labels have any effect! This bug may also be present in your server-side evaluation scripts.
Another thing:
In run.purchase_phase a dict is returned. Should it be a dict? And is it allowed to fill in labels for indices you didn't purchase, for example with pseudo-labelling?
In instantiate_purchased_dataset the type hint says it's supposed to be a set, which is inconsistent and also wouldn't work. In theory it would even be possible to return some other type from purchase_phase that has the dict interface, i.e. supports .keys() but allows repeated keys. That would be a hack to grow the dataset to as many images as you want, which is surely an unwanted exploit. I suggest converting whatever is returned by purchase_phase to a dict and, depending on whether pseudo-labelling is allowed, validating it further, e.g. as in the sketch below.
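Something along these lines on the evaluator side would close both holes (just a sketch; the function and argument names are mine, not from the starter kit):

```python
def validate_purchased_labels(returned_value, purchased_indices, allow_pseudo_labels=False):
    """Sketch of the validation suggested above. `purchased_indices` is assumed to be
    the set of indices actually passed to unlabelled_dataset.purchase_label()."""
    # Collapse whatever dict-like object was returned (including one faking
    # repeated keys) into a real dict, so each index appears at most once.
    purchased = dict(returned_value)

    if not allow_pseudo_labels:
        # Reject labels for indices that were never actually purchased.
        extra = set(purchased) - set(purchased_indices)
        if extra:
            raise ValueError(f"labels returned for non-purchased indices: {sorted(extra)[:10]}")

    return purchased
```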
It would also be good to test whether your training pipeline can actually achieve good scores under ideal conditions (say, when buying all labels).
A gap between the scores when running the training pipeline on the full dataset and on a limited budget with random purchases is indeed one of the important metrics we took into consideration while designing the latest dataset. One important goal of the rule changes is to incentivize participants to focus on improving their purchasing strategies and finding the best labels given the budget.
That said, the starter kit local evaluation does have a few critical bugs which are not there in the current evaluation setup.
training_dataset in the post-purchase training should be aggregated_dataset, as you rightly pointed out.
Batch size is set to 64 in the evaluator instead of the 5 used locally. This is probably the primary reason for the bad scores in local evaluation.
The value returned by the purchase phase is not used for the post-purchase training. Instead, the indices that were passed to unlabelled_dataset.purchase_label(index) are taken and combined with the pre-training dataset, as sketched below.
We will fix these bugs in the starter kit and make an update.
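For anyone patching their local copy in the meantime, the aggregation step amounts to something like the following (a sketch using torch utilities; the actual starter-kit code may differ slightly):

```python
from torch.utils.data import ConcatDataset, Subset

# purchased_indices: whatever indices were passed to
# unlabelled_dataset.purchase_label() during the purchase phase.
purchased_subset = Subset(unlabelled_dataset, sorted(purchased_indices))

# Post-purchase training should run on this combined dataset
# (the `aggregated_dataset` mentioned above), not on `training_dataset` alone.
aggregated_dataset = ConcatDataset([training_dataset, purchased_subset])
```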
Coming to the other points you mentioned:
Feature layers being frozen - This is an unintentional mistake on my part. However, the metric of having a gap in scores between a limited budget with random purchases and the full unlabelled dataset holds true even with the layers frozen. Indeed, while the absolute scores improve with unfrozen layers, the gap remains similar or even shrinks slightly. We may still consider unfreezing the layers and will inform you accordingly if anything changes here.
Reduce LR on Plateau - I agree with your sentiment that it is slightly unpredictable. With the numbers currently set, it may trigger once during the run, though in most cases it doesn't trigger at all. The intention was to provide a level playing field to all participants so they can focus on finding the best data to purchase. Switching to a different learning-rate schedule is not a priority at the moment, but we may add it if other major changes are made to the post-training pipeline.
Only 10 epochs - Yes, the models will definitely not fully converge with this number, but it strikes a practical balance between a reasonable amount of compute per submission and consistent scores (especially with the batch size of 64).
P.S. - An idea I'm considering is to switch down to an EfficientNet-B0 and unfreeze all the layers. Let me know your thoughts on this.
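Roughly, that change would look like this (a sketch using torchvision; the actual pipeline may load the backbone differently):

```python
import torch
import torchvision

# Sketch of the idea above: swap EfficientNet-B4 for B0 and train all layers.
model = torchvision.models.efficientnet_b0(weights="IMAGENET1K_V1")

# Replace the classification head for the 6 classes in Round 2.
model.classifier[1] = torch.nn.Linear(model.classifier[1].in_features, 6)

# Train the full network instead of only the head.
for param in model.parameters():
    param.requires_grad = True
```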
EfficientNet-B0 tends to learn faster than the other members of its family. Since the dataset is smaller, with a batch size of 64, unfreezing all layers should work better for B0 than the same configuration with B4, and the quality of the purchases depends on it. I think you could run a small experiment with the best unlabelled images (batch size 64, 0.x LR, y epochs) and see whether B0 outperforms B4.
In any case, I don't think it matters much. If all evaluations run on the same configuration, then the performance will be equally good or bad for all participants. However, with the above experiment, I guess both you and we would be able to see whether the purchases make any sense.
In the first round I hit a wall with EfficientNet-B1 but didn't with B4, i.e. using active learning I got an improvement with B4 but not with B1. This is not a totally conclusive argument, but it is some evidence. However, with frozen layers and only 10 epochs at a fixed learning rate, it's a different situation.
A big issue I see is that the variance of the final scores seems too high and too dependent on random seeds.
For example, with a modified starter kit (batch size = 64, aggregated_dataset used), a purchase budget of 500 that always buys the first 500 images, and different seeds, I measured these F1 scores:
[0.23507449716686799, 0.17841491812405716, 0.19040294167615202, 0.17191250777735645, 0.16459303242037562]
mean: 0.188
std: 0.025
In the first round the improvements I observed with active learning were between 0.7% and 1.5%. If results now fluctuate by up to 7% based on the random seed alone, that is pretty bad. I think the winner should not be decided by luck or by their skill at fighting random number generators.
You do perform multiple runs, but even then it's still not great, I guess. It would be better to bring down the variance of individual runs as much as possible.
I guess some experiments should be run to see what improves this. Training for longer, averaging more runs, using weight averaging, not freezing layers, using EfficientNet-B1 or B0, different learning-rate schedules, or dropout would all be worth experimenting with.
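For any of those experiments, I'd at least pin all the seeds per run so the remaining spread reflects the training setup rather than uncontrolled RNG state (sketch; train_and_score is a placeholder for one full training + evaluation run):

```python
import random
import numpy as np
import torch

def run_with_seed(seed):
    # Pin every common source of randomness for one run.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    return train_and_score()  # placeholder: one training + evaluation run

scores = [run_with_seed(s) for s in range(5)]
print(f"mean={np.mean(scores):.3f}  std={np.std(scores):.3f}")
```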
Currently, I do not see this much spread in scores when testing on the private set used for the leaderboard. Nevertheless, thanks for pointing this out; I will check further on the score spread across the budgets used.
Also note that the final end-of-competition evaluations will not use the dataset currently used for the leaderboard, but a different dataset sampled from the same distribution. The end-of-competition evaluations will also be more exhaustive, with many more models. Hence, overfitting the leaderboard is likely to hurt participants when the final evaluations take place. We'll communicate this more clearly in case it isn't properly explained.
Hi @dipam, have you already considered changing the post-training code as mentioned in the comments above?
The small number of epochs in particular seems problematic to me.
You can easily check that the trained model is still underfitting the dataset by changing the number of epochs from 10 to 20 and seeing how much the score improves.
That means the model has almost no need for more data, since it is still "learning" from the data it already has, so it might not be a good model for evaluating the quality of purchased data.
In a real situation, I guess the host would never use such an underfitted model to evaluate purchased data; that's why I think it would be better to change the post-training code, or to allow participants to change it, too.
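Another quick sanity check (a sketch, assuming a single-label classification head and placeholder model/loader names) is to score the model on the training set itself; a model that is still underfitting will be far from perfect even on the data it was trained on:

```python
import torch
from sklearn.metrics import f1_score

@torch.no_grad()
def training_set_f1(model, train_loader, device="cpu"):
    # F1 on the *training* data itself: if this is well below 1.0,
    # the model is still underfitting and more epochs would help.
    model.eval()
    preds, labels = [], []
    for images, targets in train_loader:
        logits = model(images.to(device))
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        labels.extend(targets.tolist())
    return f1_score(labels, preds, average="macro")
```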
Thanks, I hope this makes the competition more interesting and useful!
Any comments are welcome ;)
I understand your concern about underfitting. However, this challenge is a bit non-traditional: the data is completely synthetic, and the purpose is purely research into the methods. The final trained model is not of importance to any real-world setting; only the algorithms you develop are.
Also, the purpose of the training pipeline is to give a level playing field to all participants so that they can focus on their purchase strategies. Whether the final loss value is reached is not important, as long as better purchased data produces better scores under the same training pipeline, which is what we've tried to set up with Round 2.
Your goal is to improve the score by purchasing better labels. The score may become limited by the training setup once the best labels are purchased, but in my opinion that is not yet the case.
@dipam Thanks for the comment!
I understand that Round 2 tries to make us focus more on the purchase strategies.
My concern is not about how good the final F1 score is, but about what the "best" additional data actually means.
In general, increasing the dataset size when your model is underfitting is known to be a poor strategy.
The same applies here: a strategy for choosing "good additional data for an underfitted model" is less practically meaningful than one for an overfitted model.
The easiest way to fix this issue is to simply change the training pipeline so that the trained model can overfit the 1,000-image training set.
I believe it would make the competition more useful and let everyone learn more interesting strategies.
Thanks for the clarification and the useful explanation. I'll consider this and try some experiments. One issue with designing the challenge is that it needs to balance a good training pipeline, a score gap for better data, and compute constraints, all while iterating on our synthetic data to satisfy these constraints. We'll try to improve the training pipeline accordingly if we can meet them in a reasonable way.
I agree. There's now an incentive to buy not the most useful images, but the images that can be learned and improve the model within the first few epochs, which would probably rule out "difficult" images. It's quite likely that this has little practical relevance. While that's OK for the competition's sake, it would still be good if the results here had some practical relevance.
While I would appreciate the training pipeline being made more realistic, I hope this won't be a change implemented a week before the deadline that forces us to make big changes.
Thanks, I totally understand the situation. I can imagine it's much harder to host a competition than to just join as a competitor :)
Anyway, whether the modification is made or not, I'll try to do my best.
Hi, it seems there's a bug in local_evaluation.py.
I think you should change
time_available = COMPUTE_BUDGET - (time_started - time.time())
→
time_available = COMPUTE_BUDGET - (time.time() - time_started)
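With the operands swapped, the elapsed time comes out negative, so the buggy line effectively adds the elapsed time to the budget instead of subtracting it. A minimal sketch of the corrected check (the COMPUTE_BUDGET value here is illustrative, not the real budget):

```python
import time

COMPUTE_BUDGET = 3600  # illustrative value in seconds
time_started = time.time()

# ... purchase phase runs here ...

elapsed = time.time() - time_started        # non-negative elapsed wall time
time_available = COMPUTE_BUDGET - elapsed   # shrinks as compute time is spent
```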