@moto : No, unfortunately you still cannot check in your pre-trained models. The rules around pre-trained models stay the same as in Round 1.
Interesting rule changes! I like that you don’t have to worry about tuning the model, data augmentation, self-supervised learning, and many other methods.
I noticed that you train only for 10 epochs with a fixed learning rate (yes, there’s ReduceLROnPlateau, but it will not kick in). So you have some kind of early stopping, i.e. you don’t train the model to its full potential.
I wonder if this could lead participants to optimize for getting good results fast, which may not be what is intended. I.e. the labels you need to add to get a good score after 10 epochs may be quite different from the ones you would need if you trained the model to convergence.
In “Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search”, the authors show evidence that when training is stopped early and the model is evaluated on validation data, the relative performance ranking may not correlate well with the ranking after full training.
Whether this applies here, I’m not sure. Just to be safe, I’d propose improving the training: train for longer, say 30-40 epochs, with a decreasing learning rate (e.g. cosine-shaped), and don’t use ReduceLROnPlateau, because it’s a bit unpredictable.
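To illustrate, here is a minimal, dependency-free sketch of the cosine-shaped decay I mean (in PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR`; the function name and the 40-epoch count are just illustrative):

```python
import math

def cosine_lr(epoch, total_epochs, base_lr, min_lr=0.0):
    """Learning rate for a given 0-indexed epoch, decayed along a cosine curve."""
    progress = epoch / max(total_epochs - 1, 1)  # 0.0 at the start, 1.0 at the last epoch
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Over e.g. 40 epochs the rate falls smoothly from base_lr to min_lr,
# instead of jumping unpredictably the way ReduceLROnPlateau can.
schedule = [cosine_lr(e, 40, 1e-3) for e in range(40)]
```

The point is that every participant gets the exact same, fully deterministic schedule, with no plateau-detection heuristics involved.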
I also noticed that the feature layers are frozen during training of the efnet-b4 model. Is that intentional? It seems like this will guarantee low scores.
Bug: in local_evaluation.py, in the post-purchase training phase:
`trainer.train( training_dataset, num_epochs=10, validation_percentage=0.1, batch_size=5 )`
It should be `aggregated_dataset` instead; otherwise none of the purchased labels have any effect! This bug may also be present in your server-side evaluation scripts.
In run.purchase_phase a dict is returned. Should it be a dict? And is it allowed to fill in labels for indices you didn’t purchase, for example with pseudo-labeling?
In instantiate_purchased_dataset the type hint says it’s supposed to be a set, which is inconsistent and also wouldn’t work. In theory it would even be possible to return some other type from purchase_phase that has the dict interface, i.e. supports .keys() but allows repeated keys. This would be a hack to grow the dataset to as many images as you want, which is surely an unwanted exploit. I suggest you convert whatever is returned by purchase_phase to a dict and, depending on whether pseudo-labeling is allowed, validate it further.
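A sketch of the kind of server-side validation I mean (the helper and its argument names are hypothetical, not from the starter kit):

```python
def validate_purchases(returned, purchased_indices, budget, allow_pseudo_labels=False):
    """Coerce purchase_phase's return value into a plain dict and sanity-check it."""
    labels = dict(returned)  # collapses any duck-typed mapping; keys become unique
    if not allow_pseudo_labels:
        # Without pseudo-labelling, every label must come from a real purchase.
        extra = set(labels) - set(purchased_indices)
        if extra:
            raise ValueError(f"labels returned for unpurchased indices: {sorted(extra)}")
        if len(labels) > budget:
            raise ValueError(f"{len(labels)} labels returned, but the budget is {budget}")
    return labels
```

The `dict(...)` conversion alone already defeats the repeated-keys trick, since duplicate keys collapse into one entry.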
It would be good if you tested whether your training pipeline can actually achieve good scores under ideal conditions (say, with all labels bought).
A gap between the scores when using the training pipeline on the full dataset versus a limited budget with random purchases is indeed one of the important metrics we took into consideration while designing the latest dataset. One important goal of the rule changes is to incentivize participants to focus on improving their purchasing strategies and finding the best labels given the budget.
That said, the starter kit local evaluation does have a few critical bugs which are not there in the current evaluation setup.
- `training_dataset` in the post-purchase phase should be `aggregated_dataset`, as you rightly pointed out.
- Batch size is set to 64 instead of 5 in the evaluator. This is probably the primary reason for the bad scores in the local evaluation.
- The returned array of the purchase phase is not used for the post-purchase training. Instead, the indexes that were used when calling `unlabelled_dataset.purchase_label(index)` are taken and combined with the pre-training dataset.
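In other words, the evaluator effectively does something like this (the class and attribute names here are illustrative, not the actual evaluator code):

```python
class UnlabelledDataset:
    """Toy stand-in showing how purchases are recorded server-side."""

    def __init__(self, hidden_labels):
        self._hidden_labels = hidden_labels  # ground truth the evaluator holds
        self.purchased = {}                  # index -> label, filled by purchase calls

    def purchase_label(self, index):
        # The side effect here, not purchase_phase's return value,
        # is what gets combined with the pre-training dataset.
        label = self._hidden_labels[index]
        self.purchased[index] = label
        return label

dataset = UnlabelledDataset({0: "cat", 1: "dog", 2: "bird"})
dataset.purchase_label(1)
dataset.purchase_label(2)
# dataset.purchased now holds {1: "dog", 2: "bird"}, regardless of what
# purchase_phase returns afterwards.
```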
We will fix these bugs in the starter kit and make an update.
Coming to the other points you mentioned
- Feature layers being frozen - This is an unintentional mistake on my part. However, the metric of having a gap in scores between a limited budget with random purchases and the full unlabelled dataset holds true even with the layers frozen. Indeed, while the absolute scores improve with unfrozen layers, the gap remains similar or even shrinks slightly. We may still consider unfreezing the layers, and we’ll inform you accordingly if anything changes here.
- Reduce LR on Plateau - I agree with your sentiment that it is slightly unpredictable. The numbers currently set mean it may trigger once during the run, though in most cases it doesn’t trigger at all. The intention was to provide a level playing field to all participants so they focus on finding the best data to purchase. Switching to a different learning rate schedule is not a priority at the moment, but we may add it if other major changes are made to the post-purchase training pipeline.
- Only 10 epochs - Yes, the models definitely will not fully converge with this number, but it strikes a practical balance between having a reasonable amount of compute per submission and giving consistent scores (especially with the batch size of 64).
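On the frozen feature layers: “freezing” here means disabling gradients on the backbone, in PyTorch roughly `for p in model.features.parameters(): p.requires_grad = False`. A dependency-free sketch of the effect, with made-up parameter names mimicking an EfficientNet split into a `features` backbone and a `classifier` head:

```python
# Hypothetical parameter registry, standing in for model.named_parameters().
params = {
    "features.conv_stem.weight": {"trainable": True},
    "features.block1.conv.weight": {"trainable": True},
    "classifier.fc.weight": {"trainable": True},
}

def freeze(params, prefix):
    """Mark every parameter under `prefix` as frozen (requires_grad=False in PyTorch)."""
    for name, state in params.items():
        if name.startswith(prefix):
            state["trainable"] = False

freeze(params, "features")  # what the starter kit effectively does
trainable = [name for name, state in params.items() if state["trainable"]]
# Only the classifier head is left to adapt to the purchased data.
```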
P.S. - An idea I’m considering is to switch down to an EfficientNet-b0 and unfreeze all the layers. Let me know your thoughts on this.
EfficientNet-b0 tends to learn faster than the other members of its family. Since the dataset is smaller, with 64 as batch size, unfreezing all layers would suit b0 better than the same config on b4, and the quality of purchases depends on it. I think you could run a small experiment with the best unlabelled images (batch size 64, 0.x LR, y epochs) and see if b0 outperforms b4.
In any case, I don’t think it matters much. If all the evaluations run on the same configuration, then performance will be equally good or bad for all participants. However, with the above experiment, I guess both you and we would be able to see whether the purchases make any sense.
In the first round I hit a wall with efnet-b1 but not with efnet-b4. I.e. using active learning I got an improvement with b4, but not with b1. This is not a totally conclusive argument, but it is some evidence. However, with frozen layers and only 10 epochs at a fixed learning rate, it’s a different situation.
A big issue I see is that the variance of the final scores seems too high and too dependent on random seeds.
For example, with a modified starter kit (batch size = 64, aggregated_dataset used) and a purchase budget of 500, always buying the first 500 images and varying only the seed, I measured these F1 scores:
[0.23507449716686799, 0.17841491812405716, 0.19040294167615202, 0.17191250777735645, 0.16459303242037562]
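For reference, those five runs can be summarized directly:

```python
import statistics

scores = [0.23507449716686799, 0.17841491812405716, 0.19040294167615202,
          0.17191250777735645, 0.16459303242037562]

mean = statistics.mean(scores)        # ~0.188
stdev = statistics.stdev(scores)      # ~0.028 (sample standard deviation)
spread = max(scores) - min(scores)    # ~0.070, i.e. a ~7 point swing between best and worst seed
```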
In the first round the improvements I observed with active learning were between 0.7% and 1.5%. If results now fluctuate by up to 7% based on the random seed alone, that’s pretty bad. I think the winner should not be decided by luck, or by their skill at fighting random number generators.
You do run multiple runs, but even then it’s still not great, I guess. It would be better to bring the variance of individual runs down as much as possible.
I guess some experiments should be run to see what improves this. Training for longer, averaging more runs, weight averaging, not freezing layers, using efnet-b1 or b0, different learning rate schedules, or dropout would be some of the parameters worth experimenting with.
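To illustrate why averaging more runs helps, here is a toy sketch. The `fake_evaluate` function is entirely made up (it just emits a noisy score around 0.19, similar to the runs above); the point is only the variance-reduction effect of averaging:

```python
import random
import statistics

def seed_averaged_score(evaluate, seeds):
    """Average a (hypothetical) evaluate(seed) -> F1 function over several seeds."""
    return statistics.mean(evaluate(seed) for seed in seeds)

def fake_evaluate(seed):
    # Stand-in for one full train+eval run with the given seed.
    rng = random.Random(seed)
    return 0.19 + rng.uniform(-0.035, 0.035)

single = fake_evaluate(0)                              # one run: anywhere in ~0.155..0.225
averaged = seed_averaged_score(fake_evaluate, range(25))  # tightly clustered around 0.19
# Averaging over n independent runs shrinks the standard error by roughly sqrt(n).
```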
Here’s a paper I just googled (haven’t read it yet) about this issue:
Accounting for Variance in Machine Learning Benchmarks
And another one:
Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision
Currently, I do not see this much spread in scores when testing on the private set which is used for leaderboard. Nevertheless, thanks for pointing this out, will check further on the score spread across the budgets used.
Also note that the final end-of-competition evaluations will not use the dataset currently used for the leaderboard, but a different dataset sampled from the same distribution. The end-of-competition evaluations will also be more exhaustive, with many more models. Hence, overfitting the leaderboard is likely to hurt participants when the final evaluations take place. We’ll communicate this more clearly in case it isn’t properly explained.
Hi @dipam, have you already considered changing the post-training code as mentioned in the comment?
The small number of epochs especially seems problematic to me.
You can easily check that the trained model is still underfitting the dataset by changing the number of epochs from 10 to 20 and seeing how the score improves.
That means there is almost no need to feed the model more data, since it is still “learning” from the data it already has, so it might not be a good model for evaluating the quality of purchased data.
In a real situation, I guess the host would never use such an underfitted model to evaluate purchased data; that’s why I think it’s better to change the post-training code, or to allow participants to change it.
Thanks, I hope this competition becomes an even more interesting and useful one!
Any comments are welcome ;)
I understand your concern about underfitting. However, this challenge is a bit non-traditional: the data is completely synthetic, and the purpose is solely research into the methods used. The final trained model is not of importance to any real-world setting, only the algorithms you develop.
Also, the purpose of the training pipeline is to give a level playing field to all participants so that they focus on purchase strategies. Whether the final loss converges is not important, as long as better purchased data produces better scores using the same training pipeline, which is what we’ve tried to set up in Round 2.
Your goal is to improve the score by purchasing better labels. The score may become limited by the training setup once the best labels are purchased, but in my opinion that is not yet the case.
@dipam Thanks for the comment!
I understand that round 2 tries to make us more focused on the purchase strategies.
My concern is not about how good the final F1 score is, but about what the “best” additional data means.
In general, increasing dataset size while your model is still underfitting is commonly a bad strategy.
The same can be said here: a strategy for choosing “good additional data for an underfitted model” is less practically meaningful than one for an overfitted model.
The easiest way to fix this issue is to change the training pipeline so that the trained model overfits the 1,000-image training dataset.
I believe this would make the competition more useful, and everyone could learn more interesting strategies.
Thanks for the clarification and the useful explanation. I’ll consider this and try some experiments. One issue with designing the challenge is that it needs to balance a good training pipeline, a score gap for better data, and compute constraints. All of these need to be satisfied while we iterate on our synthetic data to match them. We’ll try to improve the training pipeline accordingly if we can meet these constraints in a reasonable way.
I agree. There’s now an incentive not to buy the most useful images, but rather images that a model can learn from and improve on within the first few epochs. It would probably rule out “difficult” images. It’s quite likely that this is of little practical relevance. While that’s OK for the competition’s sake, it would still be good if the results here had some practical relevance.
While I would appreciate the training pipeline being made more realistic, I hope such a change won’t be implemented a week before the deadline and force us to make big changes.
Thanks, I totally understand the situation. I can imagine it’s much harder to host a competition than just to join as a competitor:)
Anyway, whether the modification would be made or not, I’ll try to do my best.
Hi, it seems there’s a bug in local_evaluation.py.
I think you should change
`time_available = COMPUTE_BUDGET - (time_started - time.time())`
to
`time_available = COMPUTE_BUDGET - (time.time() - time_started)`
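A tiny self-contained demonstration of why the sign matters (the budget value and the sleep standing in for real work are just illustrative):

```python
import time

COMPUTE_BUDGET = 3600          # illustrative budget in seconds
time_started = time.time()

time.sleep(0.01)               # stand-in for the actual purchasing/training work

# Buggy: (time_started - time.time()) is negative, so elapsed time is
# *added* to the budget, and time_available only ever grows.
buggy_available = COMPUTE_BUDGET - (time_started - time.time())

# Fixed: elapsed seconds are subtracted, so the budget shrinks as expected.
time_available = COMPUTE_BUDGET - (time.time() - time_started)

assert buggy_available > COMPUTE_BUDGET
assert time_available < COMPUTE_BUDGET
```

So with the buggy expression the compute-budget check can never trip, no matter how long a submission runs.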
Thanks for pointing it out, I have updated it in the starter kit.
Hi @dipam, is there any update about this?
Or, please let me know if you’ve already decided to stick with the current training pipeline; in that case I’ll optimize my purchase strategy for it.
Seems I missed your message. On the point about ruling out the most useful images: do you feel it’s a huge issue with the current training pipeline?
We do want solutions to be as agnostic to the training pipeline as possible while buying the best images. Yes, it’s not completely possible to make things training-agnostic, but that is the spirit of the competition we’d like to promote. If you find that you’re deliberately having to drop too many of the useful images, please let me know.
The question is how you define the best or most useful images. If it’s what’s best for improving a 10-epoch effnet-b4 (which I suspect is underfitting), the current scheme makes sense.
But in practice, I guess people would decide to add data after trying to improve the model with the current data and finding that performance still doesn’t reach the expected level.
So my definition of “useful” here is “useful for improving the performance of a well-enough fine-tuned model”. And I suspect the current post-training pipeline doesn’t reach that level, IMHO.
Yes, what you say makes sense. On the other hand, though, a very strong model was getting nearly as good as “all labels purchased” scores with just random purchases, so the dataset needed more difficulty; important lessons learnt. In any case, I agree with your definition of useful; for now we’ve come up with the end-of-competition evaluation scheme. Please check the recent post.