Here is a rundown of my solution that achieved 0.9 (current 5th) on the leaderboard.
1. Train on the 5,000 labelled images (flip and rotation augmentation, pretrained EfficientNet-B1).
2. Purchase 3,000 labels, using a heuristic that primarily biases purchases toward low-accuracy classes. This was chosen because of the significant difference in accuracy across the four classes, and it works well in conjunction with pseudolabelling.
3. Pseudolabel the remaining dataset, using a confidence threshold of 0.7 after temperature scaling. This achieves 99.5%+ accuracy on the pseudolabelled subset, which covers roughly 5,000 of the 7,000 unlabelled images when all the suspected dents are purchased beforehand.
4. Train again!
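The pseudolabelling step can be sketched roughly as below. This is a minimal NumPy illustration rather than our actual training code: the temperature value and the toy logits are made up, and only the 0.7 confidence threshold comes from the writeup.

```python
import numpy as np

def pseudolabel(logits, temperature=2.0, threshold=0.7):
    """Temperature-scaled softmax confidence filter (illustrative sketch).

    logits: (N, C) model outputs on the unlabelled pool.
    Temperature > 1 softens the probabilities, so the 0.7 threshold
    behaves more conservatively than on raw softmax outputs.
    Returns the indices of confidently labelled samples and their
    hard pseudolabels.
    """
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs[keep].argmax(axis=1)

# toy example: 3 unlabelled samples, 4 classes
logits = np.array([[5.0, 0.0, 0.0, 0.0],   # confident -> kept
                   [1.0, 0.9, 0.8, 0.7],   # ambiguous -> dropped
                   [0.0, 0.0, 5.0, 0.0]])  # confident -> kept
keep, labels = pseudolabel(logits)
print(keep, labels)  # [0 2] [0 2]
```

Samples whose scaled softmax confidence falls below the threshold are simply excluded and can be retried in a later pseudolabelling round.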
Details can be found in the attachment: run.zip (4.8 KB)
We also attempted semi-supervised methods such as Mean Teacher and FixMatch, though the results were less than ideal. The performance of this naive solution was quite surprising, and I think it is predominantly because the allocated purchase budget is large enough to buy all of the difficult classes (the dents). In fact, it comes very close to an upper-bound experiment of 0.91 with the same architecture.
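For illustration, the purchase heuristic (biasing the budget toward classes the current model handles poorly) could look something like the sketch below. The per-class accuracies and the error-rate priority score are my own assumptions, not the exact submitted code.

```python
import numpy as np

def purchase_order(pred_classes, class_accuracy, budget, seed=0):
    """Pick which unlabelled samples to buy labels for (illustrative).

    pred_classes: (N,) predicted class id for each candidate sample.
    class_accuracy: (C,) per-class validation accuracy.
    Samples whose predicted class has low accuracy get higher priority;
    a tiny random jitter breaks ties within a class.
    """
    priority = 1.0 - class_accuracy[pred_classes]  # error rate of predicted class
    rng = np.random.default_rng(seed)
    priority = priority + 1e-6 * rng.random(len(pred_classes))
    return np.argsort(-priority)[:budget]

# toy example: 4 classes where the dent classes (2, 3) are weak
class_acc = np.array([0.97, 0.95, 0.70, 0.72])
preds = np.array([0, 2, 1, 3, 2, 0])
chosen = purchase_order(preds, class_acc, budget=3)
print(sorted(chosen.tolist()))  # [1, 3, 4] -- the dent predictions win the budget
```

With a real budget the priority could also be mixed with the model's own uncertainty, but the class-level bias alone already captures the "buy the dents" behaviour described above.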
Some improvements, given more time, might include pseudolabelling for more rounds, and treating the task as four separate binary classification problems (which would allow us to partially pseudolabel data).
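The four-binary-tasks idea would allow partial pseudolabels along these lines. This is purely a sketch of the proposed improvement (we never implemented it); the thresholds and toy logits are invented.

```python
import numpy as np

def partial_pseudolabels(logits, threshold=0.9):
    """Treat each class as an independent binary task (sketch).

    logits: (N, C) raw scores from one sigmoid head per class.
    Returns (labels, mask): labels are 0/1 decisions per class, and
    mask marks which decisions are confident in either direction.
    Unconfident entries can be excluded from the loss, so a sample
    can be pseudolabelled for some classes but not others.
    """
    probs = 1.0 / (1.0 + np.exp(-logits))  # per-class sigmoid
    labels = (probs >= 0.5).astype(int)
    mask = (probs >= threshold) | (probs <= 1 - threshold)
    return labels, mask

logits = np.array([[4.0, -3.0, 0.2, -4.0]])  # class 2 is uncertain
labels, mask = partial_pseudolabels(logits, threshold=0.9)
print(labels)  # [[1 0 1 0]]
print(mask)    # [[ True  True False  True]]
```

With a softmax over all four classes, one ambiguous class forces the whole sample to be dropped; here only the uncertain head is masked out.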
I do hope other teams will be willing to share their solutions as well before part 2 begins, and that the problem specification is adjusted to encourage other, more sophisticated approaches. Clearly this naive technique does not generalize well to other data science tasks.
Not an exact value, as we slightly modified the purchase and training parameters throughout, but the scale of the boost should be in the right ballpark: about 0.018, from 0.884 to 0.902 on the servers.
Thanks for sharing this! I have thought about improving the minority classes too, but I wonder whether spending more budget on those smaller classes is optimal for your overall average accuracy. It would be interesting to run your solution with and without the part where you use the confusion matrix to influence purchase decisions. Maybe you’ve already done this experiment?
Edit: with macro-weighted F1 as the metric in round 2, this question is obsolete. It’s now clear that focusing on minority classes is important.
Buying low-accuracy labels (dents) makes the most sense in this challenge; it sounds easy but is harder than it looks. I had the exact same heuristic, with the addition of one more policy (don’t judge by my score, it is only a baseline; I didn’t submit the solution because I had too much on my plate).
To give some perspective, here are the confusion matrices for dent_small and dent_large respectively.
dent_small:
[[7327  392]
 [ 719 1562]]

dent_large:
[[8736  116]
 [ 328  820]]
The layout is [[tn, fp], [fn, tp]]. Note that fp + fn for the dent classes is much higher than for the scratch classes.
One approach to deal with this could be a weighted loss function. I haven’t implemented it, so I can’t say how much it would improve things.
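A minimal sketch of the inverse-frequency weighting idea. The dent counts below are the row sums of the matrices above (719 + 1562 and 328 + 820); the scratch counts are placeholders I invented for the example.

```python
import numpy as np

# actual-class counts per class; dent counts come from the confusion
# matrices above, scratch counts are illustrative placeholders
counts = np.array([9000.0, 8500.0, 2281.0, 1148.0])

# inverse-frequency weights, normalised so the mean weight is 1
weights = counts.sum() / (len(counts) * counts)

def weighted_ce(probs, targets, w):
    """Cross-entropy where each sample is scaled by its true-class
    weight, so mistakes on the rare dent classes cost more."""
    ce = -np.log(probs[np.arange(len(targets)), targets])
    return float((w[targets] * ce).mean())

# the same misclassification is penalised more for a dent target
p = np.full((1, 4), 0.25)
print(weighted_ce(p, np.array([0]), weights) < weighted_ce(p, np.array([3]), weights))  # True
```

In PyTorch the equivalent effect comes from passing a `weight` tensor to `torch.nn.CrossEntropyLoss`.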