Experiments with “unlabelled” data

Here are my results. I used the same model but different purchase modes.

  1. Train with the initial 5000 images only: LB 0.869
  2. Add 3000 random images from the unlabelled dataset: 0.881
  3. “smart” purchasing (at least non-random): 0.888

So we see that “smart” purchasing is helpful, but not by much, maybe ~0.01.
Probably tuning models would be more helpful to push further.
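
For context, a non-random (“smart”) purchase can be as simple as uncertainty sampling: score every unlabelled image with the current model and buy the ones it is least confident about. The sketch below is only an illustration under assumptions (PyTorch, hypothetical names `model` and `unlabelled_loader`, a loader that yields image batches in dataset order); the thread doesn’t say which selection rule was actually used.

```python
import torch

def select_uncertain(model, unlabelled_loader, budget=3000, device="cuda"):
    """Rank unlabelled images by predictive entropy and return the dataset
    indices of the `budget` most uncertain ones (hypothetical helper; the
    loader is assumed to yield image batches in dataset order, unshuffled)."""
    model.eval()
    entropies = []
    with torch.no_grad():
        for images in unlabelled_loader:
            probs = torch.softmax(model(images.to(device)), dim=1)
            # High entropy = the model is unsure about this image.
            ent = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
            entropies.append(ent.cpu())
    order = torch.argsort(torch.cat(entropies), descending=True)
    return order[:budget].tolist()
```

The returned indices would then be handed to whatever purchase API the competition exposes, instead of a random sample.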

@sergey_zlobin: Thanks for the information.

I am wondering: if you purchased all 10K, what would the score be?

The scores I wrote are from the leaderboard, and I can’t check 10K there…
Local scores are a little higher than the LB, but correlated with it.
Yeah, maybe I’ll check it locally.

I’ve checked it locally.
Using all 10K images is better than my 3K selection by 0.006. Maybe I can recover some of that gap by changing the purchasing algorithm, but I still feel I need to tune my model.

Yeah, it looks like your model/training pipeline is limiting you. You should be able to get a much bigger improvement from using all the labels. Maybe try a bigger model and tune it a bit.
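
As a concrete (assumed) illustration of “a bigger model”, one could swap in a larger ImageNet-pretrained backbone from torchvision and replace only the classification head; the thread doesn’t say which architecture was actually used, and the weights enum below needs torchvision ≥ 0.13.

```python
import torch.nn as nn
from torchvision import models

def build_bigger_model(num_classes: int) -> nn.Module:
    # Larger pretrained backbone; only the final classification layer is new.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```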

Do you mean the score reported by @sergey_zlobin for the pre-training phase, or for the purchase phase?

I mean that adding 10000 vs 3000 images shouldn’t result in such a modest improvement.

Ah, yes, for the purchase phase. I think maybe only ~4000-5000 of the images are good enough to improve the system, so adding all 10000 doesn’t help much because the base model predicts the rest correctly anyway.
It’s just a theory, not tested. I haven’t touched the purchase part yet.
Thanks for your insight.
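
One way to check that theory (not done in the thread) would be to count how many unlabelled images the pre-trained model already predicts with high confidence; the names `model` and `unlabelled_loader` below are assumptions, and the 0.9 threshold is arbitrary.

```python
import torch

def count_confident(model, unlabelled_loader, threshold=0.9, device="cuda"):
    """Count unlabelled images whose top-1 probability exceeds `threshold`.
    A large count would support the idea that most extra labels are redundant."""
    model.eval()
    confident = total = 0
    with torch.no_grad():
        for images in unlabelled_loader:
            probs = torch.softmax(model(images.to(device)), dim=1)
            confident += (probs.max(dim=1).values > threshold).sum().item()
            total += images.size(0)
    return confident, total
```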

Yes, it’s very significant.

From my experiment notebook, it’s something like this:

| exp no. | augmentation | pretrained | purchase_method | score_pretraining_phase | score_purchase_phase | score_validation_phase | LB_Score |
|---|---|---|---|---|---|---|---|
| 1 | NO | NO | NO | 0.773 | 0.773 | 0.760 | |
| 2 | NO | NO | RANDOM 3000 | 0.773 | 0.804 | 0.760 | |
| 3 | NO | NO | ALL 10000 | 0.773 | 0.841 | 0.835 | |
| 4 | NO | YES | NO | 0.857 | 0.857 | 0.850 | |
| 5 | NO | YES | RANDOM 3000 | 0.857 | 0.864 | 0.845 | 0.851 |
| 6 | NO | YES | ALL 10000 | 0.857 | 0.892 | 0.875 | |
| 7 | YES | YES | NO | 0.868 | 0.868 | 0.865 | |
| 8 | YES | YES | RANDOM 3000 | 0.868 | 0.886 | 0.869 | 0.880 |
| 9 | YES | YES | ALL 10000 | 0.868 | 0.902 | 0.893 | |

The notebook:
