Size of Datasets

Hello!
During submission, the dataset sizes are only 100 (for both the training dataset and the unlabelled dataset).
This is probably the debug version.
Is this intentional?

AICrowd has their own debug dataset on which they test whether your pipeline works. If your code runs fine on their debug set, they proceed with their actual training, purchase, and prediction datasets.
Don’t worry, your submission does not take into account the datasets you have on your system. AICrowd only uses the run.py file to create an instance of the class ZEWDPCBaseRun and passes in their own datasets.
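For intuition, here is a minimal sketch of what the evaluator side might look like. Only run.py and ZEWDPCBaseRun come from the challenge; the method names and arguments below are assumptions for illustration, not the actual evaluator code.

```python
# Hypothetical sketch of how the evaluator could drive your submission.
# Method names and arguments are assumptions, not the real evaluator API.
from run import ZEWDPCBaseRun  # your submitted run.py


def evaluate_submission(training_dataset, unlabelled_dataset, test_dataset, budget):
    run = ZEWDPCBaseRun()                          # instantiated from your run.py
    run.pre_training_phase(training_dataset)       # assumed training hook
    run.purchase_phase(unlabelled_dataset,         # assumed label-purchasing hook
                       training_dataset,
                       budget)
    predictions = run.prediction_phase(test_dataset)  # assumed prediction hook
    return predictions
```

The same flow would be run first on the small debug datasets and then again on the full datasets, which is why you see two runs.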


Ahh… I see, so AICrowd runs the whole pipeline twice, and I can only see logs from the debug version.
Great, thanks!

Yep. Regarding logs, you can only see the time required at each step (on the actual datasets). I suppose there is some sort of security reason why they don’t show the number of images your model trained on and evaluated.


What’s your comment about this, @shivam?
I think it’s 5000 training images, 3000 to purchase, and 3000 to test, right? Just as in the overview.

Yes, that’s correct.
As Gaurav shared, the logs are from the validation phase only, which has a lower image count.


In the actual run:

5k training images
10k unlabelled images (3k can be purchased)
3k testing images

Quick viz: [image]


What assumptions are we allowed to make about the distribution of labels? Are they the same for training, unlabelled, and testing?

The distributions can be different.

From the challenge’s introduction:

Third, data has to be acquired under the assumption of being valuable out-of-sample. Distribution shifts have to be anticipated.
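If you want a quick sense of how different your local splits are, one illustrative way is to compare per-class label frequencies. This is just a sketch: the sample format (a dict with a `"label"` entry) is an assumption, so adapt it to the actual dataset API, and note that on the evaluator side the unlabelled set’s labels are of course not visible.

```python
from collections import Counter


def label_distribution(dataset):
    """Return per-class label frequencies for a dataset of labelled samples.

    Assumes each sample is a dict exposing a hashable "label" entry;
    adapt this to the real dataset interface (e.g. multi-label fields).
    """
    counts = Counter()
    for sample in dataset:
        counts[sample["label"]] += 1
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Compare splits locally, where labels are available:
# print(label_distribution(training_dataset))
# print(label_distribution(local_validation_dataset))
```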
