Size of Datasets

Hello!
During submission, the dataset sizes are only 100 (for both the training dataset and the unlabelled dataset).
This is probably the debug version.
Is this intentional?

AICrowd has their own debug dataset on which they test whether your pipeline works. If your code runs fine on their debug set, they proceed with their actual training, purchase, and prediction datasets.
Don’t worry, your submission does not take into account the datasets you have on your system. AICrowd only uses the run.py file to create an instance of the class ZEWDPCBaseRun and passes in their own datasets.
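For intuition, here is a minimal sketch of what the evaluator side might look like. Only run.py and ZEWDPCBaseRun come from the challenge; the method names and arguments below are assumptions for illustration, not the actual evaluator code.

```python
# Hypothetical sketch of how the evaluator could drive your submission.
# Method names and arguments are assumptions, not the real evaluator API.
from run import ZEWDPCBaseRun  # your submitted run.py


def evaluate_submission(training_dataset, unlabelled_dataset, test_dataset, budget):
    run = ZEWDPCBaseRun()                          # instantiated from your run.py
    run.pre_training_phase(training_dataset)       # assumed training hook
    run.purchase_phase(unlabelled_dataset,         # assumed label-purchasing hook
                       training_dataset,
                       budget)
    predictions = run.prediction_phase(test_dataset)  # assumed prediction hook
    return predictions
```

The same flow would be run first on the small debug datasets and then again on the full datasets, which is why you see two runs.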


Ahh… I see, so AICrowd runs the whole pipeline twice, and I can only see logs from the debug version.
Great, thanks!

Yep. Regarding logs, you can only see the time required at each step (on the actual datasets). I suppose there is some sort of security reason why they don’t show the number of images your model trained on and evaluated.


What’s your comment about this, @shivam?
I think it’s 5000 training images, 3000 to purchase, and 3000 to test, right? Just as in the overview.

Yes, that’s correct.
As Gaurav shared, the logs are from the validation phase only, which has a lower image count.


In the actual run:

5k training images
10k unlabelled images (3k can be purchased)
3k testing images

Quick viz: [image]


What assumptions are we allowed to make about the distribution of labels? Are they the same for training, unlabelled, and testing?

The distributions can be different.

From the challenge’s introduction:

Third, data has to be acquired under the assumption of being valuable out-of-sample. Distribution shifts have to be anticipated.
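If you want a quick sense of how different your local splits are, one illustrative way is to compare per-class label frequencies. This is just a sketch: the sample format (a dict with a `"label"` entry) is an assumption, so adapt it to the actual dataset API, and note that on the evaluator side the unlabelled set’s labels are of course not visible.

```python
from collections import Counter


def label_distribution(dataset):
    """Return per-class label frequencies for a dataset of labelled samples.

    Assumes each sample is a dict exposing a hashable "label" entry;
    adapt this to the real dataset interface (e.g. multi-label fields).
    """
    counts = Counter()
    for sample in dataset:
        counts[sample["label"]] += 1
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Compare splits locally, where labels are available:
# print(label_distribution(training_dataset))
# print(label_distribution(local_validation_dataset))
```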
