Training Dataset

george · January 31, 2023, 8:29am

Is it possible to build your training dataset based on similarity to the provided test set (for example, by picking the most relevant images from Product10k or other open-source datasets)?
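For illustration, here is a minimal sketch of what such similarity-based selection could look like. It assumes every image has already been turned into an embedding vector by some feature extractor (not shown); the function names `cosine` and `select_most_similar` and the toy 2-D vectors are hypothetical and not part of any competition code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_most_similar(candidates, test_set, k):
    """Return indices of the k candidates closest to any test embedding."""
    # Score each candidate by its best similarity to any test image.
    scores = [max(cosine(c, t) for t in test_set) for c in candidates]
    # Sort candidate indices by score, most similar first.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Toy 2-D embeddings standing in for real image features (assumption).
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
test_set = [[1.0, 0.1]]
selected = select_most_similar(candidates, test_set, k=2)
print(selected)
```

With real data the pairwise loop would be too slow, so a vectorized or approximate nearest-neighbour search would probably be needed; the selection logic stays the same.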

bartosz_ludwiczuk · January 31, 2023, 2:41pm

I was thinking about the same thing, but my investigation so far does not show a positive impact from making Product10k more similar to the test set.
What I have done:

Selected the Product10k categories [shoes, eyewear], as these categories represent >80% of the data
Trained a model on that subset:

The model trains much faster, as the training set is ~5x smaller
Validation scores also rise much faster, gaining ~2% mAP

The submitted model is worse than the model trained on the whole Product10k dataset

So it looks like, even though the description says "Example products include sandals and sunglasses", they are not the only products. Or the ratio between fashion-based images and others is different.
This is my current state of knowledge.
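The category selection described above can be sketched roughly as follows. This is only an illustration: the record layout (`image` and `category` keys), the file names, and the helper `filter_by_category` are hypothetical, and the real Product10k annotations may be organised differently.

```python
def filter_by_category(records, keep):
    """Keep only the records whose category is in the `keep` set."""
    return [r for r in records if r["category"] in keep]

# Toy annotation records (assumption); real annotations would be loaded from disk.
records = [
    {"image": "a.jpg", "category": "shoes"},
    {"image": "b.jpg", "category": "food"},
    {"image": "c.jpg", "category": "eyewear"},
    {"image": "d.jpg", "category": "toys"},
]

subset = filter_by_category(records, keep={"shoes", "eyewear"})

# Fraction of the original data that survives the filter.
kept_fraction = len(subset) / len(records)
print(len(subset), kept_fraction)
```

Checking `kept_fraction` before training gives a quick sense of how much data the filter throws away.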

george · February 6, 2023, 5:20pm

Thank you for the reply! I still haven’t built a test-based dataset, but some enhancements based on the test distribution can probably improve the score.