Is it possible to build your training dataset based on similarity to the provided test set (for example, by picking the most relevant images from products10k or other open-source datasets)?
I was thinking about the same thing, but my investigation so far does not show a positive impact from making Product10k more similar to the test set.
What I have done:
- Selected the Product10k categories [shoes, eyewear], as these categories represent >80% of the data
- Trained a model on this subset:
  - The model trains much faster, as the training set is ~5x smaller
  - Validation scores also rise much faster, gaining ~2% mAP
  - The submitted model is worse than the model trained on the whole Product10k dataset
So it looks like, even though the description says "Example products include sandals and sunglasses", they are not the only products. Or the ratio between fashion-based images and the others is different.
This is my current state of knowledge.
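For anyone who wants to repeat the category filtering step, it can be sketched roughly as below. This is only a minimal illustration: the `category` field, the file names, and the toy records are made up, and the real Product10k annotations may be organized differently.

```python
from collections import Counter


def filter_by_category(records, keep):
    """Keep only records whose (hypothetical) 'category' field is in `keep`."""
    keep = set(keep)
    return [r for r in records if r["category"] in keep]


# Toy stand-in for the dataset annotations; field names and values are invented.
records = [
    {"image": "a.jpg", "category": "shoes"},
    {"image": "b.jpg", "category": "eyewear"},
    {"image": "c.jpg", "category": "cosmetics"},
    {"image": "d.jpg", "category": "shoes"},
    {"image": "e.jpg", "category": "food"},
]

subset = filter_by_category(records, ["shoes", "eyewear"])
counts = Counter(r["category"] for r in subset)
print(len(subset), "of", len(records), "records kept:", dict(counts))
```

Comparing `counts` before and after filtering is a quick way to check how much smaller the training set becomes.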
Thank you for the reply! I still haven't built a test-based dataset, but some adjustments based on the test distribution can probably improve the score.
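One possible way to do a similarity-based selection is to rank candidate images by the cosine similarity of their embeddings to the test-set embeddings and keep the top k. The sketch below assumes the embeddings have already been computed by some pretrained model; the function names, file names, and toy 2-D vectors are hypothetical, and whether this actually helps the leaderboard score is not something this thread confirms.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_similar(candidates, test_embeddings, k):
    """Rank candidates by their best similarity to any test embedding; return the top-k names."""
    scored = [
        (max(cosine(emb, t) for t in test_embeddings), name)
        for name, emb in candidates.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy embeddings; in practice these would come from a pretrained image model.
test_embeddings = [[1.0, 0.0], [0.9, 0.1]]
candidates = {
    "sandal.jpg": [0.95, 0.05],
    "sunglasses.jpg": [0.8, 0.3],
    "toaster.jpg": [0.0, 1.0],
}

selected = select_similar(candidates, test_embeddings, k=2)
print(selected)
```

With real data you would probably want a vectorized library for the similarity computation, since this pure-Python loop scales poorly with dataset size.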