Training Dataset

Is it possible to build your training dataset based on similarity to the provided test set (for example, by picking the most relevant images from products10k or other open-source datasets)?
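
To make the idea concrete, here is a minimal sketch of similarity-based selection. It assumes you already have an embedding vector for every candidate image and every test image (from any pretrained model); the function names and the toy vectors are hypothetical and only illustrate the ranking step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_most_similar(candidates, test_set, k):
    """Return indices of the k candidates closest to any test embedding."""
    scored = []
    for idx, emb in enumerate(candidates):
        # Score each candidate by its best match in the test set.
        best = max(cosine(emb, t) for t in test_set)
        scored.append((best, idx))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [idx for _, idx in scored[:k]]

# Toy embeddings (assumption: real ones would come from a feature extractor).
test_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = [
    [0.9, 0.1, 0.0],   # close to test vector 0
    [0.0, 0.0, 1.0],   # orthogonal to both test vectors
    [0.1, 0.8, 0.1],   # close to test vector 1
    [-1.0, 0.0, 0.0],  # opposite of test vector 0
]
chosen = select_most_similar(candidates, test_set, k=2)
print(chosen)
```

For a large dataset you would probably want a vectorized or approximate nearest-neighbour search instead of this double loop.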

I was thinking about the same thing, but my investigation so far does not show a positive impact from making Product10k more similar to the test set.
What I have done:

  1. Selected only the [shoes, eyewear] categories from Product10k, as these categories represent >80% of the data
  2. Trained a model on the resulting dataset:
     • The model trains much faster, as the training dataset is ~5x smaller
     • Validation scores also rise much faster, gaining ~2% mAP
  3. The submitted model is worse than the model trained on the whole Product10k dataset
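
The filtering in step 1 can be sketched roughly as below. This is only an illustration: the record layout (image path plus category name) and the sample rows are hypothetical, not the actual Product10k annotation format.

```python
from collections import Counter

# Hypothetical annotation records: (image_path, category) pairs.
records = [
    ("img_001.jpg", "shoes"),
    ("img_002.jpg", "eyewear"),
    ("img_003.jpg", "food"),
    ("img_004.jpg", "shoes"),
    ("img_005.jpg", "cosmetics"),
]

# Categories assumed to resemble the test set.
KEEP = {"shoes", "eyewear"}

def filter_by_category(records, keep):
    """Keep only the records whose category is in `keep`."""
    return [rec for rec in records if rec[1] in keep]

subset = filter_by_category(records, KEEP)
counts = Counter(category for _, category in subset)
print(len(subset), dict(counts))
```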

So it looks like, even though the description says "Example products include sandals and sunglasses", they are not the only products. Or the ratio between fashion images and other images is different.
This is my current state of knowledge.


Thank you for the reply! I still haven't built a test-based dataset, but some adjustments based on the test distribution could probably improve the score.