📥 Guidelines For Using External Dataset

Yes, attaching a public link would be sufficient. If the pre-trained models are available in popular libraries like timm or torchvision, a link to them will also suffice.

Hi @snehananavati

Does this mean we need to publish the pre-trained models we used, even though they are publicly available?

I thought we needed to publish datasets or models only if they are not publicly available. For example, if someone collects data via web scraping, or uses a pre-trained model that is not publicly available.

How can I post the dataset links? I can’t see any link to the form you attached to this post.

All the dataset links I used:
  1. https://www.pinlandata.com/rp2k_dataset/
  2. DeepFashion Database
  3. Shopee - Price Match Guarantee | Kaggle
  4. Alibaba goods dataset | Kaggle

@long_nguyen_hoang are we allowed to use Shopee? It’s licensed as competition-only on Kaggle.

I’m not sure. But I saw that the rule for this dataset is “Competition Use Only”, and we are in a competition too. So I think we are allowed to use it :wink:

Datasets that we used for transfer learning:

Shopee
MET Artwork Dataset
Alibaba goods
H&M Personalized Fashion
GPR1200
Deep Fashion - Consumer-to-shop Clothes Retrieval Benchmark part
DyML Product
Stanford Online Products
Our custom dataset from web scraping

Models: only from timm (GitHub - huggingface/pytorch-image-models).

I used the following datasets:

  1. https://products-10k.github.io/
  2. SOP

And in the future I will use:
Amazon
AliProducts

The link to DyML Product doesn’t work for me.
Also, web scraping is a very sensitive subject, since you need to ensure that every image comes with a Creative Commons license.
As mentioned above, the Shopee dataset was allowed only for that Kaggle competition.

Hmm, indeed the link is not working, but two months ago it worked. Perhaps the organizers will tell you what to do in such a situation.

I’m using the following dataset:

Pretrained models:

At the bottom of the main page of the CVPR 2021 AliProducts Challenge: Large-scale Product Recognition (Alibaba Cloud Tianchi), it says “If you find the dataset is helpful in your research, please consider citing our paper:”, which implies it can be used for research purposes. One can download the dataset via the links in AI_Product-Competition/get_dataset_AiProducts.sh at 4464263490143d0376a344520b0717ee91b0ff7b · pengxiaoxiao/AI_Product-Competition · GitHub.
@snehananavati Is it OK to use?

I used the following datasets:

And pre-trained models:

I used the following datasets:
Products10k
Deep fashion
Amazon dataset
Alibaba goods
Models:
timm

I used the Products10K and Shopee datasets.

Also pretrained CLIP models from Hugging Face.

Hi, we used the VGG16 model pre-trained on the ImageNet dataset. A subset of the dataset is available at ImageNet Object Localization Challenge | Kaggle

I’m using the LAION CLIP model with the Products10K dataset.

We have tried:
rp2k
JD_Products_10K
Shopee
Aliproducts
DeepFashion_CTS
DeepFashion2
Fashion_200K
Stanford_Products

Right now we are using only Products_10K and models from OpenCLIP.

Hi @dipam and @snehananavati

I have a question about external datasets. I understand that for training models, images need to have a Creative Commons license, such as the Kaggle competition datasets (e.g., Shopee). However, these datasets are easily accessible to others and are often used in personal projects or papers, even by people who didn’t participate in the competition. I’m curious about how strict the organizers will be regarding the use of external data. Private data or datasets with broken download links are not acceptable because they are not accessible to the public. However, for datasets that are easily accessible, such as those found in Kaggle competitions, I believe they should be acceptable for use. :slight_smile:

This dataset doesn’t have an annotation file.

This data file doesn’t have any annotation.