🚀 Datasets Released & Submissions Open 🚀

mohanty · March 28, 2022, 5:32pm

Dataset Released

NOTE: This post has been updated to reflect the changes to the dataset. Please use the v0.2 version of the dataset starting 1st April, 2022.

The Shopping Queries Dataset for the Amazon KDD Cup 2022 - ESCI Challenge for Improving Product Search has been released.

You can access the datasets for each of the Tracks on the Resources Page.

The datasets contain the following files:

.
├── task_1_query-product_ranking
│   ├── product_catalogue-v0.2.csv.zip 
│   ├── sample_submission-v0.2.csv
│   ├── test_public-v0.2.csv.zip
│   └── train-v0.2.csv.zip
├── task_2_multiclass_product_classification
│   ├── product_catalogue-v0.2.csv.zip
│   ├── sample_submission-v0.2.csv
│   ├── test_public-v0.2.csv.zip
│   └── train-v0.2.csv.zip
└── task_3_product_substitute_identification
    ├── product_catalogue-v0.2.csv.zip
    ├── sample_submission-v0.2.csv
    ├── test_public-v0.2.csv.zip
    └── train-v0.2.csv.zip

The product_catalogue-v0.2.csv.zip for all the tasks has the following columns: product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale

Task 1

train-v0.2.csv.zip contains the following columns: query_id, query, query_locale, product_id, esci_label
test_public-v0.2.csv.zip contains the following columns: query_id, query, query_locale, product_id
sample_submission-v0.2.csv contains the following columns: query_id, product_id

Task 2

train-v0.2.csv.zip contains the following columns: example_id, query, product_id, query_locale, esci_label
test_public-v0.2.csv.zip contains the following columns: example_id, query, product_id, query_locale
sample_submission-v0.2.csv contains the following columns: example_id, esci_label

Task 3

train-v0.2.csv.zip contains the following columns: example_id, query, product_id, query_locale, substitute_label
test_public-v0.2.csv.zip contains the following columns: example_id, query, product_id, query_locale
sample_submission-v0.2.csv contains the following columns: example_id, substitute_label
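
Once a file is downloaded, its header can be checked against the column lists above. The following is only a minimal sketch using the Python standard library. It assumes each .zip archive holds a single UTF-8 CSV, which the post does not confirm, and the helper names are made up for illustration.

```python
import csv
import io
import zipfile

# Expected columns for the Task 1 training file, copied from the list above.
TASK_1_TRAIN_COLUMNS = ["query_id", "query", "query_locale", "product_id", "esci_label"]


def read_zipped_csv_header(zip_path, member=None):
    """Return the header row of a CSV stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds one CSV; pass `member` if it does not.
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            # "utf-8-sig" reads plain UTF-8 and also strips a BOM if present.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return next(csv.reader(text))


def missing_columns(header, expected):
    """Return the expected column names that are absent from the header."""
    present = set(header)
    return [column for column in expected if column not in present]
```

For example, `missing_columns(read_zipped_csv_header("train-v0.2.csv.zip"), TASK_1_TRAIN_COLUMNS)` should return an empty list if the file matches the description.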

Download via CLI [more commands]

aicrowd dataset download -c esci-challenge-for-improving-product-search

# if you don't have AIcrowd CLI installed
pip install -U aicrowd-cli

Submissions

You can make submissions by clicking the Create Submission button on the challenge page. Please remember to select the correct Task from the drop-down before submitting.

The Create Submission button is only accessible after you accept the challenge rules by clicking on the Participate button.

We strongly recommend making a first submission using the included sample_submission files for each of the tracks.

Best of luck!

jacques_peeters · March 28, 2022, 11:30pm

I am struggling to download the files. They look like huge unzipped CSV files, and when I click on download, it does not download the file but opens an Excel online view of it, from which I then need to download.

It works for small files, but I struggle with product_catalogue.

Am I the only one? Am I doing something wrong?

Best,
Jacques

shivam · March 28, 2022, 11:34pm

Hi @jacques_peeters, that’s odd, and thanks for letting us know.

Let us check and upload compressed versions as well asap. Meanwhile, you can use “Save link as” in case your browser isn’t saving the file by default.

jacques_peeters · March 28, 2022, 11:40pm

I should have been smarter and tried “Save Link As…”, thank you.

Maybe it is a Chrome extension on my side.

zeromq · March 29, 2022, 2:35am

Hi guys,

Does anyone know how to download the dataset with wget rather than through the download button? e.g.
wget AIcrowd

good-good-study · March 29, 2022, 4:19am

Just click the download button and copy the link from the browser's downloads page; it should be an AWS link.
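
If wget is not at hand, the copied link can also be fetched from a script. This is a rough sketch using only the Python standard library. The function name is hypothetical, and the URL is whatever link you copied from the browser, which may expire after a while.

```python
import shutil
import urllib.request


def download_file(url, destination, chunk_size=1024 * 1024):
    """Stream `url` to `destination` without loading the whole file into memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Copy in fixed-size chunks so multi-GB files do not exhaust RAM.
        shutil.copyfileobj(response, out, chunk_size)
    return destination


# Usage (the URL is a placeholder for the link copied from the browser):
# download_file("<copied download link>", "product_catalogue.csv")
```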

shuliang · March 29, 2022, 6:29am

The download is too slow.

Roundrobin · March 29, 2022, 6:29am

What is the maximum number of submissions allowed for one task?

shivam · March 29, 2022, 7:31am

Hi @zeromq, @good-good-study, others,

You can also use the AIcrowd CLI to download the datasets if you prefer the terminal.

For listing all the files for this challenge:

❯ aicrowd dataset list --challenge esci-challenge-for-improving-product-search

                                     Datasets for challenge #1031
┌────┬──────────────────────────────────────────────────────────────────────┬─────────────┬───────────┐
│ #  │ Title                                                                │ Description │      Size │
├────┼──────────────────────────────────────────────────────────────────────┼─────────────┼───────────┤
│ 0  │ Task 1: Query Product Ranking/product_catalogue-v0.1.csv             │ -           │   1.06 GB │
│ 1  │ Task 1: Query Product Ranking/sample_submission-v0.1.csv             │ -           │   1.23 MB │
│ 2  │ Task 1: Query Product Ranking/test_public-v0.1.csv                   │ -           │ 247.81 KB │
│ 3  │ Task 1: Query Product Ranking/train-v0.1.csv                         │ -           │  42.20 MB │
│ 4  │ Task 2: Multiclass Product Classification/product_catalogue-v0.1.csv │ -           │   2.13 GB │
│ 5  │ Task 2: Multiclass Product Classification/sample_submission-v0.1.csv │ -           │   7.00 MB │
│ 6  │ Task 2: Multiclass Product Classification/test_public-v0.1.csv       │ -           │  17.86 MB │
│ 7  │ Task 2: Multiclass Product Classification/train-v0.1.csv             │ -           │  96.16 MB │
│ 8  │ Task 3: Product Substitute Identification/product_catalogue-v0.1.csv │ -           │   2.13 GB │
│ 9  │ Task 3: Product Substitute Identification/sample_submission-v0.1.csv │ -           │   8.09 MB │
│ 10 │ Task 3: Product Substitute Identification/test_public-v0.1.csv       │ -           │  17.86 MB │
│ 11 │ Task 3: Product Substitute Identification/train-v0.1.csv             │ -           │ 106.44 MB │
└────┴──────────────────────────────────────────────────────────────────────┴─────────────┴───────────┘

For downloading all the files:

❯ aicrowd dataset download --challenge esci-challenge-for-improving-product-search

For downloading selected files:

# Using a wildcard or file names; the command below downloads all Task 1 files
❯ aicrowd dataset download --challenge esci-challenge-for-improving-product-search "Task 1*"

# Using the ID shown in the table from the listing command
❯ aicrowd dataset download --challenge esci-challenge-for-improving-product-search 1

And to install the AIcrowd CLI, you can run:

❯ pip install -U aicrowd-cli

Welcome to KDD Cup! We look forward to your submissions soon.

shivam · March 29, 2022, 7:34am

Hi @shuliang, the files are hosted on S3, so an issue on the server side is unlikely.
Can you try a different ISP or network, in case yours is throttling the download speed?

We will also release compressed versions soon to make the data more accessible.
Please keep the feedback coming!

shivam · March 29, 2022, 7:36am

Hi @Roundrobin, each team can make 5 submissions per task per day.

shreyansdhankhar · March 29, 2022, 11:02am

Hi @shivam, for Task 1, I see that query IDs are overlapping. Is this expected? See the attached screenshot for reference.
The query text for query_id=0 also appears under query_id=1, and so on.
Screenshot 2022-03-29 at 4.29.08 PM
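
For anyone who wants to reproduce this without the screenshot, here is a small sketch of one possible check. It assumes "overlapping" means the same (query, query_locale) pair appearing under more than one query_id; the function name is made up for illustration, and the rows are expected as dictionaries, e.g. from `csv.DictReader`.

```python
from collections import defaultdict


def queries_with_multiple_ids(rows):
    """Map each (query, query_locale) pair to its query_ids when it has several."""
    ids_by_query = defaultdict(set)
    for row in rows:
        key = (row["query"], row["query_locale"])
        ids_by_query[key].add(row["query_id"])
    # Keep only the query texts that are attached to more than one query_id.
    return {key: sorted(ids) for key, ids in ids_by_query.items() if len(ids) > 1}
```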

shivam · March 29, 2022, 11:04am

Hi @shreyansdhankhar, we are looking into it at the moment. Stay tuned for updates!

shreyansdhankhar · March 29, 2022, 11:06am

Cool, I'll wait for the update then!

shreyansdhankhar · March 29, 2022, 1:04pm

@shivam: for each query, do we need to recommend the top 10 product IDs in Task 1?

vitor_amancio_jerony · March 29, 2022, 6:49pm

Also, the test set for Task 1 contains many duplicates.
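
A quick way to count them, as a sketch only: it assumes a duplicate means a repeated (query_id, product_id) pair in the unzipped test file, and the helper names are hypothetical.

```python
import csv
from collections import Counter


def count_duplicate_pairs(rows, key_columns=("query_id", "product_id")):
    """Map each key that occurs more than once to its number of occurrences."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def duplicates_in_csv(csv_path):
    """Run the duplicate count over an unzipped CSV file with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return count_duplicate_pairs(csv.DictReader(handle))
```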

mohanty · April 1, 2022, 12:29am

This should be addressed in the v0.2 release of the dataset.

ystkin · April 6, 2022, 1:58am

Hi @mohanty @shivam ,
I have some questions about the timeline and rules.

Entry Deadline: July 15, 2022 at 00:00:00 UTC. I’d like to know the exact meaning. Which is correct: submit by July 14 at 23:59:59, or by July 15 at 23:59:59?
Can we use the other tasks’ datasets? E.g., build a model for Task 1 by training on the datasets of Tasks 1-3.
Can we use external data?
Can we use public pre-trained models?
When the admins or the AIcrowd platform run inference on the test dataset using my submission (code and model), is there a limit on computation time?

olivertautz · July 7, 2022, 2:56pm

Hi! I want to make a code submission, but it says “I am not authorized to access this page”. You say I need to accept the challenge rules by clicking on the Participate button, but I can’t find it. Could you help me find it?

shivam · July 8, 2022, 6:02pm

Hi @olivertautz,

You can find the Participate button on the challenge page:
https://www.aicrowd.com/challenges/esci-challenge-for-improving-product-search