🚀 Datasets Released & Submissions Open 🚀

Dataset Released :rocket::rocket::rocket:

:rotating_light::rotating_light::rotating_light: NOTE: This post has been updated to reflect the changes to the dataset. Please use the v0.2 version of the dataset starting 1st April, 2022.

The Shopping Queries Dataset for the Amazon KDD Cup 2022 - ESCI Challenge for Improving Product Search has been released.

You can access the datasets for each of the Tracks on the Resources Page.

The datasets contain the following files :

.
├── task_1_query-product_ranking
│   ├── product_catalogue-v0.2.csv.zip
│   ├── sample_submission-v0.2.csv
│   ├── test_public-v0.2.csv.zip
│   └── train-v0.2.csv.zip
├── task_2_multiclass_product_classification
│   ├── product_catalogue-v0.2.csv.zip
│   ├── sample_submission-v0.2.csv
│   ├── test_public-v0.2.csv.zip
│   └── train-v0.2.csv.zip
└── task_3_product_substitute_identification
    ├── product_catalogue-v0.2.csv.zip
    ├── sample_submission-v0.2.csv
    ├── test_public-v0.2.csv.zip
    └── train-v0.2.csv.zip

The product_catalogue-v0.2.csv.zip for all the tasks has the following columns : product_id, product_title, product_description, product_bullet_point, product_brand, product_color_name, product_locale

Task 1

  • train-v0.2.csv.zip contains the following columns : query_id, query, query_locale, product_id, esci_label
  • test_public-v0.2.csv.zip contains the following columns : query_id, query, query_locale, product_id
  • sample_submission-v0.2.csv.zip contains the following columns : query_id, product_id

Task 2

  • train-v0.2.csv.zip contains the following columns : example_id, query, product_id, query_locale, esci_label
  • test_public-v0.2.csv.zip contains the following columns : example_id, query, product_id, query_locale
  • sample_submission-v0.2.csv.zip contains the following columns : example_id, esci_label

Task 3

  • train-v0.2.csv.zip contains the following columns : example_id, query, product_id, query_locale, substitute_label
  • test_public-v0.2.csv.zip contains the following columns : example_id, query, product_id, query_locale
  • sample_submission-v0.2.csv.zip contains the following columns : example_id, substitute_label
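
As a quick sanity check after downloading, the column lists above can be compared against the header of a training archive. The following is a minimal sketch using only the Python standard library; the helper names and the assumption that each zip holds a single UTF-8 CSV file are illustrative, not something the organizers have confirmed.

```python
import csv
import io
import zipfile

# Training columns per task, copied from the lists in this post.
EXPECTED_COLUMNS = {
    "task_1": ["query_id", "query", "query_locale", "product_id", "esci_label"],
    "task_2": ["example_id", "query", "product_id", "query_locale", "esci_label"],
    "task_3": ["example_id", "query", "product_id", "query_locale", "substitute_label"],
}


def read_zipped_csv_header(zip_path):
    """Return the header row of the first CSV found inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: each archive contains exactly one CSV file.
        csv_names = [name for name in archive.namelist() if name.endswith(".csv")]
        if not csv_names:
            raise ValueError(f"no CSV file found in {zip_path}")
        with archive.open(csv_names[0]) as raw:
            # Assumption: the files are UTF-8 encoded.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return next(csv.reader(text))


def missing_columns(zip_path, task):
    """List the expected training columns that are absent from the file."""
    header = read_zipped_csv_header(zip_path)
    return [column for column in EXPECTED_COLUMNS[task] if column not in header]
```

For example, `missing_columns("train-v0.2.csv.zip", "task_1")` should return an empty list if the Task 1 training file matches the description above.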

Download via CLI [more commands]

aicrowd dataset download -c esci-challenge-for-improving-product-search

# if you don't have AIcrowd CLI installed
pip install -U aicrowd-cli

Submissions :rocket:

You can make submissions by clicking the Create Submission button on the challenge page. Please remember to select the correct Task from the drop-down before submitting.

The Create Submission button is only accessible after you accept the challenge rules by clicking on the Participate button.

We strongly recommend making a first submission using the included sample_submission file for each of the tracks.

Best of luck!


I am struggling to download the files. They look like huge unzipped CSV files, and when I click on download the file does not download but opens in an Excel online view, from which I then need to download it.

This works for small files, but I struggle with product_catalogue :frowning:

Am I the only one? Am I doing something wrong?

Best,
Jacques

Hi @jacques_peeters, that’s weird, and thanks for letting us know. :raised_hands:

Let us check and upload compressed versions as well ASAP. Meanwhile, you can use “Save link as” in case your browser isn’t saving the file by default.

I should have been smarter and tried “Save Link As…”, thank you :slight_smile:

Maybe it is a Chrome extension on my side :thinking:

Hi guys,

Does anyone know how to download the dataset with wget, rather than through the download button? e.g.
wget AIcrowd
?

Just click the download button and copy the link from the browser's downloads page; it should be an AWS link.
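
If you would rather script the download than use wget, the copied link can be fetched with Python's standard library. This is only a sketch: `download_file` is a hypothetical helper, and the URL placeholder must be replaced with the link you copied from your own browser.

```python
import shutil
import urllib.request


def download_file(url, destination, chunk_size=1024 * 1024):
    """Stream `url` to `destination` in chunks instead of loading it into memory."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return destination


# Hypothetical usage; paste the link copied from your browser here.
# download_file("<copied-download-link>", "product_catalogue-v0.2.csv.zip")
```

Streaming in chunks matters here because the product catalogue files are large.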

It is too slow to download.

What is the maximum number of submissions allowed for one task?

Hi @running, @good-good-study, others,

You can also use the AIcrowd CLI to download the datasets, if you prefer a terminal approach. :computer:

For listing all the files for this challenge:

❯ aicrowd dataset list --challenge esci-challenge-for-improving-product-search

                                     Datasets for challenge #1031
┌────┬──────────────────────────────────────────────────────────────────────┬─────────────┬───────────┐
│ #  │ Title                                                                │ Description │      Size │
├────┼──────────────────────────────────────────────────────────────────────┼─────────────┼───────────┤
│ 0  │ Task 1: Query Product Ranking/product_catalogue-v0.1.csv             │ -           │   1.06 GB │
│ 1  │ Task 1: Query Product Ranking/sample_submission-v0.1.csv             │ -           │   1.23 MB │
│ 2  │ Task 1: Query Product Ranking/test_public-v0.1.csv                   │ -           │ 247.81 KB │
│ 3  │ Task 1: Query Product Ranking/train-v0.1.csv                         │ -           │  42.20 MB │
│ 4  │ Task 2: Multiclass Product Classification/product_catalogue-v0.1.csv │ -           │   2.13 GB │
│ 5  │ Task 2: Multiclass Product Classification/sample_submission-v0.1.csv │ -           │   7.00 MB │
│ 6  │ Task 2: Multiclass Product Classification/test_public-v0.1.csv       │ -           │  17.86 MB │
│ 7  │ Task 2: Multiclass Product Classification/train-v0.1.csv             │ -           │  96.16 MB │
│ 8  │ Task 3: Product Substitute Identification/product_catalogue-v0.1.csv │ -           │   2.13 GB │
│ 9  │ Task 3: Product Substitute Identification/sample_submission-v0.1.csv │ -           │   8.09 MB │
│ 10 │ Task 3: Product Substitute Identification/test_public-v0.1.csv       │ -           │  17.86 MB │
│ 11 │ Task 3: Product Substitute Identification/train-v0.1.csv             │ -           │ 106.44 MB │
└────┴──────────────────────────────────────────────────────────────────────┴─────────────┴───────────┘

For downloading all the files:

❯ aicrowd dataset download --challenge esci-challenge-for-improving-product-search

For downloading selected files:

# Using wildcard or file names, below will download all the Task 1's files
❯ aicrowd dataset download --challenge esci-challenge-for-improving-product-search "Task 1*"

# Using ID given in the table during listing
❯ aicrowd dataset download --challenge esci-challenge-for-improving-product-search 1

And obviously to install AIcrowd CLI :wink:, you can do:

❯ pip install -U aicrowd-cli

Welcome to KDD Cup and hoping for your submissions soon! :rocket:


Hi @shuliang, the files are hosted on S3, so it is unlikely that there is an issue on the server side.
Can you try a different internet connection or ISP, in case yours is throttling the download speed? :cry:

We will also release compressed versions soon to make the data more accessible.
Please keep the feedback coming! :hugs:

Hi @Roundrobin, you can submit 5 submissions/task/day/team.

Hi @shivam, for Task 1 I see that query IDs are overlapping. Is this expected? See the attached screenshot for reference.
The text for query_id=0 also appears under query_id=1, and so on.
Screenshot 2022-03-29 at 4.29.08 PM


Hi @shreyansdhankhar, we are looking into it at the moment, stay tuned for the updates! :innocent:

Cool, will wait for the update then !

@shivam: for each query do we need to recommend the top-10 product ids in Task-1?

Also, on the test set for task 1 there are many duplicates
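
For anyone who wants to quantify this on their own copy, here is a small sketch that counts repeated rows. It assumes a duplicate means the same (query_id, product_id) pair appearing more than once in the unzipped Task 1 test file; the function names are illustrative.

```python
import csv
from collections import Counter


def duplicate_pairs(rows, key_fields=("query_id", "product_id")):
    """Return {key: count} for every key tuple that appears more than once."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def duplicate_pairs_in_csv(csv_path):
    """Count repeated (query_id, product_id) pairs in an unzipped CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return duplicate_pairs(csv.DictReader(handle))
```

Passing `key_fields=("query",)` instead would group rows by query text, which may help when checking whether the same text shows up under several query IDs.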


This should be addressed in the v0.2 release of the dataset.


Hi @mohanty @shivam ,
I have some questions about the timeline and rules.

  1. Entry Deadline: July 15, 2022 at 00:00:00 UTC. I’d like to know the exact meaning. Which is correct: submit by July 14 at 23:59:59, or by July 15 at 23:59:59?
  2. Can we use the other tasks’ datasets? E.g., build a model for task 1 by training on the task 1-3 datasets.
  3. Can we use external data?
  4. Can we use public pre-trained models?
  5. When the admins or the AIcrowd platform run inference on the test dataset using my submission (code and model), is there a limit on computation time?