The 'product_id' field is not unique in the product table

LYZD-fintech · March 29, 2022, 4:36pm

The ‘product_id’ field is not unique in the product table, so merging ‘train-v0.1.csv’ with ‘product_catalogue-v0.1.csv’ produces more rows than the original ‘train-v0.1.csv’ contains. A simple analysis shows that a ‘product_id’ is repeated at most 3 times. I assume the same product uses the same ID in three languages.
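
For readers who want to reproduce the symptom, here is a minimal sketch on toy data. The two small DataFrames are hypothetical stand-ins for the CSV files, and the ‘product_title’ column and all values are assumptions made up for illustration; only ‘product_id’, ‘product_locale’ and ‘query_locale’ are named in this thread.

```python
import pandas as pd

# Toy stand-ins for train-v0.1.csv and product_catalogue-v0.1.csv
# (hypothetical values, not taken from the real files).
train = pd.DataFrame({
    "query_locale": ["us", "jp"],
    "product_id": ["A1", "A1"],
})
catalogue = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "B2"],
    "product_locale": ["us", "jp", "es", "us"],
    "product_title": ["title us", "title jp", "title es", "other"],
})

# Largest number of catalogue rows that share one product_id.
max_repeats = catalogue["product_id"].value_counts().max()

# Merging on product_id alone matches every locale's copy of the product,
# so each training row is duplicated once per matching catalogue row.
inflated = train.merge(catalogue, on="product_id", how="left")

print(max_repeats, len(train), len(inflated))
```

In this toy case the merged frame has more rows than the training frame, which is the same effect described above.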

LYZD-fintech · March 29, 2022, 4:45pm

@mohanty, this problem means that a single training row can be matched with up to three products in the merged data, which is obviously unreasonable. My test file also grows to 402953 lines. I don’t know what to do.

hrishabh_upadhyay · March 30, 2022, 10:27am

I also had this problem, but after checking I found that there is one entry for each locale: us, jp and es.
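
One way to check this, as a rough sketch: test whether ‘product_id’ alone is unique, and then whether the pair (‘product_id’, ‘product_locale’) is unique. The catalogue below is toy data with made-up values, so the result only illustrates the check and says nothing about the real file.

```python
import pandas as pd

# Toy catalogue (hypothetical values); the real one would come from
# pd.read_csv('product_catalogue-v0.1.csv').
catalogue = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "B2"],
    "product_locale": ["us", "jp", "es", "us"],
})

# Is product_id unique on its own?
id_is_unique = catalogue["product_id"].is_unique

# Is the (product_id, product_locale) pair unique?
pair_is_unique = not catalogue.duplicated(
    subset=["product_id", "product_locale"]
).any()

print(id_is_unique, pair_is_unique)
```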

fvalero · March 30, 2022, 10:28am

Exactly, @LYZD-fintech, you are right. ‘product_catalogue-v0.1.csv’ can contain the same ‘product_id’ under different ‘product_locale’ values, so each row of ‘train-v0.1.csv’ must be merged with ‘product_catalogue-v0.1.csv’ on both the product ID and the locale. Here is an example of how you can do it:

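import pandas as pd

# Left-join the catalogue onto the training set, matching on locale and product ID.
df = pd.merge(
    pd.read_csv('train-v0.1.csv'),
    pd.read_csv('product_catalogue-v0.1.csv'),
    how='left',
    left_on=['query_locale', 'product_id'],
    right_on=['product_locale', 'product_id']
)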
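
As an optional safeguard, the same merge can be made to fail loudly if the catalogue ever contains duplicate (locale, product ID) pairs. The sketch below uses the `validate` argument of `pd.merge` on toy data; the DataFrames, their values and the ‘product_title’ column are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical values).
train = pd.DataFrame({
    "query_locale": ["us", "jp"],
    "product_id": ["A1", "A1"],
})
catalogue = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "B2"],
    "product_locale": ["us", "jp", "es", "us"],
    "product_title": ["title us", "title jp", "title es", "other"],
})

# validate="many_to_one" raises pandas.errors.MergeError if the right-hand
# keys (product_locale, product_id) are not unique in the catalogue.
merged = pd.merge(
    train,
    catalogue,
    how="left",
    left_on=["query_locale", "product_id"],
    right_on=["product_locale", "product_id"],
    validate="many_to_one",
)

# A left join against unique keys keeps exactly one row per training row.
assert len(merged) == len(train)
```

With this check in place, the row count of the merged frame should match the original training file.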

mohanty · April 1, 2022, 3:46pm

Each product_id can have different descriptions, etc., depending on the product_locale. So please match the product_locale with the query_locale when mapping the product_catalogue to the training set or the test set.

More Details

Best,
Mohanty

LYZD-fintech · April 1, 2022, 3:54pm

Thanks, your code solved my problem!