The ‘product_id’ field is not unique in the product catalogue, so merging ‘train-v0.1.csv’ with ‘product_catalogue-v0.1.csv’ produces more rows than the original ‘train-v0.1.csv’ contains. A simple analysis shows that a ‘product_id’ is repeated at most 3 times. I assume that the same product uses the same ID in three languages.
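A check like this can be reproduced with a few lines of pandas. This is only a minimal sketch on a tiny made-up catalogue: the rows are illustrative and not taken from the real file, and only the column names ‘product_id’ and ‘product_locale’ come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for product_catalogue-v0.1.csv; the rows are made up.
catalogue = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "product_locale": ["us", "jp", "es", "us", "us", "es"],
})

# How many times does each product_id appear in the catalogue?
id_counts = catalogue["product_id"].value_counts()
max_repeats = int(id_counts.max())

# Is the (product_id, product_locale) pair unique even though product_id is not?
pair_is_unique = not catalogue.duplicated(["product_id", "product_locale"]).any()

print(max_repeats, pair_is_unique)
```

If the pair turns out to be unique on the real file, that would support the guess that the repeats come from the same ID being reused across locales.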
@mohanty, this problem means that a single row of training data can be combined with up to three products, which is obviously unreasonable. At the same time, my test file grows to 402953 lines. I don’t know what to do.
I also had this problem, but after checking I found that there is one entry for each locale: us, jp and es.
Exactly, @LYZD-fintech, you are right: ‘product_catalogue-v0.1.csv’ can contain the same ‘product_id’ under different ‘product_locale’ values. Therefore, given a row of ‘train-v0.1.csv’, you must merge it with ‘product_catalogue-v0.1.csv’ on both the product_id and the locale. Below is an example of how you can do it:
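import pandas as pd

# Left-join the catalogue onto the training set, matching on locale and product_id
df = pd.merge(
    pd.read_csv('train-v0.1.csv'),
    pd.read_csv('product_catalogue-v0.1.csv'),
    how='left',
    left_on=['query_locale', 'product_id'],
    right_on=['product_locale', 'product_id']
)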
Each product_id can have different descriptions etc. depending on the product_locale. So please use the product_locale to match with the query_locale when trying to map the product_catalogue to the training set or the test set.
Best,
Mohanty
Thanks, your code solved my problem!