The ‘product_id’ field is not unique in the product table, so merging ‘train-v0.1.csv’ with ‘product_catalogue-v0.1.csv’ on ‘product_id’ alone produces more rows than the original ‘train-v0.1.csv’ has. A quick analysis shows that a ‘product_id’ is repeated at most 3 times. My assumption is that the same product uses the same ID in three languages.
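A minimal sketch of the kind of check described above, assuming pandas. It uses a small made-up table in place of ‘product_catalogue-v0.1.csv’; only the column names product_id and product_locale come from this thread, while the product_title column and all values are hypothetical.

```python
import pandas as pd

# Toy stand-in for product_catalogue-v0.1.csv (values are invented).
catalogue = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "product_locale": ["us", "jp", "es", "us", "us", "es"],
    "product_title": ["t1", "t2", "t3", "t4", "t5", "t6"],  # hypothetical column
})

# How often does the most frequent product_id occur?
max_repeats = int(catalogue["product_id"].value_counts().max())

# Is product_id a unique key on its own?
id_is_unique = catalogue["product_id"].is_unique

# Is the pair (product_id, product_locale) a unique key?
pair_is_unique = not catalogue.duplicated(["product_id", "product_locale"]).any()

print(max_repeats, id_is_unique, pair_is_unique)  # 3 False True
```

If the same two checks on the real catalogue show that the pair is unique while product_id alone is not, that would support the guess that the repeats come from the locale.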
@mohanty, this means a single training row can be matched with up to three catalogue products, which is obviously unreasonable. At the same time, my test file grows to 402953 rows after the merge. I don’t know what to do.
I also had this problem, but after checking I found that the duplicates are per locale: us, jp and es.
Exactly, @Guan_Yu_Hang, you are right: ‘product_catalogue-v0.1.csv’ can contain the same ‘product_id’ under different ‘product_locale’ values. Therefore, given a row of ‘train-v0.1.csv’, you must merge it with ‘product_catalogue-v0.1.csv’ on both the product_id and the locale. Here is an example of how you can do it:
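import pandas as pd

# Left-join the catalogue onto the training set, matching on locale and product ID
df = pd.merge(
    pd.read_csv('train-v0.1.csv'),
    pd.read_csv('product_catalogue-v0.1.csv'),
    how='left',
    left_on=['query_locale', 'product_id'],
    right_on=['product_locale', 'product_id'],
)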
product_id can have different descriptions etc. depending on the product_locale. So please use the product_locale to match with the query_locale when trying to map the product_catalogue to the training set or the test set.
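To illustrate the point about matching locales, here is a small sketch on made-up data, assuming pandas. The query and product_title columns and all values are hypothetical; the key columns are the ones named in this thread. It compares a merge on product_id alone with a merge on locale plus product_id, and uses the `validate="many_to_one"` option of `merge` so pandas raises an error if the catalogue keys are not unique.

```python
import pandas as pd

# Toy stand-ins for train-v0.1.csv and product_catalogue-v0.1.csv (values invented).
train = pd.DataFrame({
    "query": ["q1", "q2", "q3"],  # hypothetical column
    "query_locale": ["us", "jp", "es"],
    "product_id": ["A1", "A1", "C3"],
})
catalogue = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "C3"],
    "product_locale": ["us", "jp", "es", "es"],
    "product_title": ["t_us", "t_jp", "t_es", "c_es"],  # hypothetical column
})

# Merging on product_id alone duplicates every row whose ID exists in several locales.
naive = train.merge(catalogue, on="product_id", how="left")

# Merging on locale and product_id keeps one catalogue row per training row.
# validate="many_to_one" fails loudly if the right-hand keys are not unique.
merged = train.merge(
    catalogue,
    how="left",
    left_on=["query_locale", "product_id"],
    right_on=["product_locale", "product_id"],
    validate="many_to_one",
)

print(len(train), len(naive), len(merged))  # 3 7 3
```

Comparing `len(merged)` with `len(train)` after the merge is a cheap way to confirm that no rows were duplicated.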
Thanks, your code solved my problem!