No idea why this error is happening. I did not change much in the 3 functions, followed the template closely, and also followed the rules for submissions. I was also able to generate the 2 csv files with the test.bat file locally. Please do help me!
I’ve looked at your submission (#110888) and unfortunately it seems like your `fit_model` function is empty, so I cannot retrain your model to reproduce the problem and help you. I should mention that it is a requirement of submissions in this challenge to include a viable `fit_model` function; otherwise your submissions may, and most likely will, be disqualified.
I can see that in this instance your `trained_model.pickle` file is actually just a numpy array of prices for your training data. What happens is that we then try to use these prices on the leaderboard, which uses a different dataset of a different size, leading to the error you are observing.
If you submit a version with the `fit_model` function present and still face the same issue, then I might be able to help further by looking at the submission. Otherwise, you could try the notebook submission first to see if that works, and then iterate by copying what you put in the notebook functions into your …

If any of this is unclear, please let me know and we can look into it further.
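For reference, a viable `fit_model` returns (and the submission pickles) the fitted model object itself, not an array of predicted prices. A minimal sketch, using a toy least-squares model in place of whatever estimator you actually use (all names here are illustrative, not the competition template):

```python
import pickle

import numpy as np


class PriceModel:
    """Toy linear model standing in for your real estimator."""

    def fit(self, X, y):
        # Least squares with an intercept column appended.
        A = np.hstack([X, np.ones((X.shape[0], 1))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.hstack([X, np.ones((X.shape[0], 1))])
        return A @ self.coef_


def fit_model(X_train, y_train):
    # Return the *fitted model*, not its predictions: the leaderboard set has
    # a different number of rows, so an array of saved prices cannot be reused.
    return PriceModel().fit(X_train, y_train)


# The object that goes into trained_model.pickle is the model itself:
# with open("trained_model.pickle", "wb") as f:
#     pickle.dump(fit_model(X_train, y_train), f)
```

Because the pickle holds a model rather than prices, it can be loaded later and asked to predict on a leaderboard dataset of any size.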
Hello! Thank you so much for your kind response.
I figured out the earlier problem but now with my newest submission, I’m getting a new error
“ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 1015 is different from 511)”.
This seems to mean the model is being asked to predict on a dataset with a different number of features than it was trained on? I'm not sure how to fix this error, as I cleaned the data in my prediction function in the same way as when I trained my model.
Oh, and the reason I removed the code in my `fit_model` function is that I read somewhere that `fit_model` should not contain code that retrains my model.
Ah yes, this might be happening because you are one-hot-encoding the `vh_make_model` column without handling the case where the leaderboard data contains a value that never appeared in your training data. In such a case, the one-hot encoding produces more columns in the leaderboard data than in the training data.

If you choose to do something like this, then your model should also account for cases where you encounter a previously unseen vehicle make and model. Similarly, in the real world, insurance companies must offer a premium price for a vehicle that they may not have encountered before.
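One common fix, assuming you are using `pandas.get_dummies`, is to record the dummy columns produced at training time and reindex any new data onto that exact column set, so an unseen make/model becomes an all-zero row instead of an extra column (a sketch; the helper names are made up):

```python
import pandas as pd


def one_hot_fit(df, column="vh_make_model"):
    # Encode the training data and remember the resulting column set.
    encoded = pd.get_dummies(df, columns=[column])
    return encoded, list(encoded.columns)


def one_hot_transform(df, train_columns, column="vh_make_model"):
    # Encode new data, then force it onto the training columns: dummies for
    # unseen values are dropped, dummies missing from new data are added as 0.
    encoded = pd.get_dummies(df, columns=[column])
    return encoded.reindex(columns=train_columns, fill_value=0)
```

scikit-learn's `OneHotEncoder(handle_unknown="ignore")` achieves the same effect if you prefer to keep the encoding inside a pipeline.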
We will make an announcement regarding the `fit_model` function to make sure it’s clear to everyone.
Ah, that makes sense. That did not cross my mind at all. Will see what I can do.
Thank you so much for your time!
Hi @alfarzan. Are you able to help me check where the error is happening in my latest submission 111784? I’ve spent quite some time trying to figure it out but haven’t been able to.
I believe the issue is coming from your `label_encode` function, as everything else seems to be done properly. My impression is that you are dividing by zero somewhere, producing an infinity or a NaN. NaNs can be a little tricky in these situations, so I suggest following some of the steps recommended here and trying a few more times. If you are still unsuccessful, please reply to this thread and I will try to reproduce your error on my machine and figure it out.
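In pandas/numpy, dividing by zero usually does not raise; it silently produces `inf` (or `NaN` for 0/0), which only blows up later. A quick way to locate and neutralise such values (a sketch, assuming numeric feature columns; helper names are made up):

```python
import numpy as np
import pandas as pd


def find_bad_columns(df):
    # Columns containing inf/-inf/NaN -- typical residue of a zero denominator.
    num = df.select_dtypes(include=[np.number])
    return [c for c in num.columns if (~np.isfinite(num[c])).any()]


def clean_numeric(df, fill=0.0):
    # Turn infinities into NaN, then impute everything with a fixed value.
    return df.replace([np.inf, -np.inf], np.nan).fillna(fill)
```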
Another thing to note: I noticed that your cleaning functions are dropping some rows. In the competition, as in real life, your company is asked to provide a quote for every policy, so you cannot drop rows from the data in data cleaning (outside of your training, of course). You should instead have a condition where you offer, say, a very high price for policies that you would usually drop, effectively rejecting them.
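One way to "reject" without dropping rows is to keep every policy and overwrite the model's price with a prohibitively high one wherever your old cleaning logic would have removed the row. A hedged sketch with made-up names:

```python
import numpy as np
import pandas as pd

REJECTION_PRICE = 1_000_000.0  # arbitrary "effectively reject" premium


def predict_premium(df, model, is_quotable):
    # is_quotable(row) -> bool marks rows your cleaning used to drop.
    prices = np.full(len(df), REJECTION_PRICE)
    ok = df.apply(is_quotable, axis=1).to_numpy()
    if ok.any():
        # Only quotable rows go through the model; every row still gets a price.
        prices[ok] = model.predict(df.loc[ok])
    return prices
```

The output always has one price per input row, which is what the leaderboard evaluation requires.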
Please keep us updated
Thank you so much! Managed to fix it. The error indeed occurred because of a division by zero. I was so fixated on a NaN/NIL value existing in my dataframe that I missed it!
Understood. I originally wrote the function just for cleaning the training data, but got lazy in the end and reused it for prediction, forgetting my original intent.