The full traceback of your error is copied below. It seems that the main issue is the NameError you are getting at the last line.
PatsyError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/statsmodels/base/model.py in predict(self, exog, transform, *args, **kwargs)
1019 '\n\nThe original error message returned by patsy is:\n'
1020 '{0}'.format(str(str(exc))))
-> 1021 raise exc.__class__(msg)
1022 if orig_exog_len > len(exog) and not is_dict:
1023 import warnings
PatsyError: predict requires that you use a DataFrame when predicting from a model
that was created using the formula api.
The original error message returned by patsy is:
Error evaluating factor: NameError: no data named 'pol_duration:[0, 5)' found
I guess the error is due to the fact that I use pd.get_dummies() to generate my X training data columns in the preprocess_X_data(X_raw) function.
I fit a model on the whole training dataset with the corresponding generated dummy columns, and no problem occurs on the whole dataset.
But I guess the error occurs in the final RMSE evaluation: if preprocess_X_data(X_raw) is called on only part of the training data, it is likely that some of the dummy columns my model was trained with are missing from that subset…
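A toy reproduction of what I suspect is happening. The data is invented for illustration, only the column name 'pol_duration:[0, 5)' comes from the traceback, and the use of prefix_sep=":" is an assumption about how that name was produced:

```python
import pandas as pd

# Invented toy data: one categorical column stored as plain strings.
train = pd.DataFrame({"pol_duration": ["[0, 5)", "[5, 10)", "[10, 15)"]})
subset = train.iloc[1:]  # this subset has no "[0, 5)" rows

# get_dummies only creates columns for the values it actually sees.
full_cols = pd.get_dummies(train, prefix_sep=":").columns
sub_cols = pd.get_dummies(subset, prefix_sep=":").columns

# Columns the model was trained on but that the subset does not produce.
missing = sorted(set(full_cols) - set(sub_cols))
print(missing)
```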
Quite a typical error…
Could you please confirm that in the RMSE evaluation process, the function preprocess_X_data(X_raw) is indeed called on a part of the training dataset?
Anyway, I think I should correct the way I generate my dummy columns so that it also works with only part of the training data.
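One possible way to do that, as a minimal sketch: remember the dummy columns seen at training time and reindex any later data to them. The helper name make_dummies and the train_columns argument are hypothetical, not part of the starter kit, and the toy data is invented:

```python
import pandas as pd

def make_dummies(X_raw, train_columns=None):
    """Hypothetical helper: one-hot encode, then align to the training columns."""
    X = pd.get_dummies(X_raw, prefix_sep=":", dtype=int)
    if train_columns is None:
        return X  # training time: keep whatever columns appear
    # Prediction time: add missing dummy columns as 0, drop unseen ones.
    return X.reindex(columns=train_columns, fill_value=0)

# Invented toy data for illustration.
train = pd.DataFrame({"pol_duration": ["[0, 5)", "[5, 10)", "[10, 15)"]})
X_train = make_dummies(train)
train_columns = X_train.columns  # would need to be saved alongside the model

# A subset without any "[0, 5)" rows still gets the full column set.
X_subset = make_dummies(train.iloc[1:], train_columns)
print(list(X_subset.columns) == list(train_columns))
```

The saved column list has to travel with the trained model, otherwise the prediction step has nothing to align against.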
@sai_krithik: Yes, I was able to make a successful submission with the provided example notebook.
Within any of the evaluation processes, your functions are called as is. That means that if you call preprocess_X_data(X_raw) early in a function definition, it will run.