Hi
I keep receiving an error when I try to add a calculated field to the cleaned x_raw dataset. Why is this?
E.g. if I do something like x_raw$calc=x_raw$claim_amount*(1+0.55) in the preprocess_X_data function I receive the following error:
Error in `$<-.data.frame`(`*tmp*`, calc, value = numeric(0)) :
  replacement has 0 rows, data has 2000
I also note that some functions like ifelse don’t work, why?
I find that I can only create new fields in the set that are constants e.g. x_raw$calc=1
You are trying to use the response variable, claim_amount, in the preprocessing, but it is not available server side.
You only get to use it for training purposes.
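The mechanics behind the error can be reproduced on a toy data frame. This is only a sketch: the three-row frame and the vh_age column are illustrative assumptions standing in for data that has no claim_amount column.

```r
# Toy stand-in for data without the response column.
# (vh_age is just an illustrative feature name.)
x_raw <- data.frame(vh_age = c(3, 7, 12))

# A missing column comes back as NULL, and arithmetic on NULL has length zero.
missing_col <- x_raw$claim_amount
scaled <- missing_col * (1 + 0.55)
is.null(missing_col)   # TRUE
length(scaled)         # 0

# Assigning a zero-length vector as a column is what raises the error.
err <- tryCatch({
  x_raw$calc <- scaled
  "no error"
}, error = function(e) conditionMessage(e))
err  # "replacement has 0 rows, data has 3"

# ifelse() on the missing column also has length zero, so it fails the same way.
length(ifelse(x_raw$claim_amount > 0, 1, 0))  # 0

# A constant works because a length-one value is recycled to every row.
x_raw$calc <- 1
```

This would also explain why constants such as x_raw$calc=1 succeed while anything built from claim_amount does not.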
But in the pre-processing I can do this: data_1$severity<-data_1$claim_amount, then use severity in the model, and it works fine.
In the first lines, you can see that the data is separated into x and y.
X = all variables, except claim_amount
Y = only claim_amount
In your first message, you expect claim_amount to be within your x_raw. But it’s not! And you will not have access to this variable when submitting your model.
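A minimal sketch of that split, assuming a toy training frame (the names train and y_raw and the feature columns are illustrative, not the starter kit's actual code):

```r
# Toy training data; column names other than claim_amount are illustrative.
train <- data.frame(
  vh_age       = c(3, 7, 12),
  drv_age      = c(25, 41, 63),
  claim_amount = c(0, 1200, 0)
)

# X: every column except the response. Y: the response only.
x_raw <- train[, setdiff(names(train), "claim_amount"), drop = FALSE]
y_raw <- train$claim_amount

names(x_raw)                      # "vh_age" "drv_age"
"claim_amount" %in% names(x_raw)  # FALSE
```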
So what I found actually works is not assigning it to the dataset in the pre-processing. E.g. use calc=x_raw$claim_amount*(1+0.55); then you have the numeric vector calc and can reference it in the modelling that follows.
Hi @kamil_singh
Yes, as @michael_bordeleau has said, this is only possible in the notebook environment because your training dataset contains the claim_amount column.
When you submit your model to our servers, the data that is fed to the functions predict_expected_claim and predict_premium does not include that column at all. The CSV file we read that data from does not include claim_amount, so your function has no way of creating a variable from the claim_amount column.
To reiterate, this only works on colab because your input data includes this claim_amount column.
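One way to keep the preprocessing independent of claim_amount is to derive new fields from feature columns only. The function body below is a hypothetical sketch, not the starter kit's implementation, and vh_age and vh_is_old are made-up names.

```r
# Hypothetical preprocessing step: derives new fields from feature columns only,
# so it behaves the same with or without claim_amount in the input.
preprocess_X_data <- function(x_raw) {
  # Drop the response if it happens to be present (training data in the notebook).
  x_clean <- x_raw[, setdiff(names(x_raw), "claim_amount"), drop = FALSE]
  # vh_age is an illustrative feature name, not taken from the thread.
  x_clean$vh_is_old <- ifelse(x_clean$vh_age > 10, 1, 0)
  x_clean
}

notebook_data <- data.frame(vh_age = c(3, 12), claim_amount = c(0, 500))
server_data   <- data.frame(vh_age = c(3, 12))  # no claim_amount, as on the server

# Both inputs should give the same preprocessed result.
same <- isTRUE(all.equal(preprocess_X_data(notebook_data),
                         preprocess_X_data(server_data)))
same
```

Note that ifelse works here because it is applied to a column that exists in both inputs.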
I see what you mean, thank you @alfarzan and @michael_bordeleau. I want to manipulate all input datasets using the preprocess_X_data function, not just the training set. I am able to do that if calculated fields that use variables found only in the training set are kept as separate vectors instead of being assigned to the dataset. If that makes any sense.
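A sketch of the vector approach, assuming the response is held in a separate y_raw vector during training (the data, the vh_age column, and the lm model are all illustrative placeholders):

```r
# Toy training data; vh_age is an illustrative feature name.
x_raw <- data.frame(vh_age = c(3, 7, 12, 5))
y_raw <- c(0, 1200, 0, 800)   # claim_amount, available only when training

# Target-derived quantity kept as a free-standing vector, not a column of x_raw.
calc <- y_raw * (1 + 0.55)

# Use it as the response when fitting; the fitted model needs only features.
fit <- lm(calc ~ vh_age, data = x_raw)

# Prediction on server-shaped data (no claim_amount) still works.
new_data <- data.frame(vh_age = c(4, 10))
preds <- predict(fit, newdata = new_data)
length(preds)  # 2
```

The vector only carries values where claim_amount actually exists, so it is best built in the training code rather than inside a function that also runs on the server.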
Yes I think that should work. The best way to check is to just try a submission with a skeleton of what you are hoping to use and you’ll see errors pop up.
I do encourage you to get a working submission before heavy modelling, so that you don’t have to go back and rework your code after model complexity has already made it inflexible.
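Before submitting, a rough local check can mimic the server by dropping claim_amount and running the skeleton end to end. The function bodies and signatures below are hypothetical stand-ins and may differ from the real submission template; the data and model are placeholders.

```r
# Hypothetical stand-ins for the submission functions; real signatures may differ.
preprocess_X_data <- function(x_raw) {
  x_raw$vh_is_old <- as.numeric(x_raw$vh_age > 10)  # vh_age is illustrative
  x_raw
}
predict_expected_claim <- function(model, x_raw) {
  predict(model, newdata = preprocess_X_data(x_raw))
}

train <- data.frame(vh_age = c(3, 7, 12, 5, 15),
                    claim_amount = c(0, 1200, 0, 800, 300))

# Fit a placeholder model on preprocessed features.
x_train <- preprocess_X_data(train[, "vh_age", drop = FALSE])
y_train <- train$claim_amount
model <- lm(y_train ~ vh_age + vh_is_old, data = x_train)

# Mimic the server: drop claim_amount before calling the prediction function.
server_like <- train[, setdiff(names(train), "claim_amount"), drop = FALSE]
preds <- predict_expected_claim(model, server_like)

# Fails loudly if the pipeline secretly depended on claim_amount.
stopifnot(length(preds) == nrow(server_like), !anyNA(preds))
```

A check like this does not replace a real test submission, but it catches a dependency on claim_amount without waiting for server-side errors.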