Error adding a calculated field to the cleaned dataset

Hi

I keep receiving an error when I try to add a calculated field to the cleaned x_raw dataset. Why is this?
E.g. if I do something like x_raw$calc=x_raw$claim_amount*(1+0.55) in the preprocess_X_data function I receive the following error:
Error in `$<-.data.frame`(`*tmp*`, calc, value = numeric(0)) :
  replacement has 0 rows, data has 2000

I also note that some functions like ifelse don’t work, why?

I find that I can only create new fields in the set that are constants e.g. x_raw$calc=1

You are trying to use the response variable, claim_amount, in the preprocessing, but it is not available server side.

You only get to use it for training purposes.
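To see why the error mentions 0 rows, here is a minimal sketch that reproduces it. It assumes a made-up data frame: only the 2000-row size and the name claim_amount come from this thread, and vh_age is a hypothetical feature column. It also shows why ifelse appears not to work and why a constant does.

```r
# Hypothetical stand-in for the server-side x_raw: 2000 rows, no claim_amount.
x_raw <- data.frame(vh_age = rep(5, 2000))

# `$` on a missing column returns NULL rather than raising an error.
missing_col <- x_raw$claim_amount
is.null(missing_col)            # TRUE

# Arithmetic on NULL gives a zero-length numeric vector.
calc <- missing_col * (1 + 0.55)
length(calc)                    # 0

# Assigning a zero-length vector to a 2000-row data frame fails.
result <- tryCatch({
  x_raw$calc <- calc
  "ok"
}, error = function(e) conditionMessage(e))
print(result)

# ifelse() on the missing column also returns a zero-length result,
# so assigning it to the data frame would fail in the same way.
flag <- ifelse(x_raw$claim_amount > 0, 1, 0)
length(flag)                    # 0

# A constant works because a length-1 value is recycled to every row.
x_raw$const <- 1
```

So the assignment itself is not the problem: the right-hand side is empty because the column it reads from does not exist.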


But in the pre-processing I can do this: data_1$severity<-data_1$claim_amount and use severity in the model, and it works fine.

In the first lines, you can see that the data is separated into x and y.

X = all variables, except claim_amount
Y = only claim_amount

In your first message, you expect claim_amount to be in your x_raw. But it’s not! And you will not have access to this variable when submitting your model.
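As a rough illustration of that split, here is a sketch using a tiny made-up training set. The names train, vh_age and y_raw are hypothetical; only claim_amount and x_raw are names from this thread, and the real notebook may do the split differently.

```r
# Hypothetical training data; only claim_amount is a name from the thread.
train <- data.frame(
  vh_age = c(3, 7, 12),
  claim_amount = c(0, 250.5, 0)
)

# X: every column except the response (drop = FALSE keeps it a data frame).
x_raw <- train[, names(train) != "claim_amount", drop = FALSE]

# Y: the response only.
y_raw <- train$claim_amount

names(x_raw)                        # "vh_age"
"claim_amount" %in% names(x_raw)    # FALSE
```

Anything computed from claim_amount therefore has to come from the Y side, which exists only during training.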

So what I found actually works is not assigning it to the dataset in the pre-processing. E.g. use calc=x_raw$claim_amount*(1+0.55); then you have the numeric vector calc and can reference it in the modelling that follows.
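One caveat worth checking with the standalone vector: it avoids the assignment error, but when the column is absent the vector is simply empty. The sketch below compares the two cases. The helper make_calc and both data frames are hypothetical, made up for illustration.

```r
# Hypothetical helper wrapping the calculation from the post above.
make_calc <- function(x_raw) x_raw$claim_amount * (1 + 0.55)

# Training-like input includes the response; server-like input does not.
train_like  <- data.frame(vh_age = c(3, 7), claim_amount = c(100, 200))
server_like <- data.frame(vh_age = c(3, 7))

length(make_calc(train_like))   # 2
length(make_calc(server_like))  # 0
```

No error is raised in the second case, so any later step that relies on calc would need to cope with a zero-length vector.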

Hi @kamil_singh

Yes as @michael_bordeleau has said, this is only possible in the notebook environment because your training dataset contains the claim_amount column.

When you submit your model to our servers, the data that is fed to the functions predict_expected_claim and predict_premium does not include that column at all. The CSV file we read that data from does not include claim_amount, so your function has no way of creating a variable from the claim_amount column.

To reiterate, this only works on Colab because your input data includes the claim_amount column.
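One way to catch this before submitting is to mimic the server-side input locally. The sketch below is only an approximation under stated assumptions: the training data, the vh_age column and the body of preprocess_X_data are all made up, and the real server pipeline may differ. It writes the features to a CSV without claim_amount, reads them back, and runs the preprocessing on the result.

```r
# Hypothetical training data and preprocessing; only the names claim_amount
# and preprocess_X_data come from the thread.
train <- data.frame(vh_age = c(3, 7, 12), claim_amount = c(0, 250.5, 0))

preprocess_X_data <- function(x_raw) {
  # Uses feature columns only, so it behaves the same with or without the response.
  x_raw$vh_age_sq <- x_raw$vh_age^2
  x_raw
}

# Write the features without the response, then read them back,
# which mimics input that never contained claim_amount.
csv_path <- tempfile(fileext = ".csv")
write.csv(train[, names(train) != "claim_amount", drop = FALSE],
          csv_path, row.names = FALSE)
server_like <- read.csv(csv_path)

# Run the preprocessing and report any error instead of stopping.
outcome <- tryCatch(
  preprocess_X_data(server_like),
  error = function(e) paste("preprocessing failed:", conditionMessage(e))
)
print(outcome)
```

If the preprocessing referenced claim_amount when assigning a new column, outcome would hold the error message rather than a data frame.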

I see what you mean, thank you @alfarzan and @michael_bordeleau. I want to manipulate all input datasets using the preprocess_X_data function, not just the training set. I am able to do that if calculated fields that use training-only variables are kept in a standalone vector instead of being assigned to the dataset, if that makes sense.


Yes, I think that should work. The best way to check is to try a submission with a skeleton of what you are hoping to use, and you’ll see any errors pop up.

I do encourage you to get a working submission before doing any heavy modelling, so that you don’t have to go back and rework your code once model complexity has made it inflexible.