I have a couple of questions on the RMSE and what exactly the data looks like.
Do the policies in this leaderboard exist in the training data?
It is said that: “Your model makes predictions for Year 2 with access to data from years 1 - 2.”
If the policies were part of the training data I could see how this could be true, but if they're not, I'm not sure how that would work, as the predict_expected_claim function only takes two arguments: trained_model and X_train.
Also, given that our model is trained on 4 years of data, wouldn't it be true to say that our models are making predictions for year-2 policies with access to 4 years of claim info, and that we need to decide ourselves exactly what info to exclude when rating year-2 policies?
The way we generated predictions to compute the RMSE leaderboard scores is a stylised version of how a model in the real world might do so: it is given access to some past data to predict the present.
To compute your total RMSE for the 4 years of data in the leaderboard, we need to make predictions for years 1 - 4 and then plug that into the standard RMSE formula. However, we don’t want you to be able to look at the features of a contract in the future (years 2, 3, 4) to make predictions about the past (years 1, 2, 3). So when we compute your predictions for year X we only give you data about years X and prior.
In other words, when we say that your model makes predictions for Year 2 with access to data from years 1 - 2, it means that when your model is generating predictions for year 2, it only has access to data from years 1 and 2. We then gather predictions for the remaining years using the same principle and finally compute your RMSE score.
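Concretely, you can think of it as four nested slices of the leaderboard data, each predicted independently. A minimal sketch of the idea, assuming a single dataframe leaderboard_df with a year column and the predict_expected_claim(trained_model, X_raw) function discussed in this thread (the variable names are only for illustration):

# Build the four nested views: years 1-1, 1-2, 1-3 and 1-4.
datasets = [leaderboard_df[leaderboard_df.year <= y] for y in range(1, 5)]

# Predict each slice on its own, so the prediction for year X only ever
# sees rows from years 1..X. Only the final year of each slice is scored.
predicted_claims = [predict_expected_claim(trained_model, df) for df in datasets]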
Your specific questions
Intersection of training and leaderboard data. None of the policy IDs in the RMSE leaderboard data are present in the training data.
How are predictions made. You asked how predict_expected_claim works in this setting. The answer is that the X_raw given to your model does not contain 4 years of history; it contains just the right amount. Remember that the final test data will only contain 1 year of data (year 5)! The reason it's done this way is that you might otherwise build a model that looks at whether a car is severely devalued in the future to understand what happened in its past. That's what we're trying to prevent with this year-by-year method of RMSE computation.
Model trained on 4 years used to predict less. You are correct that your model will have had access to 4 years of data while training, and this will show when predicting year 4 of the data. However, since these models are supposed to be designed to generate predictions on a contract-by-contract basis, this should not be too problematic.
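To make the contract-by-contract point concrete, here is one possible shape of such a prediction function: a minimal sketch assuming a fitted scikit-learn-style estimator stored in trained_model and id_policy / year identifier columns (both assumptions, not the required design). Because each row is scored independently, it does not matter whether X_raw holds two years of leaderboard data or a single year-5 contract.

def predict_expected_claim(trained_model, X_raw):
    # Hypothetical sketch: score each contract-row from its own features only,
    # so the amount of history present in X_raw never changes an individual
    # contract's prediction.
    features = X_raw.drop(columns=["id_policy", "year"], errors="ignore")
    return trained_model.predict(features)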
I hope this clarifies things! If not, please respond here and we'll discuss it further.
I think what was throwing me off is that I thought for year 2 we'd have the claim cost for year 1 as part of the 'data from years 1 - 2', and I wasn't sure if we could incorporate that.
Yes, just to clarify: you don't have access to the claim column in any of the leaderboard datasets. Mentioning this in case your model is expecting such a column.
I think it would be great if we could have the claim amount column (from year 1 to t-1 if we are predicting year t, with the year t value as NA) as part of the leaderboard dataset, as this is quite a significant feature.
Apart from that, you mentioned that the final test dataset only contains year 5 data. I think some of the policy IDs are in our training dataset, so we are actually able to make use of the historical claim amount column as a feature. Is my understanding correct?
The main reason we don’t do this is that the purpose of the leaderboards is to have a dataset as similar to the test data as possible. The test data does not have the claim_amount in it as it is a simulation of incoming policies. Anything known about the past of the policies should be contained within the pricing model.
So in other words, this leaderboard is trying to provide the opportunity to make sure that:
- the risk modelling is done well
- the model is ready to be placed in the weekly markets
Having said that, there are other strong signals in the data regarding accidents in previous years that can be used.
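If you do want to exploit a policy's own past claims where that policy happens to be known at training time, one way to keep that knowledge "within the pricing model" is to store a per-policy claim summary at fit time and join it back on at prediction time, with a neutral fallback for unseen policies (such as those in the leaderboard). A rough sketch, assuming the training data has id_policy and claim_amount columns (the column and helper names here are assumptions, not part of the official template):

import pandas as pd

def fit_claim_history(X_train, y_claims):
    """Summarise each policy's observed claims during training."""
    history = pd.DataFrame({
        "id_policy": X_train["id_policy"].values,
        "claim_amount": y_claims,
    })
    return (history.groupby("id_policy")["claim_amount"]
                   .agg(past_claim_total="sum", past_claim_count="count")
                   .reset_index())

def add_claim_history(X_raw, claim_history):
    """Attach the stored summary; policies never seen in training get zeros."""
    merged = X_raw.merge(claim_history, on="id_policy", how="left")
    return merged.fillna({"past_claim_total": 0.0, "past_claim_count": 0.0})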
import numpy as np

def evaluate_rmse(datasets, predicted_claims):
    """Evaluate the RMSE for a given model, by considering 4 datasets
    independently. This circumvents the possibility of using future
    data to predict the past. For this reason, we treat datasets of years
    1-1, 1-2, 1-3, 1-4, and evaluate the error on the last year in each
    dataset. This allows models to use temporal information, but without
    looking into the future.

    Inputs:
    - datasets: list of 4 dataframes with years 1-1, 1-2, 1-3, 1-4
    - predicted_claims: list of 4 arrays with prices generated for
                        the final year by the model given each
                        corresponding dataframe in datasets

    Returns:
    - float corresponding to root mean squared error
    """
    sum_squared_error = 0
    n_records = 0
    for year, (df, predicted) in enumerate(zip(datasets, predicted_claims)):
        # Score only the rows from the final year of this slice.
        _filter = df.year == (year + 1)
        error = predicted[_filter] - df.claim_amount[_filter]
        sum_squared_error += (error ** 2).sum()
        n_records += _filter.sum()  # Number of records scored for this slice.
    # Convert the accumulated squared error into a single RMSE.
    return np.sqrt(sum_squared_error / n_records)
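For completeness, here is a hedged example of how the function above could be invoked to replicate the score locally. Note that the leaderboard files do not include claim_amount, so this only works on data where the claims are known (for instance, a held-out part of your training data); training_df and the other names follow the earlier sketch and are assumptions.

# Replicate the evaluation on data where claim_amount is available.
datasets = [training_df[training_df.year <= y] for y in range(1, 5)]
predicted_claims = [predict_expected_claim(trained_model, df) for df in datasets]

# Each predictions array is row-aligned with its dataframe; evaluate_rmse
# itself selects the final year of each slice before scoring.
print(evaluate_rmse(datasets, predicted_claims))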