Hi @Robert_Stefanazzi
**How the RMSE leaderboard works**
The way we generate predictions to compute the RMSE leaderboard scores is a stylised version of how a model would operate in the real world: it is given access to past data in order to predict the present.
To compute your total RMSE for the 4 years of data in the leaderboard, we make predictions for years 1 to 4 and plug them into the standard RMSE formula. However, we don't want you to be able to look at a contract's features in the future (years 2, 3, 4) to make predictions about the past (years 1, 2, 3). So when we compute your predictions for year X, we only give you data about year X and prior years.
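For reference, by the standard RMSE formula we mean the textbook definition, where $\hat{y}_i$ is your predicted claim amount for contract-year $i$, $y_i$ the observed amount, and $N$ the total number of predictions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$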
Concretely: when your model is generating predictions for year 2, it only has access to data from years 1 and 2. We then gather predictions for the remaining years using the same principle and finally compute your RMSE score.
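Here is a minimal sketch of that evaluation loop. The `year` and `claim_amount` column names and the exact `predict_expected_claim` call are assumptions for illustration, not the actual leaderboard code:

```python
import numpy as np
import pandas as pd

def yearly_rmse(model, data: pd.DataFrame) -> float:
    """Sketch of the year-by-year RMSE computation.

    Assumes `data` has a `year` column (1-4) and a `claim_amount`
    target column, and that `model.predict_expected_claim` accepts a
    raw feature DataFrame -- all hypothetical names for illustration.
    """
    all_preds, all_truth = [], []
    for year in sorted(data["year"].unique()):
        # The model only ever sees year X and earlier ...
        visible = data[data["year"] <= year]
        preds = model.predict_expected_claim(visible.drop(columns="claim_amount"))
        # ... but is only scored on its predictions for year X itself.
        mask = (visible["year"] == year).to_numpy()
        all_preds.append(np.asarray(preds)[mask])
        all_truth.append(visible.loc[mask, "claim_amount"].to_numpy())
    preds = np.concatenate(all_preds)
    truth = np.concatenate(all_truth)
    return float(np.sqrt(np.mean((preds - truth) ** 2)))
```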
**Your specific questions**
- **Intersection of training and leaderboard data.** None of the policy IDs in the RMSE leaderboard data are present in the training data.
- **How predictions are made.** You asked how `predict_expected_claim` works in this setting. The answer is that the `X_raw` given to your model does not contain 4 years of history; it contains just the right amount. Remember that the final test data will only contain 1 year of data (year 5)! The reason it's done this way is that you might otherwise build a model that looks at whether a car is severely devalued in the future to work out what happened in its past. That's what we're trying to prevent with this year-by-year method of RMSE computation.
- **A model trained on 4 years used to predict fewer.** You are correct that your model will have had access to 4 years of data while training, and this will show when predicting year 4 of the data. However, since these models are supposed to be designed to generate predictions on a contract-by-contract basis, this should not be too problematic (see the sketch after this list).
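To make that last point concrete, here is a minimal sketch of a model whose `predict_expected_claim` treats every contract-year row independently, so it does not matter whether `X_raw` contains 4 years (leaderboard) or 1 year (final test). The class name, feature columns, and choice of estimator are all hypothetical:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

class PerContractModel:
    """Sketch: scores each contract-year row independently, so the
    number of years present in X_raw never matters."""

    FEATURES = ["vh_age", "vh_value", "drv_age"]  # hypothetical feature columns

    def __init__(self):
        self.estimator = GradientBoostingRegressor()

    def fit(self, X_raw: pd.DataFrame, y_raw: pd.Series):
        self.estimator.fit(X_raw[self.FEATURES], y_raw)
        return self

    def predict_expected_claim(self, X_raw: pd.DataFrame) -> pd.Series:
        # One prediction per row, using only that row's own features:
        # no cross-year lookups, so year 5 alone works just as well.
        return pd.Series(self.estimator.predict(X_raw[self.FEATURES]),
                         index=X_raw.index)
```

Because each row is scored from its own features alone, the same model behaves identically whether it is handed years 1 and 2, years 1 to 4, or year 5 on its own.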
I hope this clarifies things!
If not, please respond here and we'll discuss it further.