How the RMSE leaderboard works
The way we generated predictions to compute the RMSE leaderboard scores is a stylised version of how a model in the real world might do so: it is given access to some past data to predict the present.
To compute your total RMSE for the 4 years of data in the leaderboard, we need to make predictions for years 1 to 4 and then plug them into the standard RMSE formula. However, we don’t want you to be able to look at the features of a contract in the future (years 2, 3, 4) to make predictions about the past (years 1, 2, 3). So when we compute your predictions for year X, we only give your model data about year X and the years before it.
In other words, when we say:
It means that when your model is generating predictions for year 2, it only has access to years 1 and 2. We then gather predictions for the remaining years using the same principle and finally compute your RMSE score.
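To make the principle concrete, here is a minimal sketch of a year-by-year RMSE loop. It is an illustration only, not the actual leaderboard code: the function name `yearly_rmse`, the column names `id_policy`, `year` and `claim_amount`, and the `predict` callable are all assumptions made for the example, as is the choice to keep only the predictions for the current year from each call.

```python
import math

def yearly_rmse(rows, predict, years=(1, 2, 3, 4)):
    """Hypothetical sketch of a year-by-year RMSE computation.

    rows    : list of dicts with (assumed) keys 'id_policy', 'year',
              'claim_amount', plus any feature columns.
    predict : callable taking a list of feature rows and returning
              one prediction per row, in the same order.
    """
    # Look up the true claim for each (policy, year) pair.
    actual_by_key = {(r["id_policy"], r["year"]): r["claim_amount"] for r in rows}

    squared_errors = []
    for year in years:
        # The model only sees rows from this year and earlier,
        # with the target column removed.
        visible = [
            {k: v for k, v in r.items() if k != "claim_amount"}
            for r in rows
            if r["year"] <= year
        ]
        preds = predict(visible)

        # Keep only the predictions made for the current year.
        for features, pred in zip(visible, preds):
            if features["year"] == year:
                actual = actual_by_key[(features["id_policy"], year)]
                squared_errors.append((pred - actual) ** 2)

    # Standard RMSE over all four years of predictions.
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

With this structure, a model asked about year 2 cannot look at a policy's year 3 or year 4 rows, because they are simply not in the data it receives.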
Your specific questions
Intersection of training and leaderboard data. None of the policy IDs in the RMSE leaderboard data are present in the training data.
How are predictions made. You asked how predict_expected_claim works in this setting. The answer is that the X_raw given to your model does not contain 4 years of history. It contains only the years up to and including the one being predicted. Remember that the final test data will only contain 1 year of data (year 5)! The reason it’s done this way is that you might build a model that looks at whether a car is severely devalued in the future to understand what happened in its past. That’s what we’re trying to prevent with this year-by-year method of RMSE computation.
Model trained on 4 years used to predict less. You are correct that your model will have had access to 4 years of data while training and this will show when predicting year 4 of the data. However, since these models are supposed to be designed to generate predictions on a contract-by-contract basis, this should not be too problematic.
I hope this clarifies things. If not, please respond here and we’ll discuss it further.