I have a couple of questions on the RMSE and what exactly the data looks like.
Do the policies in this leaderboard exist in the training data?
It is said that: “Your model makes predictions for Year 2 with access to data from years 1 - 2.”
If the policies were part of the training data I could see how this could be true, but if they're not, I'm not sure how that would work, as the predict_expected_claim function only takes two arguments: trained_model and X_train.
Also, given that our model is trained on 4 years of data, wouldn't it be true to say that our models are making predictions for year-2 policies with access to 4 years of claim info, and that we need to decide ourselves exactly what info to exclude when rating year-2 policies?
The way we generated predictions to compute the RMSE leaderboard scores is a stylised version of how a model in the real world might do so: it is given access to some past data to predict the present.
To compute your total RMSE for the 4 years of data in the leaderboard, we need to make predictions for years 1 - 4 and then plug that into the standard RMSE formula. However, we don’t want you to be able to look at the features of a contract in the future (years 2, 3, 4) to make predictions about the past (years 1, 2, 3). So when we compute your predictions for year X we only give you data about years X and prior.
In other words, when we say that your model makes predictions for Year 2 with access to data from years 1 - 2, it means that when your model is generating predictions for year 2, it only has access to data from years 1 and 2. We then gather predictions for the remaining years using the same principle and finally compute your RMSE score.
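Concretely, you can think of it as four nested slices of the leaderboard data, each predicted independently. A minimal sketch of the idea, assuming a single dataframe leaderboard_df with a year column and the predict_expected_claim(trained_model, X_raw) function discussed in this thread (the variable names are only for illustration):

# Build the four nested views: years 1-1, 1-2, 1-3 and 1-4.
datasets = [leaderboard_df[leaderboard_df.year <= y] for y in range(1, 5)]

# Predict each slice on its own, so the prediction for year X only ever
# sees rows from years 1..X. Only the final year of each slice is scored.
predicted_claims = [predict_expected_claim(trained_model, df) for df in datasets]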
Your specific questions
Intersection of training and leaderboard data. None of the policy IDs in the RMSE leaderboard data are present in the training data.
How are predictions made. You asked how predict_expected_claim works in this setting. The answer is that the X_raw given to your model does not contain 4 years of history; it contains just the right amount. Remember that the final test data will only contain 1 year of data (year 5)! The reason it's done this way is that you might otherwise build a model that looks at whether a car is severely devalued in the future to understand what happened in its past. That's what we're trying to prevent with this year-by-year method of RMSE computation.
Model trained on 4 years used to predict less. You are correct that your model will have had access to 4 years of data while training, and this will show when predicting year 4 of the data. However, since these models are supposed to be designed to generate predictions on a contract-by-contract basis, this should not be too problematic.
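To make the contract-by-contract point concrete, here is one possible shape of such a prediction function: a minimal sketch assuming a fitted scikit-learn-style estimator stored in trained_model and id_policy / year identifier columns (both assumptions, not the required design). Because each row is scored independently, it does not matter whether X_raw holds two years of leaderboard data or a single year-5 contract.

def predict_expected_claim(trained_model, X_raw):
    # Hypothetical sketch: score each contract-row from its own features only,
    # so the amount of history present in X_raw never changes an individual
    # contract's prediction.
    features = X_raw.drop(columns=["id_policy", "year"], errors="ignore")
    return trained_model.predict(features)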
I hope this clarifies things! If not, please respond here and we'll discuss it further.
I think what was throwing me off is that I thought for year 2 we'd have the claim cost for year 1 as part of the 'data from years 1 - 2', and I wasn't sure if we could incorporate that.
Yes, just to clarify: you don't have access to the claim column in any of the leaderboard datasets. Mentioning this in case your model is expecting such a column.
I think it would be great if we could have the claim amount column (from year 1 to t-1 if we are predicting year t, with the year t value as NA) as part of the leaderboard dataset, as this is quite a significant feature.
Apart from that, you mentioned that the final test dataset only contains year 5 data. I think some of the policy IDs are in our training dataset, so we are actually able to make use of the historical claim amount column as a feature. Is my understanding correct?
The main reason we don’t do this is that the purpose of the leaderboards is to have a dataset as similar to the test data as possible. The test data does not have the claim_amount in it as it is a simulation of incoming policies. Anything known about the past of the policies should be contained within the pricing model.
So in other words, this leaderboard is trying to provide the opportunity to make sure that:
- the risk modelling is done well
- the model is ready to be placed in the weekly markets
Having said that, there are other strong signals in the data regarding accidents in previous years that can be used.
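If you do want to exploit a policy's own past claims where that policy happens to be known at training time, one way to keep that knowledge "within the pricing model" is to store a per-policy claim summary at fit time and join it back on at prediction time, with a neutral fallback for unseen policies (such as those in the leaderboard). A rough sketch, assuming the training data has id_policy and claim_amount columns (the column and helper names here are assumptions, not part of the official template):

import pandas as pd

def fit_claim_history(X_train, y_claims):
    """Summarise each policy's observed claims during training."""
    history = pd.DataFrame({
        "id_policy": X_train["id_policy"].values,
        "claim_amount": y_claims,
    })
    return (history.groupby("id_policy")["claim_amount"]
                   .agg(past_claim_total="sum", past_claim_count="count")
                   .reset_index())

def add_claim_history(X_raw, claim_history):
    """Attach the stored summary; policies never seen in training get zeros."""
    merged = X_raw.merge(claim_history, on="id_policy", how="left")
    return merged.fillna({"past_claim_total": 0.0, "past_claim_count": 0.0})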
import numpy as np

def evaluate_rmse(datasets, predicted_claims):
    """Evaluate the RMSE for a given model, by considering 4 datasets
    independently. This circumvents the possibility of using future
    data to predict the past. For this reason, we treat datasets of years
    1-1, 1-2, 1-3, 1-4, and evaluate the error on the last year in each
    dataset. This allows models to use temporal information, but without
    looking into the future.

    Inputs:
    - datasets: list of 4 dataframes with years 1-1, 1-2, 1-3, 1-4
    - predicted_claims: list of 4 arrays with prices generated for
                        the final year by the model given each
                        corresponding dataframe in datasets

    Returns:
    - float corresponding to root mean squared error
    """
    sum_squared_error = 0
    n_records = 0
    for year, (df, predicted) in enumerate(zip(datasets, predicted_claims)):
        # Score only the rows from the final year of this slice.
        _filter = df.year == (year + 1)
        error = predicted[_filter] - df.claim_amount[_filter]
        sum_squared_error += (error ** 2).sum()
        n_records += _filter.sum()  # Number of records scored for this slice.
    # Convert the accumulated squared error into a single RMSE.
    return np.sqrt(sum_squared_error / n_records)
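For completeness, here is a hedged example of how the function above could be invoked to replicate the score locally. Note that the leaderboard files do not include claim_amount, so this only works on data where the claims are known (for instance, a held-out part of your training data); training_df and the other names follow the earlier sketch and are assumptions.

# Replicate the evaluation on data where claim_amount is available.
datasets = [training_df[training_df.year <= y] for y in range(1, 5)]
predicted_claims = [predict_expected_claim(trained_model, df) for df in datasets]

# Each predictions array is row-aligned with its dataframe; evaluate_rmse
# itself selects the final year of each slice before scoring.
print(evaluate_rmse(datasets, predicted_claims))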