RMSE Leaderboard Clarification

Hi,

I have a couple of questions on the RMSE and what exactly the data looks like.

  • Do the policies in this leaderboard exist in the training data?

  • It is said that:
    “Your model makes predictions for Year 2 with access to data from years 1 - 2.”

If the policies were part of the training data, I could see how this could be true; but if they're not, I'm not sure how that would work, since the predict_expected_claim function only takes two arguments: trained_model and X_train.

Also, given that our model is trained on 4 years of data, wouldn't it be true to say that our models are making predictions for year 2 policies with access to 4 years of claim info, and that we need to decide ourselves exactly what info to exclude when rating year 2 policies?

Would be good to understand this better!

Thanks!

Hi @Robert_Stefanazzi

How the RMSE leaderboard works :gear:

The way we generated predictions to compute the RMSE leaderboard scores is a stylised version of how a model in the real world might do so: it is given access to some past data to predict the present.

To compute your total RMSE for the 4 years of data in the leaderboard, we need to make predictions for years 1 - 4 and then plug that into the standard RMSE formula. However, we don’t want you to be able to look at the features of a contract in the future (years 2, 3, 4) to make predictions about the past (years 1, 2, 3). So when we compute your predictions for year X we only give you data about years X and prior.

In other words, when we say:

“Your model makes predictions for Year 2 with access to data from years 1 - 2.”

it means that when your model is generating predictions for year 2, it only has access to years 1 and 2. We then gather predictions for the remaining years using the same principle and finally compute your RMSE score.
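A minimal sketch of that year-by-year procedure (the data and the constant "model" here are synthetic stand-ins; only the predict_expected_claim(trained_model, X_raw) signature comes from the competition template):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the competition's predict_expected_claim:
# a dummy model that predicts a constant expected claim for every row.
def predict_expected_claim(trained_model, X_raw):
    return np.full(len(X_raw), trained_model["constant"])

trained_model = {"constant": 75.0}
# Synthetic leaderboard data: two policies observed over four years.
leaderboard_data = pd.DataFrame({"year": [1, 1, 2, 2, 3, 3, 4, 4]})

scored_predictions = []
for x in range(1, 5):
    # To predict year X, the model only sees rows from years 1..X.
    visible = leaderboard_data[leaderboard_data.year <= x]
    preds = predict_expected_claim(trained_model, visible)
    # Only the predictions made for year X itself enter the RMSE.
    scored_predictions.append(preds[(visible.year == x).to_numpy()])
```

Each entry of scored_predictions then holds the year-X predictions that are compared against the year-X claims.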

Your specific questions :thinking:

  • Intersection of training and leaderboard data. None of the policy IDs in the RMSE leaderboard data are present in the training data.
  • How predictions are made. You asked how predict_expected_claim works in this setting. The answer is that the X_raw given to your model does not contain 4 years of history; it contains exactly the right amount. Remember that the final test data will only contain 1 year of data (year 5)! The reason it’s done this way is that you might build a model that looks at whether a car is severely devalued in the future to infer what happened in its past. That’s what we’re trying to prevent with this year-by-year method of RMSE computation.
  • Model trained on 4 years used to predict less. You are correct that your model will have had access to 4 years of data while training, and this will show when predicting year 4 of the data. However, since these models are supposed to generate predictions on a contract-by-contract basis, this should not be too problematic.

I hope this clarifies things :+1: If not please respond here and we’ll discuss it further :slight_smile:

@alfarzan,

Thanks for the reply!

I think what was throwing me off is that I assumed, for year 2, we’d have the claim cost for year 1 as part of the ‘data from years 1 - 2’, and I wasn’t sure if we could incorporate that.

Rob

Yes, just to clarify: you don’t have access to the claim column in any of the leaderboard datasets. Mentioning this in case your model expects such a column.

Hi @alfarzan,

I think it would be great if we can have the claim amount column (from year 1 to t-1 if we are predicting year t, with year t value as NA) as part of the leaderboard dataset, as this is quite a significant feature.

Apart from that, you mentioned that the final test dataset contains only year 5 data. I think some of the policy IDs are in our training dataset, so we are actually able to use the historical claim amount column as a feature. Is my understanding correct?

Hi @davidlkl

The main reason we don’t do this is that the purpose of the leaderboards is to have a dataset as similar to the test data as possible. The test data does not have the claim_amount in it as it is a simulation of incoming policies. Anything known about the past of the policies should be contained within the pricing model.

So, in other words, this leaderboard tries to provide the opportunity to make sure that:

  1. The risk modelling is done well
  2. The model is ready to be placed in the weekly markets

Having said that, there are other strong signals in the data regarding accidents in previous years that can be used.

Hi is it possible to share the code of how the rmse is computed for the leaderboard?

Hi @huan_vo

Yes of course. Here it is:

import numpy as np


def evaluate_rmse(datasets, predicted_claims):
    """Evaluate the RMSE for a given model, by considering 4 datasets
    independently. This circumvents the possibility of using future
    data to predict the past. For this reason, we treat datasets of years
    1-1, 1-2, 1-3, 1-4, and evaluate the error on the last year in each
    dataset. This allows models to use temporal information, but without
    looking into the future.

    Inputs:
        - datasets: list of 4 dataframes with years 1-1, 1-2, 1-3, 1-4
        - predicted_claims: list of 4 arrays with prices generated for
                            the final year by the model given each
                            corresponding dataframe in datasets
    Returns:
        - float corresponding to root mean squared error
    """
    sum_squared_error = 0
    n_records = 0
    for year, (df, predicted) in enumerate(zip(datasets, predicted_claims)):
        _filter = df.year == (year + 1)
        error = predicted[_filter] - df.claim_amount[_filter]
        sum_squared_error += (error ** 2).sum()
        n_records += _filter.sum()  # Count the records scored for this year.
    # Convert the accumulated squared error into an RMSE.
    return np.sqrt(sum_squared_error / n_records)
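To see the function in action, here is a toy call with synthetic data (two policies over four years; the constant-prediction "model" is purely illustrative, and the function is re-stated in condensed form so the example is self-contained):

```python
import numpy as np
import pandas as pd

def evaluate_rmse(datasets, predicted_claims):
    # Condensed copy of evaluate_rmse above, for a self-contained example.
    sum_squared_error, n_records = 0.0, 0
    for year, (df, predicted) in enumerate(zip(datasets, predicted_claims)):
        _filter = (df.year == (year + 1)).to_numpy()
        error = predicted[_filter] - df.claim_amount.to_numpy()[_filter]
        sum_squared_error += (error ** 2).sum()
        n_records += _filter.sum()
    return np.sqrt(sum_squared_error / n_records)

# Synthetic data: two policies observed over four years.
full = pd.DataFrame({
    "year":         [1, 1, 2, 2, 3, 3, 4, 4],
    "claim_amount": [0.0, 100.0, 50.0, 0.0, 0.0, 200.0, 80.0, 0.0],
})
# Cumulative datasets with years 1-1, 1-2, 1-3, 1-4.
datasets = [full[full.year <= y].reset_index(drop=True) for y in range(1, 5)]
# A dummy "model" that predicts 60 for every row it is shown.
predicted_claims = [np.full(len(df), 60.0) for df in datasets]

print(evaluate_rmse(datasets, predicted_claims))
```

Only the last year of each cumulative dataset contributes errors, so all eight observations are scored exactly once.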

That is great. Thank you!
