More on using historical data for future predictions


I have a question that relates to some of the issues discussed here and here. It stems from failed submission 111543, made using R and the Google Colab submission notebook, which works with the data as provided in the notebook.

I first pre-process the data and try to create lag columns holding the previous claim_amount. I am aware that many of the contracts will be new, so I’m fine if those values are NA (xgboost can handle that). But for contracts that previously existed, I’d like to use past targets as features. I tried adding an argument that carries y_raw to each function, but now I get the error “Feature names stored in object and newdata are different!”, which comes from xgboost trying to apply the trained model (which has y_raw) to “new data” (which does not).

Does this mean that we cannot use past claim_amounts, even for insureds that previously had policies?




Good question. I have similar questions on the same.

The only way I can think of to get around it is to include policy_id in the model as a factor. I would much rather have preprocessed columns with historic claims, but given that our model needs to run on single-row data for year 5 policies, I don’t know how this would be possible unless we can somehow store more than just a trained model to be passed on…

It would be good to hear from the admins on this.

Hi @alan_feder

This is an interesting question indeed.

As @robstef points out, the models are meant to run on a single row of data (one contract) for the fifth year. The MSE leaderboard data is provided to emulate that dataset, and it will never contain claim_amount as a column. Therefore you cannot use the claim_amount column as a feature on the MSE leaderboard.

Having said that, without giving much away I can safely say that there are other signals in the data that tell you if a claim has occurred in previous years, when you have the previous years’ data as well.

As @robstef hinted, you could keep a memory of the contracts you consider risky in your saved model object. When your model is loaded, it first checks whether the contract being priced is one of these historically risky contracts and then adjusts the final premium depending on that flag.
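To make the idea concrete, here is a minimal Python sketch of a saved model object that carries its own memory of risky contracts. Everything in it is an assumption made for illustration: the function names `fit_with_memory` and `predict_premium`, the `loading` multiplier, the use of `policy_id` as the contract key, and the average claim standing in for a real fitted model. It is not the competition template's API.

```python
import pickle


def fit_with_memory(train_rows, loading=1.5):
    """Return a plain dict holding a base premium and the set of risky contracts."""
    claims = [row["claim_amount"] for row in train_rows]
    return {
        # Stand-in for a real fitted model: the average claim per training row.
        "base_premium": sum(claims) / len(claims),
        # Contracts with at least one claim in the training years.
        "risky_ids": {row["policy_id"] for row in train_rows if row["claim_amount"] > 0},
        # Hypothetical multiplier applied to flagged contracts.
        "loading": loading,
    }


def predict_premium(model, row):
    """Price a single contract; `row` carries no claim_amount column."""
    premium = model["base_premium"]
    if row["policy_id"] in model["risky_ids"]:
        premium *= model["loading"]
    return premium


# Toy training data (made up): two contracts observed over two years.
train_rows = [
    {"policy_id": "A", "claim_amount": 0.0},
    {"policy_id": "B", "claim_amount": 200.0},
    {"policy_id": "A", "claim_amount": 0.0},
    {"policy_id": "B", "claim_amount": 0.0},
]

model = fit_with_memory(train_rows)

# Round trip through pickle to mimic saving and loading the model object.
restored = pickle.loads(pickle.dumps(model))

premium_known_safe = predict_premium(restored, {"policy_id": "A"})
premium_known_risky = predict_premium(restored, {"policy_id": "B"})
premium_new_contract = predict_premium(restored, {"policy_id": "C"})
```

Because the risky set travels inside the saved object, the prediction step only needs the single row being priced. A contract that was never seen in training simply gets the unadjusted premium.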


Thanks to @alfarzan and @robstef for their advice.

I guess that would mean that any data processing of this form would have to be contained in the model object itself, and not in preprocess_X_data() (or even in the fit_model() function) - which would likely preclude me from simply saving the “trained model object” as an output from xgboost or keras or lm or whatever.

Thanks again!

Yes, the aim of the preprocessing is to clean the data, not to learn from it. If you build this kind of historical database, it should indeed be saved in the trained model object :slight_smile:

Hi @alfarzan,

Instead of just a flag, could I also store the historical aggregated claim amount as a feature?

I understand that there would be new policies with no historical data available. So my approach would be to have two separate models: one that uses the historical data and one that does not.

Hi @davidlkl

Yes! Absolutely, you can have any sort of feature engineering, including saving aggregate claim information. Just make sure that this is saved in your model object so that when it is loaded it can be used.
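As a rough Python sketch of how that could look, the saved object below stores each contract's average past claim and routes a contract to one of two simple predictors depending on whether history exists. The names `fit_history_model` and `predict_claim`, the `weight` parameter, the `policy_id` key, and the blending rule are all assumptions for illustration. In practice the two branches would be real fitted models (for example one trained with the history feature and one without).

```python
import pickle
from collections import defaultdict


def fit_history_model(train_rows, weight=0.5):
    """Aggregate past claims per contract and keep them inside the model object."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in train_rows:
        totals[row["policy_id"]] += row["claim_amount"]
        counts[row["policy_id"]] += 1
    return {
        # Historical feature: average yearly claim per known contract.
        "history": {pid: totals[pid] / counts[pid] for pid in totals},
        # Average claim over all training rows, used when no history exists.
        "portfolio_mean": sum(totals.values()) / sum(counts.values()),
        # Hypothetical weight given to a contract's own history.
        "weight": weight,
    }


def predict_claim(model, row):
    """Predict the expected claim for one contract without claim_amount."""
    past_mean = model["history"].get(row["policy_id"])
    if past_mean is None:
        # "No history" model: fall back to the portfolio average.
        return model["portfolio_mean"]
    # "With history" model: blend the contract's own past with the portfolio.
    w = model["weight"]
    return w * past_mean + (1 - w) * model["portfolio_mean"]


# Toy training data (made up): two contracts observed over two years.
train_rows = [
    {"policy_id": "A", "claim_amount": 0.0},
    {"policy_id": "B", "claim_amount": 200.0},
    {"policy_id": "A", "claim_amount": 0.0},
    {"policy_id": "B", "claim_amount": 0.0},
]

history_model = pickle.loads(pickle.dumps(fit_history_model(train_rows)))

expected_a = predict_claim(history_model, {"policy_id": "A"})
expected_b = predict_claim(history_model, {"policy_id": "B"})
expected_new = predict_claim(history_model, {"policy_id": "C"})
```

The lookup happens at prediction time inside the loaded object, so the incoming row never needs a claim_amount column, and new policies fall through to the branch that ignores history.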

Thanks for the explanation!

Just some follow-up questions:

  1. RMSE/weekly profit board test data
  • Are there any new policies in the test sets?
  2. Is it possible to have a sample test dataset?
  • It is a bit hard to code without a sample file, given the limited number of submissions.

The spirit of this competition is to simulate real-world market conditions as closely as possible, without unduly limiting participants’ ability to learn data science and play at their own pace and leisure.

One of the key features of a real world market is that you don’t know what the incoming years’ portfolio that you will price looks like in advance. For this reason we don’t provide the leaderboard data or test data in advance.

However, there is a bit more detail about what these policies look like in the leaderboard description on the overview page.

We’re also working on giving you feedback after each weekly leaderboard about the dataset of the past week to help with the learning process :gear: