Selection Bias: Are training data and evaluation data drawn from the same population?

I put this competition on my possible project list and am only getting around to data exploration now.
The training dataset includes only policies with four years of claims data (and, from the description, these policies are renewed for a fifth year). But the evaluation dataset includes policies that might not be renewed in the next year (unless I’m misreading the description). Doesn’t that bias the sample in a potentially important way, since that constraint doesn’t apply to the evaluation data? There are many reasons why a policy would not be renewed, but one would be if the driver was involved in a “very bad crash”. Is there a danger of selection bias where “very bad crashes” are under-represented in the training data but are more likely to appear in the evaluation data?

Or am I misunderstanding the description?

Hi @neill_sweeney and glad to have you on board :partying_face:

All the policies in the training data will be renewed for the 5th year. However, there will also be new policies in the 5th year only, which do not exist in the training data.

This is to simulate the fact that, as a company, part of your portfolio will be entirely new policies.

To answer the question about bias: yes, it would bias the data if we removed these non-renewed policies every year, for whatever reason, without introducing new ones every year. But that is not the case here.

I hope that clarifies things a bit :+1:

Okay, we have no training data for the last year of a policy. But we will be evaluated on data that will include the last year of some policies.

I am assuming that the size of the claim in a year will be correlated with the probability of renewing that policy the next year (e.g. after a “very bad crash”). If someone has experience in motor insurance, they might be able to comment on whether that assumption is reasonable and significant.

I am not concerned with the number of previous years of data available. As you say, any practical model should be able to cope with different amounts of historical data, and we have training data to learn from with up to three years of prior history. But we have no training data for the final year of a policy. Ideally, the training and evaluation data should be sampled from the same population. For this competition, that would mean either adding a proportional sample of policies that didn’t renew to the training data or removing any policy that didn’t renew from the evaluation data.
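The concern above can be illustrated with a toy simulation (all claim sizes and renewal probabilities here are made-up assumptions, not competition data): if non-renewal is more likely after a large claim, then a training sample restricted to renewed policies will under-represent large claims relative to the full portfolio.

```python
# Hypothetical simulation of renewal-conditioned selection bias.
# All numbers below are invented for illustration.
import random

random.seed(0)

def yearly_claim():
    # Most years have no claim; occasionally a moderate or large one.
    return random.choice([0, 0, 0, 0, 500, 5000])

population = 100_000
all_claims = []       # entire portfolio (evaluation-like sample)
renewed_claims = []   # policies that renew (training-like sample)

for _ in range(population):
    claim = yearly_claim()
    all_claims.append(claim)
    # Assumed renewal rule: a large claim makes non-renewal more likely.
    p_renew = 0.95 if claim < 1000 else 0.50
    if random.random() < p_renew:
        renewed_claims.append(claim)

mean_all = sum(all_claims) / len(all_claims)
mean_renewed = sum(renewed_claims) / len(renewed_claims)
print(f"mean claim, full portfolio: {mean_all:.0f}")
print(f"mean claim, renewed-only:   {mean_renewed:.0f}")
# The renewed-only mean comes out lower: large claims are
# under-represented once the sample is conditioned on renewal.
```

Any model fitted on the renewed-only sample would systematically under-predict claim sizes for a test set drawn from the whole portfolio.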
Anyway, sorry to bang on about sampling theory but my Statistics lecturer (from the last millennium) always stressed how important it was to be careful about making sure the sample was drawn from the population of interest.

Yes sampling is very important of course!

To add some details:

  1. Sampling bias for “very bad crashes”. We have been told that all policies with crashes that typically lead to an extremely large claim (in the realm of many millions of €) have been excluded from all 5 years of the data. So it’s not as if they appear in the training data but nowhere else.
  2. Training and evaluation sampling. The training and test data are sampled uniformly from the entire existing portfolio. However, the difference is that the entire final year is used for the test set. This is to simulate the real world, where you would not know what to expect in terms of distributional changes in the future year. However, every policy in this 100K set has renewed every year. They are all sampled from the same data provider, uniformly across their entire country-wide portfolio.
  3. Format of test set. Just to clarify something, the test set will only include 100K rows, all policies with year = 5. It will not include the 4 years of history. If you want to use that history for the contracts that you do have then you should encode that into your model :slight_smile:
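As a sketch of that last point (column names like `policy_id`, `year` and `claim_amount` are assumptions here, not the competition’s actual schema), the four years of history can be aggregated per policy and left-joined onto the year-5 test rows, so that genuinely new policies simply get empty history:

```python
# Minimal sketch: fold 4 years of per-policy history into features
# for year-5 test rows. Column names are hypothetical.
import pandas as pd

train = pd.DataFrame({
    "policy_id": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year":      [1, 2, 3, 4, 1, 2, 3, 4],
    "claim_amount": [0.0, 250.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1200.0],
})
test = pd.DataFrame({"policy_id": ["A", "B", "C"], "year": [5, 5, 5]})

# Aggregate the history once per policy.
history = (train.groupby("policy_id")["claim_amount"]
                .agg(past_total="sum",
                     past_n_claims=lambda s: (s > 0).sum())
                .reset_index())

# Left join keeps policy "C", which is new in year 5 and has no history.
features = test.merge(history, on="policy_id", how="left")
features[["past_total", "past_n_claims"]] = (
    features[["past_total", "past_n_claims"]].fillna(0))
print(features)
```

The left join is the important design choice: an inner join would silently drop the brand-new year-5 policies, which are exactly the rows the organisers say will be in the test set.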

It is highly unlikely the training set was sampled uniformly from an entire portfolio. If it was, there would be policy_id’s with only one or two years of claims data. (Unless this company has the best customer service in the world and 100% loyalty.) It might have been uniformly sampled from policies with 5 years of data, but that isn’t the same thing. The problem is that we have no training data for the final year of policies that don’t renew. Effectively, there is data leakage from the future in the training set, because we know these policies renew.
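For what it’s worth, the claim that every policy carries exactly four years of history is easy to check; here is a minimal sketch, with a hypothetical two-policy frame standing in for the real training file:

```python
# Sanity check: in a uniform draw from a portfolio with churn we would
# expect some policy_id's with fewer than 4 years of data. The toy
# frame below is a stand-in for the real training file.
import pandas as pd

train = pd.DataFrame({
    "policy_id": ["A"] * 4 + ["B"] * 4,
    "year": [1, 2, 3, 4] * 2,
})

years_per_policy = train.groupby("policy_id").size()
all_have_four = bool((years_per_policy == 4).all())
print("every policy has exactly 4 years:", all_have_four)
```

If this prints `True` on the real data, the sample was conditioned on surviving all four years, not drawn uniformly from the portfolio.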
But it is not as big a problem as I thought, because the “extremely large claim”/“very bad crash” category has been removed. So maybe it’s not a big problem at all. I have no idea whether there is a statistical relationship between a customer making a claim and renewing for the next year, and I won’t be able to work that out from the training set.
Anyway, I’ll stop banging on about “selection bias” now.


In practice they’d get a big rate hike and be less likely to renew, which is why renewal books of business tend to be better.