Question about `pol_no_claims_discount`

alan_feder · December 24, 2020, 7:35pm

Hi,

I am curious about the column `pol_no_claims_discount`. According to the PDF data dictionary, it seems it can only decrease by 0.047 or increase by a multiple of 0.203 from year to year (unless it would hit 0 or 1).

However, this does not seem to be the case. For example, for id_policy = "PL000000", the sequence goes 0.332, 0.280, 0.225, 0.166, for differences of -0.052, -0.055, -0.059.

Similarly, for id_policy = "PL000118", `pol_no_claims_discount` is 0 for years 1 and 2. In year 2, there is at least one claim (claim_amount = 1096.21). However, for year 3, `pol_no_claims_discount` is still 0. Moreover, even though there is no claim in year 3, for year 4, `pol_no_claims_discount` is now 0.196, not 0.203 or 0.156, as I would have expected.
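For anyone who wants to reproduce the differences quoted above, here is a minimal sketch. It only uses the four values listed for PL000000; the helper name `yearly_changes` is made up for illustration and is not part of the competition starter kit.

```python
# Values quoted above for id_policy PL000000, in year order.
ncd = [0.332, 0.280, 0.225, 0.166]


def yearly_changes(values):
    """Return the change between each pair of consecutive years, rounded to 3 decimals."""
    return [round(later - earlier, 3) for earlier, later in zip(values, values[1:])]


changes = yearly_changes(ncd)
print(changes)  # [-0.052, -0.055, -0.059]
```

None of the three steps equals -0.047, which is the discrepancy I am asking about.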

Is there something I am misunderstanding about how this column works?

Thanks!

alfarzan · December 24, 2020, 10:09pm

Hi @alan_feder

Since this is real data, there are sometimes minor issues like this one. We chose not to do any additional pre-processing, in order to preserve the realism of the data.

To answer your questions:

1. Not all claim types count. As stated in the data dictionary only certain claim types are taken into account in this calculation. For example, the value might not go down in the case of a rock breaking the windshield. This is why sometimes you see a claim occur but no change in this feature is seen.
2. The values are not exact. As this is real data, the data provider might have pre-processed this feature and you will find that the values are not always exact. However, generally they decrease by approximately 0.05 when no claim is present and increase by 0.2 when a claim is present.
3. Delays happen. Sometimes a claim might appear on the books but the pol_no_claims_discount column might not be updated until the next year. This could be for example due to the timing of the accident in the year or other administrative issues.
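
If it helps, the "approximately 0.05 down, 0.2 up" rule could be checked with a tolerance rather than exact equality. The sketch below is only an illustration: the function name `classify_change` and the tolerance `tol=0.02` are assumptions, not values from the data dictionary, and it does not handle the clipping at 0 and 1.

```python
def classify_change(delta, tol=0.02):
    """Label a year-over-year change in pol_no_claims_discount.

    tol is a hypothetical tolerance chosen for illustration only.
    """
    if abs(delta) <= tol:
        return "unchanged"
    # No counted claim: the value drops by roughly 0.05.
    if abs(delta + 0.05) <= tol:
        return "decrease"
    # Counted claim(s): the value rises by roughly a multiple of 0.2.
    if delta > 0:
        steps = round(delta / 0.2)
        if steps >= 1 and abs(delta - 0.2 * steps) <= tol * steps:
            return "increase"
    return "other"


# Changes taken from the examples earlier in this thread.
for delta in (-0.052, -0.055, -0.059, 0.196, 0.0):
    print(delta, classify_change(delta))
```

With this tolerance, the three drops for PL000000 all count as ordinary no-claim decreases and the 0.196 jump for PL000118 counts as a one-step increase.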

However, broadly speaking this feature acts as a signal as to how risky the driver is, in the eyes of the company at the time of sign up.

Thanks for noting this. I have now updated the data dictionary with the wording in point (2) above for increased clarity.

alan_feder · December 25, 2020, 3:47am

Thanks!

Does this mean that, theoretically (at least in real life, even if not in the contest), our “training data” might be out of date? We could have a policy where our data shows claim_amount == 0, but if we rechecked the data in a few months, it would show a claim?

alfarzan · December 25, 2020, 2:05pm

In reality, that would depend on the company in question and what the pricing model is used for. It is in the company's interest to keep the training data up to date for any modelling purpose, and most companies are very careful about this, as it is their bread and butter.

However, in this dataset, all policies cover a full year and the claims are recorded at the end of the year. Companies are usually much more careful about the claim amount column than other columns, for obvious reasons. The claims here are all up to date and we don’t have cases where a claim has occurred and was meant to be recorded, but has been delayed in being recorded. At least not to my knowledge or that of the data provider.