It's (almost) over! Sharing approaches

Since the difference in RMSE between the best claims model and the default one wasn't very big, I figured a claims model with a reasonably OK RMSE would probably be good enough.

My pricing strategy was built around the winner's curse. I fitted a large number of XGBoost models, throwing out parts of the data each time, to get an idea of the parameter error in the model. I averaged the models and applied a loading depending on the standard deviation of the estimates the models produced for a policy, so that the policies with the highest parameter uncertainty got the highest price. I was hoping this would make me more competitive on the policies with the lowest parameter error, so that if I won them there was an increased chance the model estimate was accurate and the profit more certain.
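
A minimal sketch of that uncertainty loading, assuming a simple subsample-and-refit scheme. The function name, subsample fraction and the `k` multiplier are my assumptions, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def uncertainty_loaded_prices(X_train, y_train, X_quote,
                              n_models=10, subsample_frac=0.8,
                              k=1.0, seed=0):
    """Fit many models on random subsets of the data, then load prices
    by the per-policy standard deviation of their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for i in range(n_models):
        # Throw out part of the data to probe parameter error
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
        m = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                      random_state=i)
        m.fit(X_train[idx], y_train[idx])
        preds.append(m.predict(X_quote))
    preds = np.vstack(preds)                  # (n_models, n_policies)
    mean, std = preds.mean(axis=0), preds.std(axis=0)
    # Policies with high parameter uncertainty get a bigger loading
    return mean + k * std
```

With `k = 0` this reduces to a plain model average, so `k` controls how aggressively uncertain policies are surcharged.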

I also did some underwriting. I identified policies at high risk of having large claims, along with other high-risk categories, and deliberately gave them a very high price to make sure I didn't write them.

I tried some price optimising, looking at market shares, profitability and how I changed my prices over time to guesstimate the optimal market share and an appropriate level of profit loading. I guesstimated a loss ratio range of 85-105% for my final submission, depending on the claims. Probably not good enough to win, but at least I learnt quite a lot along the way.

Best of luck everyone!


What seems to have worked for me is the binning and capping of numerical variables, usually into 10 approximately equal buckets. I didn't want the model to overfit on a small portion of the data (e.g. a split on top_speed 175 followed by a split on top_speed 177, which would basically one-hot encode speed 176).
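
A sketch of that equal-frequency binning with pandas (the function name and the example values are illustrative, not the poster's code):

```python
import pandas as pd

def bin_numeric(series, n_bins=10):
    """Bin a numeric feature into ~equal-frequency buckets so the model
    can't carve out tiny slices (e.g. top_speed between 175 and 177)."""
    binned = pd.qcut(series, q=n_bins, duplicates="drop")
    return binned.cat.codes  # ordinal bucket index

# Hypothetical top_speed values; 176 and 177 land in the same bucket
speeds = pd.Series([120, 130, 135, 150, 160, 175, 176, 177, 200, 240])
buckets = bin_numeric(speeds, n_bins=5)
```

`duplicates="drop"` guards against features with heavy ties, where some quantile edges coincide.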

I also created indicator variables for the 20 most popular cars. I wasn't sure how to do target encoding without overfitting on the make_model values with little exposure.
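
One way to build those indicators (the column and function names are mine; the poster didn't share code):

```python
import pandas as pd

def top_k_indicators(df, col="vh_make_model", k=20):
    """One-hot encode only the k most frequent categories; rarer
    categories get all-zero indicators (implicitly 'other')."""
    top = df[col].value_counts().nlargest(k).index
    for cat in top:
        df[f"{col}_{cat}"] = (df[col] == cat).astype(int)
    return df
```

Rare makes fall back to the all-zero "other" pattern, which avoids the exposure problem target encoding would have on them.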

I created an indicator variable for weight = 0. Not sure what those policies were, but they were behaving differently.

For the final week, at the cost of worsening my RMSE a bit on the public leaderboard, I included real claim_count and yrs_since_last_claim features (as opposed to the claim discount, which is not affected by all claims). Fingers crossed that this will provide an edge. It was quite predictive, but it will only be available for ~60% of the final dataset. And the average prediction for those with 0 (which would be the case for the ~40% that are new in the final dataset) was not decreased by too much… Time will tell.

Since I was first in the week 10 leaderboard, I decided not to touch the pricing layer. Didn’t want to jinx it.


That is smart; it can certainly help decrease some instability in the profit. I clearly didn't spend enough time on those analyses.


In the OP, @simon_coulombe said:

I didn't create any variables dependent on the previous years, because (to this day) I still don't know if we get years 1-5 for "new business" on the final leaderboard or just year 5 data. I assume it's only year 5.

Hmmm, good point. I assumed we would be getting the history.
The RMSE calculation is pretty clear that it does include it; however, the final evaluation is more ambiguous…

The final test dataset, where the final evaluation takes place, includes 100K policies for the 5th year (100K rows). To simulate a real insurance company, your training data will contain the history for some of these policies, while others will be entirely new to you.

(Emphasis mine)

Given that RMSE was clear, and this bold phrase in the final dataset, I expected we’d get historical data points.

I'd like it if the organizers could clarify this.


There's been lots of talk where they said this is meant to represent that an insurer gets both renewals and new business for which you don't know the history. We'll see :slight_smile:


Thanks for everyone sharing their approach!


  1. Similar to Simon, I have a vh_current_value which is exponentially decayed yearly with factor 0.2 and a floor value of 500
  2. Claim history:
  • Aggregated total claim count, total claim amount and years since the last claim (10 if no claims before)
  • Change in no-claim discount, number of years the no-claim discount increased
  3. Interaction variables (not all, but some)
  4. Binning (good for GLMs, as they are quite sensitive to outliers)
  5. I dropped vh_make_model as I think the vehicle information is mostly reflected by vh_value, vh_weights etc.; the noise-to-information ratio is too high for it
  6. I grouped Med1 with Med2 as they are very similar
  7. Population per town surface area ratio
  8. Some log transforms / power transforms of numerical variables
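
Item 1 could look something like this, reading "decayed yearly with factor 0.2" as 20% depreciation per year of vehicle age (the `vh_age` input and function signature are my assumptions):

```python
import numpy as np

def vh_current_value(vh_value, vh_age, decay=0.2, floor=500.0):
    """Depreciate the vehicle's value by `decay` per year of age,
    never dropping below `floor`."""
    return np.maximum(vh_value * (1.0 - decay) ** np.asarray(vh_age), floor)
```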

I used the same feature set for the large-claim detection model and the claim estimation model.

Large-claim detection model:
An XGBoost and a logistic regression model to predict whether a claim would be >3k.
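
A rough sketch of blending the two classifiers. The post doesn't say how they're combined, so a simple probability average is assumed, with gradient boosting standing in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

def large_claim_proba(X_train, claim_amounts, X_quote, threshold=3000.0):
    """Probability that a policy's claim exceeds `threshold`, averaged
    over a boosted-tree classifier and a logistic regression."""
    y = (np.asarray(claim_amounts) > threshold).astype(int)
    gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y)
    logit = LogisticRegression(max_iter=1000).fit(X_train, y)
    return 0.5 * (gbm.predict_proba(X_quote)[:, 1]
                  + logit.predict_proba(X_quote)[:, 1])
```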

Claim estimation model:
I stacked 7 base models using a Tweedie GLM as the meta-learner under 5-fold CV.
Base models:

  1. Tweedie GLM
  2. LightGBM
  3. DeepForest
  4. XGBoost
  5. CatBoost
  6. Neural Network with Tweedie deviance as loss function
  7. A neural network with log-normal distribution likelihood as loss function (learning the mu and sigma of the loss)
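
With scikit-learn the stacking setup might look roughly like this. Two stand-in base learners replace the seven above, and `power=1.5` for the Tweedie meta-learner is my guess, not the poster's value:

```python
from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import TweedieRegressor

stack = StackingRegressor(
    estimators=[
        ("gbm", GradientBoostingRegressor(random_state=0)),
        ("glm", TweedieRegressor(power=1.5, max_iter=1000)),
    ],
    # Tweedie GLM meta-learner, fit on out-of-fold base predictions
    final_estimator=TweedieRegressor(power=1.5, max_iter=1000),
    cv=5,  # 5-fold CV, as in the post
)
```

`StackingRegressor` handles the out-of-fold prediction bookkeeping, so the meta-learner never sees base-model predictions made on their own training folds.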

Price = (1 + loading) * (estimated claim) + fixed_loading
If predicted to be large claim, loading = 1
If not: loading = 0.15
fixed_loading = 5
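
Expressed as code, the pricing rule above is simply (parameter names are mine):

```python
def price(estimated_claim, predicted_large, loading_large=1.0,
          loading_normal=0.15, fixed_loading=5.0):
    """(1 + loading) * estimated claim + fixed loading; the loading is
    punitive when the large-claim model flags the policy."""
    loading = loading_large if predicted_large else loading_normal
    return (1.0 + loading) * estimated_claim + fixed_loading
```

So a policy with an expected claim of 100 would be quoted 120 normally, or 205 if flagged as a likely large claim.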

Since I filter out most of the predicted large-claim policies, my average premium is quite low (~65). So the estimated profit ratio is about 15% + (5/65) = ~22%.


Right, but for those policies that we do know, the question is:
are we being fed years 1-5 in the preprocess function, or are we only given year 5?


Just to clarify this one, you are only given year 5. The test data will only include 100K rows all with year = 5.

If we run into errors because of this we’ll let you know!


Hmm, bugger! I assumed we’d get access to the previous 4 years, akin to the RMSE leaderboard.

I played it safe and included a csv with the number of claims per id_policy (


I assumed we would only get fed year 5 data, so consciously made decisions for the preprocessing step and modeling (I would’ve submitted a different model if we were going to get years 1-4 in the final evaluation). I assume the only way you get years 1-4 is if you save them with your submission, but even then you can only do it for the 57k policies in the training data.


That’s interesting!
My quick understanding of this is that you sell very expensive policies to people other insurers deemed "too risky" but whom you hope are worthy of a second chance.

There's definitely a lot of money there, and I left it on the table. If anyone similar to you has ever made a claim, I probably won't sell to you…

Given your profit leaderboard position it worked at least once! :slight_smile:


Same :crying_cat_face:


I'm actually a bit miffed, to be honest. I get that it's an oversight on our part given the ambiguous wording, but I've got a few "NCD history" features embedded in my pre-processing. Why would we spend 10 weeks with one data structure (requiring the pre-processing code to calculate these features on the fly), and then have to refactor it for the final submission?

Luckily I've got a "claim history" data frame as part of my final "model", which was added last minute and gives some sizeable loadings (over and above my NCD-change history features), so I'll have some mitigation from that.

I understand that this was not as clear as it could have been. Can I ask how exactly you and @michael_bordeleau (and potentially others) were expecting the final dataset to look?

We could make a few exceptions and make it work :muscle:


I’m guessing the admins maybe didn’t foresee the use of prior history features? This is obviously super common in actual insurance rating plans, but I can’t really think of another reason. My prior year variables were also quite predictive, but knowing that I would only have them for a little over half the final policies, I thought about having two sets of models:

  1. For the 57K in our training set, use the best models which have the prior year features, and
  2. For the other policies, use a subpar set of models trained without using any of the prior year features.

Ultimately, probably from running out of steam, I decided to just use the subpar models for all policies. I did something similar to @simon_coulombe and saved a simple list of policies with the number of years they had claims in our training data. I ended up doing some kind of out-there feature engineering and, with the new set of variables, got pretty close to the accuracy of the models using prior-year features.
I definitely think a different process probably makes more sense for the final evaluation, and I have my gripes with the preprocessing function, but I don't think the explanation about the final data set was ambiguous: "The final test dataset, where the final evaluation takes place, includes 100K policies for the 5th year (100K rows)."


I’m at the same place as you. Seems like we had similar thinking all along this competition.

Everything was framed in a way to use all the data available.

When you submit a model, your model makes predictions for:

  1. Year 1 with access to data from year 1.
  2. Year 2 with access to data from years 1 - 2.
  3. Year 3 with access to data from years 1 - 3.
  4. Year 4 with access to data from years 1 - 4.

This is calling for us to develop features that look into the history of a client.

And so were the weekly leaderboards (unless I misunderstood it as well).


I feel like that wouldn't be fair to the rest of us, since I changed my models assuming we would just have year 5.


I expected the data to be structured in the same fashion as the RMSE leaderboard.

Therefore, to quote year 5, we would have access to years 1 to 5 in terms of underwriting variables.

When you submit a model, your model makes predictions for:

  1. Year 1 with access to data from year 1.
  2. Year 2 with access to data from years 1 - 2.
  3. Year 3 with access to data from years 1 - 3.
  4. Year 4 with access to data from years 1 - 4.

Would be:
… year 5 with access to data from years 1 - 5, for the policies that have this information.

For new business, I understand that there would not be any info.


Hmm ok we’ll look into this and see the scale of the issue. If many people had this issue then we will definitely take action, and if any model fails because of this we’ll get in touch :+1:t2: