It's (almost) over! Sharing approaches

Hey all!

I thought I’d start a thread to share ideas and approaches. I believe it’s been very hard for a few of us to keep quiet. :slight_smile:

I just added ideas as I had them and I didn’t work much on model selection, because that felt like work. @alfarzan might say I went with a kitchen-sink approach.

My model is very similar to the starter pack I shared earlier. Here is the final version repository. Honestly, it isn’t complicated, as I spent a lot of time thinking but not much time working :). It’s a single model: xgboost, Tweedie.

Did you try GAMs? Poisson * gamma? Logistic * gamma?

Feature engineering

Here are some feature engineering ideas I liked:

a) create a “town_id” variable by concatenating the population and surface area. Then you can treat it like any other categorical variable with many possible values, using target encoding or something else.
b) create a “vh_current_value” variable by depreciating cars 20% every year.
c) I tried to make a “known perfect bonus-malus values” indicator. According to the data dictionary, you started at a precise value and decreased by a specific amount every claim-free year. If you had a claim, your bonus-malus increased by an amount that set you off the “perfect track” forever. This would have been useful for new business. In practice, having a claim only increased the bonus-malus value by 1 year, so it didn’t work.

I didn’t create any variables dependent on the previous years, because (to this day) I still don’t know whether we get years 1-5 for “new business” on the final leaderboard or just year 5 data. I assume it’s only year 5, so I didn’t want to be too dependent on counts (like increases in the “bonus-malus” variable, or city/car changes).

train_my_recipe <- function(.data) {
  my_first_recipe <-
    recipes::recipe(
      claim_amount ~ .,
      .data[0, ]
    ) %>%
    recipes::step_mutate(
      light_slow = if_else(vh_weight < 400 & vh_speed < 130, 1, 0, NA_real_),
      light_fast = if_else(vh_weight < 400 & vh_speed > 200, 1, 0, NA_real_),
      town_id = paste(population, 10 * town_surface_area, sep = "_"),
      age_when_licensed = drv_age1 - drv_age_lic1,
      pop_density = population / town_surface_area,
      young_man_drv1 = as.integer((drv_age1 <= 24 & drv_sex1 == "M")),
      fast_young_man_drv1 = as.integer((drv_age1 <= 30 & drv_sex1 == "M" & vh_speed >= 200)),
      young_man_drv2 = as.integer((drv_age2 <= 24 & drv_sex2 == "M")),
      # no_known_claim_values = as.integer(pol_no_claims_discount %in% no_known_claim_values),
      year = if_else(year <= 4, year, 4), # replace year 5 with a 4
      vh_current_value = vh_value * 0.8^(vh_age - 1), # depreciate 20% per year
      vh_time_left = pmax(20 - vh_age, 0),
      pol_coverage_int = case_when(
        pol_coverage == "Min" ~ 1,
        pol_coverage == "Med1" ~ 2,
        pol_coverage == "Med2" ~ 3,
        pol_coverage == "Max" ~ 4
      ),
      pol_pay_freq_int = case_when(
        pol_pay_freq == "Monthly" ~ 1,
        pol_pay_freq == "Quarterly" ~ 2,
        pol_pay_freq == "Biannual" ~ 3,
        pol_pay_freq == "Yearly" ~ 4
      )
    ) %>%
    recipes::step_other(recipes::all_nominal(), threshold = 0.005) %>%
    recipes::step_string2factor(recipes::all_nominal()) %>%
    # 2-way interactions
    recipes::step_interact(~ pol_coverage_int:vh_current_value) %>%
    recipes::step_interact(~ pol_coverage_int:vh_time_left) %>%
    recipes::step_interact(~ pol_coverage_int:pol_no_claims_discount) %>%
    recipes::step_interact(~ vh_current_value:vh_time_left) %>%
    recipes::step_interact(~ vh_current_value:pol_no_claims_discount) %>%
    recipes::step_interact(~ vh_time_left:pol_no_claims_discount) %>%
    # 3-way interaction
    recipes::step_interact(~ pol_coverage_int:vh_current_value:vh_age) %>%
    # remove id
    recipes::step_rm(contains("id_policy")) %>%
    # recipes::step_novel(all_nominal()) %>%
    recipes::step_dummy(all_nominal(), one_hot = TRUE)

  prepped_first_recipe <- recipes::prep(my_first_recipe, .data, retain = FALSE)
  return(prepped_first_recipe)
}

Feature importance table is here

Exporting test data?

I considered using the decimals in the price to export the test data. For example, a price of $103.12125 could mean I charged a 12% profit margin, “1” means man, and “25” means “25 years old”. We were never provided with enough information (and that’s a good thing, because that’s a nasty hack hehe).

Random pricing

I used the same claims model for weeks 1 to 10. For pricing, I started with a 20% profit margin and a minimum price of $25. Then I started using a random profit margin between 1 and 100%. The percentage was determined by the first 2 decimals of the predicted claims. If the predicted claim was $92.12, then I would apply a 12% profit margin to reach a price of $103.17. I would then replace the first 2 decimals with the profit margin so that I would remember what I charged (thus $103.12). The idea was that if we received feedback at the policy level, I could know after a week what my market share would be at any possible profit margin. This was useless, because we never got market feedback at the policy level.
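A toy version of this decimal-encoding scheme (my own Python sketch, not Simon's code; the margin is capped at 99 since only two decimals are available to store it):

```python
import random

def price_with_encoded_margin(expected_claim: float) -> tuple[float, int]:
    """Apply a random profit margin and stash it in the first two decimals."""
    margin = random.randint(1, 99)  # two decimals can only store 0-99
    price = expected_claim * (1 + margin / 100)
    # Overwrite the first two decimals with the margin so it can be read
    # back later from whatever feedback quotes the final price.
    return int(price) + margin / 100, margin

def decode_margin(price: float) -> int:
    """Recover the profit margin stored in the first two decimals."""
    return round((price - int(price)) * 100)
```

So a predicted claim of $92.12 priced at a 12% margin becomes $103.12: the integer part carries the price, and the decimals carry the margin.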

The random profit margin model earned a podium twice, which I found very funny.

Random pricing did have a use. Looking at the conversion_rate and the average profit of the often/sometimes/never sold quotes for the weeks using random profit margins, I have a feeling that around 20% was a good number. For example, if you look at my “financials by conversion rate” table in the week 8 feedback, then you see that the higher the profit margin, the less often I sell a quote (obviously). You also see that my highest profit per policy was for the policies sold “sometimes”, and the average profit margin for that group was 21%.

Financials By Conversion Rate, week 8

## # A tibble: 4 x 6
##   `Policies won`      claim_frequency premiums conversion_rate `profit per poli…
##   <chr>                         <dbl>    <dbl>           <dbl>             <dbl>
## 1 often: 34.0 - 100.…            0.09     92.9            0.74            -24.2
## 2 sometimes: 1.4 - 3…            0.1     132.             0.11              1.9
## 3 rarely: 0.1 - 1.4%             0.1     143.             0.01              0.19
## 4 never:                         0.11    221.             0                 0  
## # … with 1 more variable: profit_margin <dbl>

In the final weeks I tried a random profit margin between 20-45%, the idea being to get a few “very profitable policies” by trying to sell at a 30+% profit margin, while also ensuring that I sell at least a few policies and remain on the leaderboard thanks to the 20-30% profit margin on half the quotes. That didn’t work very well.

My model was profitable for the first half of the competition with a very small market share, until people caught on that you needed high profit margins.

Weekly feedbacks

Here are the links to my weekly feedback posts:

week 10 (20-45% profit margin)
week 9 (20-45% profit margin)
week 8 (1-100% profit margin)
week 7 (20% profit margin)
week 6 (1-100% profit margin)
week 5 (20% profit margin)
week 4 (20% profit)
week 3
week 2
week 1 (20% profit)

For the final week, I saved the number of claims each policy_id had during the 4 years. Each number has a markup: 1 claim is about 25%, 3 claims is about 120% … and with 4 claims I just don’t want you (multiply claims by 10,000 to get the price). New business is charged 5% more than a claims-free renewal because I don’t know how many claims they have.


@simon_coulombe Awesome but before I comment, I should mention that the competition is open for another 4 hours!

If you are aware then great, and I’ll come back to this soon for an in-depth read :eyeglasses:


title page says “competition over” lol
i’ll repost later

1 Like

Thanks for the heads up, fixed now!

I’ll share my approach for sure


I ended up just using elastic net GLMs, weighting together an all-coverage GLM with by-coverage GLMs. The weighting ended up being about 50/50 for each coverage except Max, which just used the all-coverage model. Some important features that I engineered:

(To create these features on year 5 data, I saved the training set within the model object and appended it within the predict function, then deleted the training observations after the features were made.)

When you group by policy id, if the number of unique pol pay freq is > 1 it correlates with loss potential.

These people had some instability in their financial situation that seems correlated with loss potential.

When you group by policy id, if the number of unique town surface area > 1 it also correlates with loss potential.

My reasoning for this was that if someone switched the town they live in then they’d be at greater risk due to being less familiar with the area they are driving in.

I used pol_no_claims_discount like this to create an indicator for losses in the past 2 years:

x_clean = x_clean %>% arrange(id_policy, year)

if (num_years > 2) {
  for (i in 3:num_years) {
    df = x_clean[x_clean$year == i | x_clean$year == i - 2, ]
    df = df %>%
      group_by(id_policy) %>%
      mutate(first_pd = first(pol_no_claims_discount),
             last_pd  = last(pol_no_claims_discount)) %>%
      mutate(diff2 = last_pd - first_pd)
    x_clean[x_clean$year == i, ]$diff2 = df[df$year == i, ]$diff2
  }
}
x_clean$ind2 = x_clean$diff2 > 0  # ind2 then goes into the GLM

In practice, we know that geography correlates strongly with socioeconomic factors like credit score, which indicate loss potential. I thought the best way to deal with this was to treat town surface area as categorical, assuming there are not very many distinct towns with the same surface area. I don’t think this is fully true, since some towns had really different population counts, but overall it seemed to work. I couldn’t group by town surface area and population together because population changes over time for a given town surface area. So:

I treated town surface area as categorical and created clusters with similar frequencies which had enough credibility to be stable across folds and also across an out-of-time sample.

Since Max was composed of many claim types, I created clusters based on severity by town surface area, the idea being that some areas would have more theft, some would have more windshield damage, etc.

I created clusters of vh_make_model and also of town_surface_area based on the residuals of my initial modeling. I also tried using generalized linear mixed models but no luck with that.

Two-step modeling like this with residuals is common in practice and is recommended for territorial modeling.

Since pol_no_claims_discount has a floor at 0, an indicator for pol_no_claims_discount == 0 is very good since the nature of that value is distinct. Also, this indicator interacted with drv_age1 is very good, my reasoning being that younger people who have a value of 0 have never been in a crash at all in their lives, and that this is more meaningful than an older person who may have been in one but had enough time for it to go down. This variable was really satisfying because my reasoning perfectly aligned with my fitting plot when I made it.

Other than the features I have listed, I fit a few variables using splines while viewing predicted vs. actual plots, and modeled age separately for each gender. I modeled the interaction mentioned above using a spline to capture the favorable lower ages.

The weirdest feature I made that worked although wasn’t too strong was made like this:
By town surface area, create a dummy variable for each vh_make_model and get the average value of each one, then use PCA on these ~4000 columns to reduce the dimension. The idea behind this is that it captures the composition of vehicles by area which may tell you something about the region.
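That vehicle-composition idea can be sketched in Python (a hypothetical reimplementation using the competition's column names; function and output names are mine):

```python
import pandas as pd
from sklearn.decomposition import PCA

def vehicle_mix_features(df: pd.DataFrame, n_components: int = 10) -> pd.DataFrame:
    """Summarise each area's vehicle-model mix with a few PCA components.

    Assumes columns 'town_surface_area' and 'vh_make_model'; one dummy column
    per model is averaged within each area, giving the share of each model
    among the area's policies, then PCA compresses those ~4000 columns.
    """
    mix = (pd.get_dummies(df["vh_make_model"])
             .groupby(df["town_surface_area"])
             .mean())
    pca = PCA(n_components=min(n_components, min(mix.shape)))
    comps = pca.fit_transform(mix)
    cols = [f"veh_mix_pc{i + 1}" for i in range(comps.shape[1])]
    return pd.DataFrame(comps, index=mix.index, columns=cols)
```

The resulting components can then be joined back onto each policy by town_surface_area.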

Another weird one that I ended up using was whether or not a risk owned the most popular vehicle for the area.

x_clean = x_clean %>%
  group_by(town_surface_area) %>%
  mutate(most_popular_vehicle = names(which.max(table(vh_make_model)))) %>%
  mutate(has_most_popular = vh_make_model == most_popular_vehicle)

I used 5-fold cross validation and also did out-of-time validation using years <=3 to predict year 4 and using years <= 2 to predict years 3 and 4.

For my pricing I made some underwriting rules like:

I do not insure (meaning I multiply the premium by 10): Quarterly payers, AllTrips, risks with a male secondary driver on coverages Max or Med2, anyone who switched towns or payment plans (even if neither plan was Quarterly), anyone in the top 1% probability under a large-loss model I built for claims over 10k, and anyone with certain vh_make_models or certain towns that had a really high percentage of excess claims. These last two were a bit judgemental, because some of the groups clearly lacked the credibility to say they have high excess potential, but I didn’t want to take the chance.

I didn’t try to skim the cream on any high risk segment, but it is reasonable that an insurer could charge these risks appropriately and make a profit while competing for them. Did any of you try and compete for all trips or Quarterly with a really high risk load?

Finally I varied my profit loading by coverage and went with this in the end:

Max - 1.55
Min - 1.45
Med1 - 1.4
Med2 - 1.45

Good luck to all!


I would like to express my sincere thanks to the organisers. This competition for me is a game and I love game theory.

Here is my approach.

Machine learning model: very simple

  • Features: All features as they are except for “vh_make_model” variable. For this feature, I use OOF target encoding.
  • Frequency model: RandomForestClassifier
  • Severity models: 2 GradientBoostingRegressor models with quantile loss 0.2 and 0.8
  • Claim = Frequency * (Severity 0.2 + Severity 0.8) / 2
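The frequency-times-severity combination above can be sketched with the sklearn models the post names (variable names and hyperparameters are mine):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor

def fit_claim_model(X, y_claim, X_sev, y_sev):
    """Fit the frequency/severity pieces and return a claim-cost predictor.

    X, y_claim: all policies and a 0/1 claim indicator.
    X_sev, y_sev: policies with a claim and their claim amounts.
    """
    freq = RandomForestClassifier(n_estimators=200).fit(X, y_claim)
    # Two quantile-loss GBMs bracket the severity distribution.
    sev_lo = GradientBoostingRegressor(loss="quantile", alpha=0.2).fit(X_sev, y_sev)
    sev_hi = GradientBoostingRegressor(loss="quantile", alpha=0.8).fit(X_sev, y_sev)

    def predict(X_new):
        p = freq.predict_proba(X_new)[:, 1]               # claim probability
        sev = (sev_lo.predict(X_new) + sev_hi.predict(X_new)) / 2
        return p * sev                                    # expected claim cost
    return predict
```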

Pricing strategy: This part is the one on which I spent most of my time.
Price = a * Claim + b

I did not do any simulations but changed these 2 elements (a, b) every single week. For week 10, I was in 5th position, with a = 1.04 and b = 100.

I started with b = 1, 10, 50, 100, but the lower b is, the less profitable I am. For the final round, a = 1.05 and b = 100. I hope the best strategy is to sell expensive. :slight_smile:


Thank you for sharing your ideas! It is very enlightening, and for me this is the most important aspect of this competition.

Here are some of my approaches:
Feature engineering
Just one thing I haven’t seen discussed yet: I used kinetic energy (proportional to mass * velocity^2), which proved to give some gain in gradient boosting models, the idea being that this should correlate with the damage a car may cause.
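The feature itself is a one-liner (a pandas sketch assuming the dataset's vh_weight and vh_speed columns; the constant factor doesn't matter for tree models, only the ordering does):

```python
import pandas as pd

def add_kinetic_energy(df: pd.DataFrame) -> pd.DataFrame:
    """Add a kinetic-energy-style feature: proportional to mass * velocity^2."""
    out = df.copy()
    out["kinetic_energy"] = out["vh_weight"] * out["vh_speed"] ** 2
    return out
```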

I went with a NN model that maximizes profit w.r.t. price increase in the form

profit(price increase) = (expected claim + price increase) * (1-probability of winning a contract)

Obviously, I had to make assumptions about the “price elasticity” = (1-probability of winning a contract) part of the equation.
I was surprised how well this model took care of assigning higher price increases (relative to expected claim amount) for policies with lower claim probabilities. Meaning that if two policies have the same expected claim amount but one of them has a lower claim probability (and thus higher severity) - that policy deserves a higher price increase as the variance is higher.
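The price-increase optimisation can be illustrated numerically (a simplified stand-in for the NN described above, not the author's model: here expected profit is taken as increase times win probability, the win-probability curve is an assumed logistic shape, and `elasticity` is a made-up parameter):

```python
import numpy as np

def optimal_price_increase(expected_claim: float,
                           elasticity: float = 0.02) -> float:
    """Scan candidate price increases and return the profit-maximising one.

    profit(m) = m * p_win(m), with p_win a guessed logistic curve: the
    higher the price increase, the lower the chance of winning the contract.
    """
    m = np.linspace(0.0, 2.0 * expected_claim, 500)   # candidate increases
    p_win = 1.0 / (1.0 + np.exp(elasticity * m))      # assumed elasticity
    expected_profit = m * p_win
    return float(m[np.argmax(expected_profit)])
```

The same scan run policy by policy reproduces the behaviour described: policies with more uncertain outcomes can tolerate larger increases before the expected profit turns down.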

Unfortunately, I joined somewhat late and it took weeks 7 and 8 before I learned that the average price loading has to be some 20-40% in order to make profit.

Good luck to all of you!


Since the difference in RMSE between the best claims model and the default one was not very big, I figured a claims model with a reasonably OK RMSE would probably be good enough.

My pricing strategy was based around the winner’s curse. I fitted a large number of xgboost models, throwing out parts of the data to get an idea of the parameter error in the model. I took the average of the models and applied a loading depending on the standard deviation of the estimates each model produced for a policy, so that the policies with the highest parameter uncertainty got the highest price. I was hoping this would mean I would be more competitive for the policies producing the lowest parameter error, so that if I won them there was an increased chance the model estimate was accurate and the profit was more certain.
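A minimal sketch of this uncertainty loading (sklearn GBMs stand in for the xgboost models described, and the subsample fraction and `k` multiplier are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def uncertainty_loaded_prices(X, y, X_new, n_models=20, k=1.0, seed=0):
    """Fit many models on random subsets, then load prices by their spread.

    Policies where the refits disagree the most (high parameter uncertainty)
    receive the largest loading; k controls how strongly that is penalised.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=int(0.7 * len(X)), replace=False)
        model = GradientBoostingRegressor(n_estimators=50).fit(X[idx], y[idx])
        preds.append(model.predict(X_new))
    preds = np.array(preds)
    mean, sd = preds.mean(axis=0), preds.std(axis=0)
    return mean + k * sd
```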

I also did some underwriting. I identified policies at high risk of having large claims, and other high-risk categories, and deliberately gave them a very high price to make sure I didn’t write them.

I tried some price optimising, looking at market shares, profitability, and how I changed my prices over time to guesstimate the optimal market share and an appropriate level of profit loading. I guesstimated a loss ratio range of 85-105% for my final submission depending on the claims; probably not good enough to win, but at least I learnt quite a lot along the way.

Best of luck everyone!


What seems to have worked for me is binning and capping numerical variables, usually into 10 approximately equal buckets. I didn’t want the model to overfit on a small portion of the data (e.g. a split on top_speed 175 followed by a split on top_speed 177, which would basically one-hot encode the speed 176).

I also created indicator variables for the 20 most popular cars. I wasn’t sure how to do target encoding without overfitting on the make_models with little exposure.

I created an indicator variable for weight = 0. Not sure what those rows were, but they behaved differently.

For the final week, at the cost of slightly worsening my RMSE on the public leaderboard, I included real claim_count and yrs_since_last_claim features (as opposed to the claim discount, which is not affected by all claims). Fingers crossed that this will provide an edge. It was quite predictive; however, it will only be available for ~60% of the final dataset. And the average prediction for those with 0 (which would be the case for the ~40% that are new in the final dataset) was not decreased by too much… The future will tell.

Since I was first in the week 10 leaderboard, I decided not to touch the pricing layer. Didn’t want to jinx it.


That is smart; it can for sure help decrease some instability in the profit. I clearly didn’t spend enough time on those analyses.

1 Like

in OP, @simon_coulombe said:

I didn’t create any variables dependent on the previous years, because (to this day) I still don’t know if we get years 1-5 for “new business” for the final leaderboard or just year 5 data. I assume it’s only year 5

Hmmm, good point. I assumed we would be getting the history.
The RMSE calculation is pretty clear that it does include it, however the final evaluation is more ambiguous…

The final test dataset, where the final evaluation takes place, includes 100K policies for the 5th year (100K rows). To simulate a real insurance company, your training data will contain the history for some of these policies, while others will be entirely new to you.

(Emphasis mine)

Given that RMSE was clear, and this bold phrase in the final dataset, I expected we’d get historical data points.

I’d appreciate it if the organizers could clarify this.


There’s been lots of talk where they said this would be to represent that an insurer gets renewals and new business where you don’t know the history. We’ll see :slight_smile:

1 Like

Thanks for everyone sharing their approach!


  1. Similar to Simon, I have a vh_current_value which is exponentially decayed yearly with factor 0.2 and a floor value of 500
  2. Claim history:
  • Aggregated total claim count, total claim amount and years since the last claim (10 if no claims before)
  • Change in no-claim discount, number of years in which the no-claim discount increased
  3. Interaction variables (not all but some)
  4. Binning (good for GLMs as they are quite sensitive to outliers)
  5. I dropped vh_make_model as I think the vehicle information is mostly reflected by vh_value, vh_weight etc.; the noise-to-information ratio is too high for it
  6. I grouped Med1 with Med2 as they are very similar
  7. Population per town surface area ratio
  8. Some log transforms / power transforms of numerical variables

I use the same feature set for the large-claim detection model and the claim estimation model.

Large Claim detection model:
An XGBoost and a logistic regression model to predict whether a claim would be >3k.

Claim estimation model:
I stacked 7 base models using a Tweedie GLM as the meta-learner under 5-fold CV.
Base models:

  1. Tweedie GLM
  2. Light GBM
  3. DeepForest
  4. XGBoost
  5. CatBoost
  6. Neural Network with Tweedie deviance as loss function
  7. A neural network with log-normal distribution likelihood as loss function (learning the mu and sigma of the loss)

Price = (1 + loading) * (estimated claim) + fixed_loading
If predicted to be large claim, loading = 1
If not: loading = 0.15
fixed_loading = 5

Since I filter out most of the predicted large-claim policies, my average premium is quite low (~65). So the estimated profit ratio is about 15% + (5/65) = ~22%.


Right, but for those policies that we know, question is:
are we being fed the 1-5 years in the preprocess function, or are we only given year 5.

1 Like

Just to clarify this one, you are only given year 5. The test data will only include 100K rows all with year = 5.

If we run into errors because of this we’ll let you know!


Hmm, bugger! I assumed we’d get access to the previous 4 years, akin to the RMSE leaderboard.

I played it safe and included a csv with the number of claims per id_policy.


I assumed we would only get fed year 5 data, so consciously made decisions for the preprocessing step and modeling (I would’ve submitted a different model if we were going to get years 1-4 in the final evaluation). I assume the only way you get years 1-4 is if you save them with your submission, but even then you can only do it for the 57k policies in the training data.


That’s interesting!
My quick understanding from this is that you sell very expensive policies to people other insurers deemed “too risky” but that you hope are worthy of a second chance.

There’s definitely a lot of money there, and I left it on the table. If anyone similar to you has ever made a claim, I probably won’t sell to you…

Given your profit leaderboard position it worked at least once! :slight_smile: