Now that I have my final submission in and I’m just waiting for the results, my weekend is feeling quite empty. Sooo… I figured I’d start a discussion on some thoughts about the competition while we’re all still pretty engaged. Tbh, in 3 weeks I will probably forget a lot of these, so jotting them down somewhere is probably good.
Thoughts about technical aspects
For sure, I think the technical side was a challenge for a lot of participants. This has been largely mitigated by how responsive and helpful the various admins have been. Kudos to the folks at AIcrowd and Imperial College London.
I think a lot of the technical aspects have now been addressed/discussed, and the admins have a good idea of what should change to make the process easier, so I won’t repeat them here (e.g. model-saving issues, which package versions are allowed in the evaluation environment). My main pain point was with the preprocessing function. Having the preprocessing exclude the ydata doesn’t fit many of the major modeling frameworks. For example, @simon_coulombe’s xgboost starter kit uses R’s tidymodels, and he’s basically saving the fitted preprocessing as an R object (rds) outside of the required functions, then loading that object inside the required preprocessing function. This separation of xdata and ydata also doesn’t work for steps like target encoding. By the way @simon_coulombe, I finally beat your xgboost model’s RMSE score with a not-too-complicated model… It really made my day.
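In case it helps anyone reading later, here’s a minimal sketch of that save-the-preprocessing-as-an-rds workaround, assuming a recipes-based pipeline; the function name preprocess_x_data() and the formula/steps are placeholders, not the actual submission template:

```r
library(recipes)

# Offline, where both xdata and ydata are available: fit the preprocessing once.
fit_preprocessing <- function(training_data) {
  rec <- recipe(claim_amount ~ ., data = training_data) |>
    step_other(all_nominal_predictors(), threshold = 0.005) |>
    step_dummy(all_nominal_predictors()) |>
    prep()
  saveRDS(rec, "prepped_recipe.rds")  # shipped alongside the model files
  rec
}

# Inside the required preprocessing function, which only receives the xdata:
preprocess_x_data <- function(x_raw) {
  rec <- readRDS("prepped_recipe.rds")  # fitted offline, only loaded here
  bake(rec, new_data = x_raw)
}
```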
Thoughts about data
I was super impressed with the data anonymization. Motor insurance is one of the insurance segments where you can still do a lot of modeling with anonymized data. This would not be true for, say, Homeowners, where the specific address is pretty key to modeling expected loss.
Predicting the Year 5 data is a cool idea. I wonder if some sort of customer churn idea could be introduced so there is no “data leakage”, since in the current setup, we know that all of the policies we are modeling have been with the insurance company for 5 years.
Another gripe I have with the preprocessing function is that, the way this is set up, when the preprocessing is run on the Year 5 data, none of my engineered features based on prior years’ history will work. So my only option is to save a copy of the ~60K policy histories alongside my submission models in order to use the prior-year features. I can’t do anything about the 40K policies that are in the RMSE/weekly profit evaluation, even though I should have access to their Years 1-4 history to predict Year 5.
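As a rough sketch of what that workaround looks like in practice (column names like id_policy and claim_amount are my assumptions here, not guaranteed to match the actual schema):

```r
library(dplyr)

# Offline: save the Years 1-4 history next to the submission model files.
# saveRDS(training_data, "policy_history.rds")

# At prediction time: rebuild prior-year features and join them onto Year 5.
add_history_features <- function(x_year5) {
  history <- readRDS("policy_history.rds")

  prior_features <- history |>
    group_by(id_policy) |>
    summarise(
      n_prior_claims   = sum(claim_amount > 0),
      avg_prior_amount = mean(claim_amount),
      .groups = "drop"
    )

  # Policies without saved history (e.g. the other 40K) simply get NAs.
  left_join(x_year5, prior_features, by = "id_policy")
}
```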
It would be great if, in addition to the claim_amount, we were also provided a claim_count in the training data. Generally, frequency/severity models in the industry would look at counts of unique claim occurrences, not counts at the policy-year level. For example, it’d be helpful to see how many different claims added up to the claim_amount we have for each policy-year.
The reinsurance aspect is interesting, but for personal auto, reinsurance is not typically applied at the individual policy level (at least not in the US, where I’m based). I understand the reason the capping was done, but there’s no need to call it “reinsurance”. When the capping at 50K was introduced, it was not obvious to me that the training data had also been changed. I have been running my models in Google Colab, and it took me a full week to figure out that my models were changing because the training data I was downloading had been changed to cap losses at 50K. Depending on the model, introducing this cap helps some and hurts others. I’d probably rather have the uncapped losses and let participants decide whether to cap before training their models, knowing that the evaluations will cap the losses. In my code, I actually have a cell that “uncaps” the losses back to the original training data that was available for download before the “reinsurance” introduction.
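That “uncapping” cell is nothing fancy: it just swaps the capped claim_amount back in from an older, uncapped copy of the download (file and column names below are placeholders):

```r
library(dplyr)

capped   <- read.csv("training_data_capped.csv")    # current download, losses capped at 50K
original <- read.csv("training_data_original.csv")  # earlier download, uncapped losses

# Put the original claim_amount back, matched on policy and year.
uncapped <- capped |>
  select(-claim_amount) |>
  left_join(original |> select(id_policy, year, claim_amount),
            by = c("id_policy", "year"))

# And re-capping, for anyone who prefers to train on the capped losses:
recapped <- uncapped |>
  mutate(claim_amount = pmin(claim_amount, 50000))
```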
Thoughts about the weekly feedback
I think this is probably the area with the most room for improvement. The feedback is quite different from what a real insurance company would get from the market. The feedback has improved over the weeks, but that improvement came at the expense of consistency. Since this is a market-based game, consistency matters: a big part of the game is reasoning about what other competitors are reacting to, and introducing new elements into the feedback makes that really challenging to think through.
All in all, I thought this was a great competition and I’m really glad I participated. I learned so much about modeling and how to think through pricing strategies. I’m very grateful for all the work that the admins put in, and props for coming up with such an intriguing competition!
I would disagree with the first point under “Thoughts about data”. In at least one country I know, Homeowners claims are much easier to model well compared to motor…
Motor, which generates more premium and more competition, is in my opinion much harder to model on the claims side, especially as there is so much legislation etc. in the background and professionals oversimplify the modelling.
@chezmoi I think @lolatu2 is referring more to the fact that if you try to remove personal information such as the address from property related data (e.g. home insurance), then you lose a lot of the signal, but in motor this is easier to do.
Am I the only one who thought about bringing in some external data, which I expect could give a significant improvement in predictive power?
I explored that path (kind of seriously), but ultimately gave up because I’m cheap and felt it might be considered cheating (even though I don’t recall seeing any restrictions on using external data).
What external data did you have in mind? I thought about using some sort of external data for trend and geography, but those attributes are pretty well anonymized. As for the “rating variable” factors, those are calibrated to a specific rating plan, so they can’t easily be exported for use in a different model.
Indeed, I figured there are a few attributes you could possibly join external data onto if you found it, but I also gave up after an hour or so of searching.
Ah, given that there are barely 24 hours left in this competition…
I’ll just say that I’m not responsible for anyone’s sleep deprivation, or its consequences.
We were given the anonymized vehicle make/model, the manufacturer weight and speed, and the vehicle value.
There are databases of vehicle information out there. You could use weight and speed as merging keys.
Further data cleaning would be required to ensure a good merge with vehicle value.
Moreover, you could cross-check reasonableness by comparing vehicle popularity, so that the appearance frequency in the training data reflects the sales numbers of that vehicle in Europe…
Example:
rthsjeyjgdlmkygk is a Renault Clio?
Once merged, you have plenty more info on the vehicle, such as engine specs. That wealth of additional vehicle specs is exactly what an xgboost model would love to have!
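I never built it, but the merge itself would look roughly like this; the external table, its columns, and the vh_weight / vh_speed names are all hypothetical:

```r
library(dplyr)

# Hypothetical external European vehicle database with weight / top speed / specs.
ext_vehicles <- read.csv("european_vehicle_db.csv")

enriched <- training_data |>
  # Round the merge keys a little to survive unit differences and noise.
  mutate(weight_key = round(vh_weight, -1),
         speed_key  = round(vh_speed,  -1)) |>
  left_join(
    ext_vehicles |>
      mutate(weight_key = round(kerb_weight_kg, -1),
             speed_key  = round(top_speed_kmh,  -1)),
    by = c("weight_key", "speed_key")
  )

# Then sanity-check: match rate, agreement with vehicle value, and whether the
# frequency of each matched model in the data looks like its European sales.
```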
Hah, that’s a great idea! I considered the popularity aspect but assumed that this differs quite a bit by country in Europe? I didn’t dig deeper into it. (I’m not staying up late to figure this out.) Cheers to everyone still working on their submission. Good luck everyone!
An idea: you can use a logistic regression for “has a claim during the year” instead of a Poisson for “has N claims during the period”. This is possible because all policies have the same exposure (1 year).
Disclaimer: I haven’t tried it.
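Sketched out, pairing it with a gamma severity model like further down the thread (the feature names here are just examples), it would look something like:

```r
# Frequency: probability of at least one claim during the year.
freq_fit <- glm(I(claim_amount > 0) ~ pol_coverage + vh_age + drv_age1,
                data = training_data, family = binomial())

# Severity: expected claim size, fitted on the policies that claimed.
sev_fit <- glm(claim_amount ~ pol_coverage + vh_age + drv_age1,
               data = subset(training_data, claim_amount > 0),
               family = Gamma(link = "log"))

# Expected cost per policy-year = P(claim) * E[severity | claim].
expected_cost <- predict(freq_fit, newdata = new_policies, type = "response") *
                 predict(sev_fit,  newdata = new_policies, type = "response")
```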
I’ve been thinking about target encoding too. Haven’t found an easy way to do it using {tidymodels} so I used a different approach.
I’m also wary of target encoding… @michael_bordeleau will remember how it sabotaged models back in a previous competition.
…Speaking of sabotage: there is a big trap in my starter pack that I noticed earlier today. New, unknown cars are given NA values instead of the “other” value, and the predicted claim comes out around $25 instead of $100 when that happens. I think the fix is that step_other() has to go before step_string2factor(). Who knows, maybe that’s what will bring me back to profitability.
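For anyone else on the same starter pack, the reordering described above would look something like this (the vh_make_model name and the threshold are placeholders):

```r
library(recipes)

# Pool rare / unseen make-models into "other" BEFORE converting strings to
# factors, so new vehicles in the evaluation data don't end up as NA.
rec <- recipe(claim_amount ~ ., data = training_data) |>
  step_other(vh_make_model, threshold = 0.005, other = "other") |>
  step_string2factor(vh_make_model) |>
  prep()
```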
I had the same idea. I manually found two cases, which I believe were the Chevrolet Aveo and the Nissan Micra. I decided it would take too much time, and I’m not sure where I would go to source a European vehicle database.
I think even without geography there is plenty you can do. That said, it really depends on what you want to do in terms of pricing and, more importantly, how you sell the insurance, if at all.
@simon_coulombe
An idea: you can use a logistic regression for “has a claim during the year” instead of a Poisson for “has N claims during the period”. This is possible because all policies have the same exposure (1 year).
Disclaimer: I haven’t tried it.
I’ve tried this (logistic and gamma instead of Poisson and gamma) for my initial baseline model, with simple groupings of selected variables. It gave an RMSE of around 500.6… which was better than the constant baseline model from the competition, but it only ranked something like 180th on the leaderboard…
Actually, my final model that beat @simon_coulombe’s xgboost starter kit was an xgboost “frequency” classifier combined with an xgboost (gamma) severity model. It got an RMSE of 499.553.
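In case the structure is useful to anyone, it was essentially the logistic/gamma idea quoted above with xgboost swapped in for the GLMs; the parameters below are illustrative, not my actual tuning:

```r
library(xgboost)

x_mat     <- as.matrix(x_train)   # numeric feature matrix after preprocessing
has_claim <- as.numeric(y_train > 0)

# Frequency: xgboost classifier for P(at least one claim in the year).
freq_model <- xgboost(data = x_mat, label = has_claim,
                      objective = "binary:logistic",
                      nrounds = 200, max_depth = 4, eta = 0.05, verbose = 0)

# Severity: xgboost gamma regression on the claiming policies only.
claimed   <- y_train > 0
sev_model <- xgboost(data = x_mat[claimed, ], label = y_train[claimed],
                     objective = "reg:gamma",
                     nrounds = 200, max_depth = 3, eta = 0.05, verbose = 0)

# Expected claim cost = P(claim) * E[severity | claim].
x_new_mat      <- as.matrix(x_new)
expected_claim <- predict(freq_model, x_new_mat) * predict(sev_model, x_new_mat)
```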
If you were an insurer with a British subsidiary, you might have access to ABI/Thatcham data (back in the day). The cost of getting it otherwise, though…