I think there are quite a lot of senior actuaries / data scientists participating in this competition. Would you mind sharing how the industry prices a GI policy?
From the approach-sharing post, I can see many of us have tried trees, NNs and even reinforcement learning. It would be interesting to know if these approaches are currently being used or explored in practice. According to a friend of mine (general insurance, Asia), GLMs still dominate.
I work as an actuarial consultant at EY Spain. We do not normally participate in pricing projects. However, it is a topic that interests me personally and that is why I have participated in this competition.
I can tell you that Spanish insurers still mostly use GLMs. However, several large insurers have started to introduce machine learning techniques for pricing, as well as in other areas such as fraud detection.
Because of this, I estimate that in 3 or 4 years we will see these models in active use, approached from an explainable machine learning perspective in order to justify them to the different national supervisory authorities.
Not an actuary, but from what I’ve seen in 3 insurance companies, pricing in Canada is also mostly done with GLMs.
This is mostly due to the regulators, which vary by province. Non-pricing models such as fraud detection, churn probability or conversion rate can lean more towards machine learning and less towards interpretability.
I can only say that this apparent GLM dominance did feature in a lot of our conversations during the game design. I can understand if people are hesitant to share this information, but what I remember from early conversations was that some were using other models with a GLM wrapper for regulatory reasons, as @simon_coulombe and @RHG hint.
For example the output of an NN would become one of the GLM features and then they can tune the effect by manually scaling the GLM coefficient until the regulator is happy. That kind of thing. Though I’m not in the industry myself.
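To make that idea concrete, here is a toy sketch of what it could look like in Python - purely illustrative, with synthetic data and made-up column names, not how any particular insurer actually does it:

```python
# Toy illustration: use an NN score as a GLM feature, then manually scale
# its coefficient. Synthetic data and made-up column names throughout.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "driver_age": rng.integers(18, 80, 5000),
    "vehicle_value": rng.uniform(5e3, 5e4, 5000),
})
df["claim_count"] = rng.poisson(0.1, 5000)

# 1) Fit an NN on the raw features (a frequency model here, for simplicity).
nn = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
nn.fit(df[["driver_age", "vehicle_value"]], df["claim_count"])
df["nn_score"] = nn.predict(df[["driver_age", "vehicle_value"]])

# 2) Use the NN score as one feature in a Poisson GLM with log link.
X = sm.add_constant(df[["driver_age", "nn_score"]])
glm = sm.GLM(df["claim_count"], X, family=sm.families.Poisson()).fit()

# 3) Manually dampen the NN effect, e.g. keep only half of it, and re-price.
params = glm.params.copy()
params["nn_score"] *= 0.5
adjusted_frequency = np.exp(X @ params)
```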
Based on my working experience and on this competition, GLMs are very robust to noisy and volatile loss data because they are much less flexible in their model specification, provided you use main effects plus carefully selected interactions that have intuitive appeal. Neural networks and GBMs can pick up a lot of noise, and if you aren’t extremely careful tuning them they generalise to new data worse than a GLM. I think there is a lot of value in using machine learning to find insights that can be fed into GLMs to improve their accuracy, or perhaps in ensembling GLMs with a GBM or NN, but GLMs are very strong even by themselves, especially when using penalised regression.
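As an aside, here is a minimal sketch of what I mean by a penalised frequency GLM, using scikit-learn on synthetic data with made-up feature names (just an illustration, not the competition setup):

```python
# Minimal sketch: L2-penalised Poisson GLM for claim frequency in scikit-learn.
# Synthetic data and made-up feature names, purely for illustration.
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "driver_age": rng.integers(18, 80, 5000),
    "vehicle_age": rng.integers(0, 20, 5000),
})
y = rng.poisson(0.1, 5000)  # observed claim counts

# alpha controls the penalty strength: more shrinkage = less noise picked up.
glm = PoissonRegressor(alpha=1.0)
print(cross_val_score(glm, X, y, scoring="neg_mean_poisson_deviance", cv=5).mean())

glm.fit(X, y)
expected_frequency = glm.predict(X)
```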
I have been a pricing actuary for a few years and can tell you what I saw in different companies. I may lack objectivity here, sorry in advance.
Here is what I see for risk modeling in pricing:
GLMs (Generalized Linear Models): prediction(X) = g^{-1}(\sum_d \beta_d \times X_d), with possible interactions if relevant. g is often a log, so g^{-1} is an exponential, leading to a multiplicative formula.
Here all the effects are linear (so of course it does not work well as soon as one wants to capture non-linear effects, such as the effect of an age variable). Below are 2 examples - one good and one bad - with the observed values in purple and the model in green:
It is possible to capture non-linearity by “extending” the set of input variables, for instance by adding transformed versions of the variables to the data-set - so one would use age, age^2, age^3… or other transformations to capture non-linearities. This is very, very old-school, tedious, error-prone, and lacks flexibility.
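To make the transformation approach concrete, here is a minimal sketch with statsmodels on synthetic data (hypothetical column names, purely illustrative):

```python
# Old-school GLM with manual polynomial transformations of age.
# Synthetic data and hypothetical column names, purely illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"driver_age": rng.integers(18, 80, 5000)})
# Synthetic U-shaped frequency: young and old drivers claim more often.
df["claim_count"] = rng.poisson(0.05 + 0.00005 * (df["driver_age"] - 45) ** 2)

# Log link => multiplicative rating formula; age, age^2, age^3 capture the curvature.
glm = smf.glm(
    "claim_count ~ driver_age + I(driver_age**2) + I(driver_age**3)",
    data=df,
    family=sm.families.Poisson(),
).fit()
print(glm.summary())
```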
GAMs (Generalized Additive Models): prediction(X) = g^{-1}(\sum_d f_d(X_d)): the predictions are obtained from the sum of non-linear effects of the different variables; you can enrich the approach by including interactions f_{d,e}(X_d, X_e) if relevant.
The very strong point of this approach is that it is transparent (the user can directly look into the model, decompose it and understand it, without having to rely on indirect analyses such as PDP, ICE, ALE, …) and easy to put into production (the models are basically tables). So they are a powerful tool to prevent adverse selection while ensuring the low-risk segments are well priced, and they are often requested by risk management or regulators.
However, these models were often built manually (the user selects which variables are included and what shape the functions have - polynomial, piece-wise linear, step functions, …), either through proprietary software or programming languages (e.g. splines with Python / R). For this reason GAMs are often opposed to ML methods and suffer from a bad reputation.
Newer approaches allow the creation of GAM models through machine learning while keeping the GAM structure (I believe Akur8 leads the way here - but I may lack some objectivity as I work for this company). The idea is that the ML algorithm builds the optimal subset of variables and the shape of the f_d functions to provide a parsimonious and robust model while minimizing the errors, removing all the variables or levels that do not carry significant signal. The user runs grid-searches to test different numbers of variables / levels of robustness and picks the “best one” for their modeling problem - “best one” being the model that maximizes an out-of-sample score over a k-fold, after several more qualitative sanity checks from the modeler.
For instance, below is a GAM fitted on a non-linear variable (driver age):
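If you want to experiment with this kind of model yourself, here is a minimal open-source sketch using the pygam package on synthetic data - an assumption on my side, this is not the tooling described above:

```python
# Minimal open-source sketch of a one-variable frequency GAM with pygam.
# Synthetic data, purely illustrative.
import numpy as np
from pygam import PoissonGAM, s

rng = np.random.default_rng(0)
driver_age = rng.integers(18, 80, 5000)
# Synthetic U-shaped frequency effect of age, just for illustration.
claim_count = rng.poisson(0.05 + 0.00005 * (driver_age - 45) ** 2)

X = driver_age.reshape(-1, 1)
gam = PoissonGAM(s(0, n_splines=20)).fit(X, claim_count)

# The fitted spline can be inspected directly, e.g. tabulated into a rating table.
grid = gam.generate_X_grid(term=0)
partial_effect = gam.partial_dependence(term=0, X=grid)
```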
Tree-based methods (GBMs, RF…): we all know these well; they are associated with “machine learning”, there are very good open-source packages (SKLearn, xgboost, lightGBM…), and it is relatively simple to use them to build a good model. The drawback is that they are black boxes, meaning the models can’t be directly apprehended by a user, so they need to be simplified through the classic visualization techniques to be (partially) understood. For instance, below is an ICE plot of a GBM - the average trend is good but some individual curves, e.g. the one in bold, are dubious:
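For reference, here is a minimal sketch of how such an ICE plot can be produced with scikit-learn, again on synthetic data with made-up feature names (not the plot shown above):

```python
# Sketch: fit a GBM and draw ICE curves plus their average (PDP) for one feature.
# Synthetic data and made-up feature names, purely illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "driver_age": rng.integers(18, 80, 5000),
    "vehicle_value": rng.uniform(5e3, 5e4, 5000),
})
y = rng.poisson(0.05 + 0.0005 * np.abs(df["driver_age"] - 45))

gbm = HistGradientBoostingRegressor(loss="poisson", max_iter=200).fit(df, y)

# kind="both" draws individual ICE curves plus the bold average (PDP) line.
PartialDependenceDisplay.from_estimator(
    gbm, df, features=["driver_age"], kind="both", subsample=100
)
plt.show()
```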
No models: a surprisingly high number of insurance companies do not have any predictive model to compute the cost of a client at underwriting and do not know their loss ratios in advance. They track the loss ratios (claims paid / premium earned) on different segments, along with their conversions, and try to correct course if things go too far off-track.
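The monitoring itself is basically just a loss-ratio table by segment, something like this sketch with made-up numbers:

```python
# Sketch of the "no models" monitoring approach: loss ratios by segment.
# Hypothetical policy-level data with earned premium and paid claims.
import pandas as pd

policies = pd.DataFrame({
    "segment": ["young_driver", "young_driver", "fleet", "fleet", "standard"],
    "premium_earned": [1200.0, 900.0, 5000.0, 4500.0, 700.0],
    "claims_paid": [1500.0, 0.0, 3200.0, 6100.0, 250.0],
})

loss_ratios = (
    policies.groupby("segment")[["claims_paid", "premium_earned"]].sum()
    .assign(loss_ratio=lambda t: t["claims_paid"] / t["premium_earned"])
)
print(loss_ratios)
```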
I have seen many firms in Europe, where a very large majority of insurers use GAMs; most of them use legacy solutions and build these GAMs manually; a growing share is switching to ML-powered GAMs (thanks to us).
There is a lot of confusion as insurers tend to use the term “GLM” to describe both GLMs and GAMs.
A minority of insurers - usually smaller and more traditional ones - use pure GLMs or no models at all.
Many insurers considered using GBMs in production but did not move forward (too much to lose, no clear gain), or leverage GBMs only to get insights relevant for the manual production of GAMs (for instance identifying the variables with the highest importance or interactions). I have heard rumors that some people did move forward with GBMs but haven’t heard of anything very convincing.
In the US the situation is a bit less clear, with a larger share of insurers using really old linear GLMs with data transformations, some using GAMs, and rumors of GBMs being used either directly or as scores that enter GLM formulas. The market is heavily regulated and varies strongly from one state to another, leading to different situations. In Asia (Japan, Korea…) I have met people starting to use GAMs; the market is also very regulated there.
Thanks @davidlkl !
Yes this area of research is quite active (“interpreting” a black box is nice, building a transparent one from the beginning is better).
I didn’t know about the HKU work: thanks for sharing!
For the “no models” part, your description is exactly what I heard from my friends working in this field. Some insurers may simply price at a discount to their competitors’ pricing…
Of course, with anti-discrimination legislation, having evidence of a factor does not mean you can or are allowed to use it - and that is before you even look at ethics.
This also means that pricing driven by models you cannot control only provides deniability for some, and it may still go wrong.