Ye olde library

Hey all,

Any cool papers you want to share?

@glep pointed me to this paper @arthur_charpentier4 released this week. It proposes a solution to the empirical problem where mean(predicted claims) < mean(actual claims) when using xgboost.
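I haven't read the paper yet so I won't claim this is its method, but for anyone wanting intuition for the bias itself: the crudest fix actuaries reach for is a multiplicative balance correction, rescaling predictions so their mean matches the observed mean. A toy numpy sketch (all names and numbers made up):

```python
import numpy as np

def balance_correct(y_pred, y_actual):
    """Rescale predictions so their mean matches the actuals' mean."""
    return y_pred * (y_actual.mean() / y_pred.mean())

rng = np.random.default_rng(0)
# simulated claim amounts, and a model that under-predicts on average
actual = rng.gamma(shape=2.0, scale=500.0, size=1000)
pred = 0.9 * actual + rng.normal(0.0, 50.0, size=1000)

corrected = balance_correct(pred, actual)
print(pred.mean() < actual.mean())                    # True: the raw bias
print(np.isclose(corrected.mean(), actual.mean()))    # True after correction
```

Obviously this only fixes the aggregate, not calibration within segments, which is presumably why a whole paper is needed.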

@nigel_carpenter and @callum_hughes also pointed to an interesting notebook, paper, and accompanying dataset:


This is a good one examining the dynamics of the winner’s curse in car insurance:


In my “to read” pile after I’m done with Arthur Charpentier’s paper that fixes mean(pred) << mean(actual) for xgboost… thanks!

Not a paper as such, but a pricing competition I’ll always remember is the Porto Seguro competition on Kaggle. It was special in that it was the first time I recall seeing xgboost not be the major part of the winning solution. Instead, the winner, Michael Jahrer, used denoising autoencoders (a neural network approach).

Michael won the competition by a wide margin, and how he won became a big discussion topic in the Kaggle community, with many people trying to replicate his approach.

While we’ve not seen his winning approach transfer to other competitions with similar success, it does, for me, show that when the data allows, neural-network-based approaches can outperform the currently ubiquitous GBM-type approaches.


I also spent some time reading Michael’s approach during this competition. In fact I tried some NN approaches too, but they couldn’t beat XGBoost significantly.
It looks like denoising autoencoders (DAEs) perform particularly well when the features are anonymous, as they let the model learn the relationships between features automatically. In this competition, human feature engineering still had an edge.
In the Jane Street competition, DAEs are a common approach, as the features are also anonymous and the dimensionality is huge.
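For anyone who hasn’t seen one, the mechanics of a DAE are simple even if the winning solutions are huge: corrupt the input with noise, train the network to reconstruct the clean input, then use the hidden layer as learned features for a downstream model. A toy numpy sketch (sizes, noise level, and learning rate are all made up, and this is nowhere near Michael’s actual architecture):

```python
import numpy as np

rng = np.random.default_rng(42)

# toy "anonymous" tabular data: 200 rows of 10 correlated features
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

n_in, n_hidden = X.shape[1], 5
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in)); b2 = np.zeros(n_in)

def reconstruct(Z):
    return np.tanh(Z @ W1 + b1) @ W2 + b2

mse_before = ((reconstruct(X) - X) ** 2).mean()

lr = 0.01
for epoch in range(500):
    # the "denoising" part: corrupt the input...
    X_noisy = X + rng.normal(scale=0.5, size=X.shape)
    H = np.tanh(X_noisy @ W1 + b1)      # encoder
    X_hat = H @ W2 + b2                 # linear decoder
    err = X_hat - X                     # ...but reconstruct the CLEAN input
    # plain full-batch backprop on squared error
    dW2 = H.T @ err / len(X);           db2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    dW1 = X_noisy.T @ dH / len(X);      db1 = dH.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

mse_after = ((reconstruct(X) - X) ** 2).mean()

# the hidden activations are the learned features for a downstream model
features = np.tanh(X @ W1 + b1)
print(features.shape)  # (200, 5)
```

The noise forces the bottleneck to capture the correlation structure between features rather than memorising them, which is presumably why it shines on anonymous data where a human can’t hand-craft interactions.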
