R Starterpack (499.65 on RMSE leaderboard - 20th position) {recipes} + {tweedie xgboost}

Hey all,

During the Christmas break I created this R starterpack to help people get started with submitting code for an xgboost model.

I used a “tweedie” model, which might be new for some folks. It lets you model the claim amount directly and is an alternative to using a frequency model plus a severity model.
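To make the "model claim amount directly" idea concrete, here is a minimal base R sketch (not taken from the starterpack) that simulates a Tweedie-type claim amount as a Poisson number of claims with gamma severities. The portfolio parameters (`claim_rate`, `sev_shape`, `sev_scale`) are made-up values for illustration only.

```r
# Simulate total claim amount per policy as a compound Poisson-gamma variable.
# All parameter values below are hypothetical, chosen only for illustration.
set.seed(42)
n_policies <- 100000
claim_rate <- 0.1   # expected number of claims per policy
sev_shape  <- 2     # gamma shape of a single claim
sev_scale  <- 500   # gamma scale of a single claim

# Frequency part: number of claims per policy.
n_claims <- rpois(n_policies, lambda = claim_rate)

# Severity part: the sum of k gamma(shape, scale) claims is gamma(k * shape, scale).
claim_amount <- numeric(n_policies)
has_claim <- n_claims > 0
claim_amount[has_claim] <- rgamma(
  sum(has_claim),
  shape = n_claims[has_claim] * sev_shape,
  scale = sev_scale
)

# The Tweedie variance power implied by the gamma shape.
tweedie_power <- (sev_shape + 2) / (sev_shape + 1)

share_zero    <- mean(claim_amount == 0)            # point mass at zero
mean_claim    <- mean(claim_amount)                 # empirical pure premium
expected_mean <- claim_rate * sev_shape * sev_scale # theoretical pure premium

print(c(share_zero = share_zero, mean_claim = mean_claim,
        expected_mean = expected_mean, tweedie_power = tweedie_power))
```

The result is a single target with lots of exact zeros and a skewed positive part, which is the shape a Tweedie objective is meant to fit in one model instead of two.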

I also used the “recipes” package, which ensures that you won’t create extra dummy variables by mistake.

I purposefully set the hyperparameters to something absolutely stupid. I also didn’t do any clever feature engineering.

Then it got me my best RMSE score. It was around 3rd position on the RMSE leaderboard back then, so I backed out of my original plan to share it and forgot about it until today. :slight_smile:

I removed my trained_model.xgb, so you will have to at least re-run fit_model and re-zip everything before submitting. I’d also consider implementing a better pricing strategy: it simply adds a $1 markup. :slight_smile:
Anyway here it is: https://github.com/SimonCoulombe/aicrowd_insurancepricing_starterpack


:scream: I’m going to have to dig into this. Thanks so much!


Hey @simon_coulombe, thanks for sharing!

I did venture into the tweedie as well.

And also, having suffered from various forms of leakage throughout my project,
I was debating with myself on the following…

Would you consider using a variance power different than 1.50 as leakage?
Curious on your opinion on that.


For those not aware, for this competition, the Tweedie parameter should be between 1 and 2.

1.5 being a safe reasonable choice.

1 would be a Poisson model
2 would be a gamma model
You want a mixture of the two…
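For reference, the "parameter" here is the power $p$ in the Tweedie variance function. In xgboost it is the `tweedie_variance_power` setting of the `reg:tweedie` objective. Written out, with $\mu$ the expected claim amount and $\phi$ a dispersion parameter:

```latex
\operatorname{Var}(Y) = \phi \, \mu^{p}, \qquad \mu = \mathbb{E}[Y]

p = 1:\ \text{Poisson (with } \phi = 1\text{)} \qquad
p = 2:\ \text{gamma} \qquad
1 < p < 2:\ \text{compound Poisson-gamma}
```

The in-between case is the one with a point mass at zero plus a continuous positive part, which is why it suits claim amounts.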

One could select the parameter that best fits the training data. There are tools available to determine it.


My carefully handcrafted model that I’ve been using since week 1 uses a tweedie parameter I estimated using tweedie::tweedie.profile().

I also have Bayesian hyperparameter tuning, cleverly crafted features and capping of large losses.

I guess you could tune the tweedie parameter at the same time as the xgboost hyperparameters.

Its RMSE is 499.713 (vs 499.655 for the stupid model).

That’s when I gave up. :slight_smile:

I know at least a couple of the models above me are just GAMs.


it looks like you forgot to bake the x_raw when you made predictions.

Edit: my mistake, it’s done in the preprocess.


I’m embarrassed to say that I just used up my submission limit today to try different seeds… and @simon_coulombe, you got really lucky with the seed :smiley:. I guess 42 is the answer to everything.


haha! glad you’re having fun with it.

I’m not surprised it was a lucky draw. I don’t use it, I just wanted to share how to make a submission using a zip file and give a quick shout-out to the tweedie model. At the very least you’ll want to do some feature engineering, cap the outliers and find some decent hyperparameters.
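As an illustration of "cap the outliers", here is a minimal base R sketch that caps claim amounts at a quantile of the positive claims. The helper name `cap_losses` and the 99% default are assumptions for this example, not something from the starterpack. The right cap is a judgment call you would check with your own CV.

```r
# Cap large losses at a chosen quantile of the positive claim amounts.
# `cap_losses` and its default quantile are hypothetical, for illustration only.
cap_losses <- function(claim_amount, cap_quantile = 0.99) {
  positive <- claim_amount[claim_amount > 0]
  if (length(positive) == 0) {
    # Nothing to cap when there are no positive claims.
    return(list(cap = NA_real_, capped = claim_amount))
  }
  cap <- quantile(positive, probs = cap_quantile, names = FALSE)
  list(cap = cap, capped = pmin(claim_amount, cap))
}

# Toy data: mostly zeros, a few ordinary claims and one very large loss.
claims <- c(0, 0, 100, 200, 1e6)
res <- cap_losses(claims, cap_quantile = 0.5)
print(res$cap)
print(res$capped)
```

Computing the cap on the training data only, then reusing the same value everywhere else, avoids adding one more source of leakage.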

Trust your CV, no need to submit to the leaderboard all the time :slight_smile:



@michael_bordeleau, regarding the tweedie parameter and leakage: I’m not a thinker, but if it’s good enough for @arthur_charpentier4 then it’s good enough for me.

source: Computational Actuarial Science with R (Arthur Charpentier 2015)


Thanks for the feedback!

I was at a point where it was clear I had leakage somewhere in my project. And that’s when I started doubting everything I had done.
Made a checklist and decided what could be material or not.


For anyone interested, I recommend this nice lecture on machine learning pitfalls (leakage).

I laughed out loud at the 1-hour mark.

“If you torture the data long enough, it will confess”

[Screenshot: Lecture 17 - Three Learning Principles (YouTube)]


haha! that one is a classic among economists.
full quote is “If you torture the data long enough, it will confess to anything”


I tried using your xgboost parameters on my processed data with a lot of complex feature engineering, and it didn’t perform any better. I tried ensembling your xgboost with my other base models… and nothing is beating your original xgboost. Craziness… (I’m amusing myself with the RMSE leaderboard since at least it gives immediate feedback. I’m about out of ideas for the pricing strategy and none of what we tried worked well.)


A few years ago a member of my team, who is an actuary and Kaggle master, used to delight in tormenting the rest of us when his models out-performed ours on leaderboards.

He would always insist his out-performance was due to his process of selecting lucky seeds.

We all knew full well there’s no such thing as a lucky seed that generalises to the private leaderboard, but many a time I caught myself trying a few different seeds to see if I could get lucky and beat his model.

There is of course nothing wrong in running a few models with different seeds and taking the average result. That’s a recognised technique called bagging which will often improve a model at the cost of implementation complexity.
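The seed-averaging idea can be sketched in base R as follows. This is only an illustration under stated assumptions: `lm()` on a random 70% subsample stands in for a seed-dependent xgboost fit, and the data, `fit_predict` and `rmse` are made up for the example. Strictly speaking, bagging resamples the training rows, so varying only the seed is a close relative of it.

```r
# Average predictions from several seed-dependent fits.
# lm() on a random subsample is a stand-in for a stochastic learner such as xgboost.
set.seed(1)
n_train <- 500
train <- data.frame(x = runif(n_train))
train$y <- 100 * train$x + rnorm(n_train, sd = 30)
test <- data.frame(x = runif(200))
test$y <- 100 * test$x + rnorm(200, sd = 30)

# Hypothetical helper: the seed decides which rows the model sees.
fit_predict <- function(seed, train, test, frac = 0.7) {
  set.seed(seed)
  rows <- sample(nrow(train), size = floor(frac * nrow(train)))
  model <- lm(y ~ x, data = train[rows, ])
  unname(predict(model, newdata = test))
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

seeds <- c(42, 1, 2, 3, 4)
# Matrix with one row per test observation and one column per seed.
preds <- sapply(seeds, fit_predict, train = train, test = test)

avg_pred    <- rowMeans(preds)                         # the averaged model
single_rmse <- apply(preds, 2, rmse, actual = test$y)  # one RMSE per seed
avg_rmse    <- rmse(test$y, avg_pred)

print(round(single_rmse, 3))
print(round(avg_rmse, 3))
```

The RMSE of the averaged prediction can never be worse than the average of the individual RMSEs, but it can still lose to the single luckiest seed on a given leaderboard.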


Now you know why I gave up early: spend a few hours being “clever”, then get absolutely destroyed by a purposefully stupid model.

@simon_coulombe a little OT… will you now change the model from Xmas, given its performance in the last two weeks? :slight_smile: Just curious!

haha thanks for asking!

It’s honestly not a very strong model. I’m pretty sure it was only doing fine until people caught on that they needed to increase their prices a bit.

I’ll probably adjust my profit margin for the last week, but I don’t think I’ll revisit the model too much. I’m very happy that I managed not to sink too much time on this competition and would rather keep it that way.

I have spent a lot of time THINKING about it, but it doesn’t feel as bad :slight_smile:


Even though I know there isn’t any reason to do so, I’m so upset that I can’t beat this model…


cue me giving up a few months ago.
