RMSE leaderboard validation data

stochastickang · February 5, 2021, 5:31am

Hi,

I’m aware that the data set used for the RMSE leaderboard consists of 5k policies (20,000 records).

Would we be able to know what the baseline claims cost is for this set (i.e. the RMSE based on the average claims cost over all policies, and on the cost over policies with claims)?

alfarzan · February 5, 2021, 9:12am

Hi @stochastickang and welcome to the discussion!

If you are asking what the mean value for the claim is for the RMSE data, I’m afraid I can’t reveal that information here (though I’m sure there are ways to figure that out).

However, if you are asking what a constant model, predicting the mean claim value, would score on the RMSE leaderboard, then the answer is around 502.4. This is one of our baselines, which are available in the zip templates and the notebooks accessible here.

EDIT: fixed broken link.

stochastickang · February 5, 2021, 12:06pm

Thanks for your response.

Could you clarify what you mean by a constant model? Is it referring to a naive model predicting some arbitrary constant value for all the claims cost?

Also, the link for the zip templates and notebooks for the baseline doesn’t seem to be working… Could you verify this please?

Thanks

alfarzan · February 5, 2021, 12:31pm

Ah sorry, for some reason the link on the post doesn’t seem to work. I will edit the post to fix it.

In the meantime, to clarify:

What I mean by a constant (mean) model

This is a model that memorises the mean value of the claim_amount column for all of the training data, and uses that as the estimated claim for every new policy it encounters.
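As a rough illustration of the idea (a minimal sketch, not the code from the official templates), a constant mean model and its RMSE could look like this in Python. The class name `MeanModel`, the helper `rmse`, and the toy numbers are all hypothetical; only the `claim_amount` column name comes from the competition data.

```python
import math
from statistics import mean


class MeanModel:
    """Hypothetical constant model: always predicts the mean training claim."""

    def fit(self, claim_amounts):
        # Memorise the mean of claim_amount over all of the training data.
        self.mean_claim = mean(claim_amounts)
        return self

    def predict(self, policies):
        # Use the memorised mean as the estimated claim for every policy,
        # ignoring the policy's features entirely.
        return [self.mean_claim for _ in policies]


def rmse(actual, predicted):
    # Root mean squared error between actual and predicted claims.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(mean(squared_errors))


# Toy numbers made up for illustration (not competition data).
train_claim_amount = [0.0, 0.0, 0.0, 1000.0]
new_policies = ["policy_a", "policy_b"]  # placeholder policy records
new_claim_amount = [0.0, 500.0]

model = MeanModel().fit(train_claim_amount)
predictions = model.predict(new_policies)
score = rmse(new_claim_amount, predictions)
```

With these toy numbers the model predicts 250 for both new policies, and the RMSE against the actual claims is also 250. The leaderboard score quoted in this thread comes from the real training and validation data, which this sketch does not reproduce.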

Where are the baselines?

We have 2 baselines:

Mean model. This can be found in both of the colaboratory notebooks (for R and for Python) as well as on the zip template gitlab repository. All should be linked on the submission page (just doub.
Logistic regression baselines. Both of these are linked in the colaboratory notebooks above. I am also linking them separately here, for R and for Python.

lolatu2 · February 22, 2021, 3:51am

For the two baseline models, is the mean model the one that’s slightly worse at 504.241?

alfarzan · February 22, 2021, 11:04am

Yes, that one is the mean model. It is what you should get if you take the template notebooks, add your API key, and then submit as is.