I’m aware that the data set used for the RMSE leaderboard consists of 5k policies (20,000 records).
Would we be able to know what the baseline claims cost is for this set (i.e. the RMSE based on the average claims cost across all policies, and the cost on policies with claims)?
If you are asking what the mean value for the claim is for the RMSE data, I’m afraid I can’t reveal that information here (though I’m sure there are ways to figure that out).
However, if you are asking what a constant model, predicting the mean claim value, would score on the RMSE leaderboard, then the answer is actually around 502.4. This is one of our baselines that exist in the zip templates and the notebooks accessible here.
Could you clarify what you mean by a constant model? Is it referring to a naive model predicting some arbitrary constant value for all the claims cost?
Also, the link for the zip templates and notebooks for the baseline doesn’t seem to be working… Could you verify this please?
Ah sorry, for some reason the link on the post doesn’t seem to work. I will edit the post to fix it.
In the meantime, to clarify:
What I mean by a constant (mean) model
This is a model that memorises the mean value of the claim_amount column for all of the training data, and uses that as the estimated claim for every new policy it encounters.
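To make that concrete, here is a minimal sketch of such a constant model in Python, together with the RMSE calculation. This is only an illustration and not the actual baseline code from the templates: the class name `MeanModel`, the `rmse` helper and the four claim values are all made up for the example.

```python
import numpy as np


class MeanModel:
    """Hypothetical constant model: always predicts the training mean."""

    def fit(self, claim_amounts):
        # Memorise the mean of the claim_amount column from the training data.
        self.mean_ = float(np.mean(claim_amounts))
        return self

    def predict(self, n_policies):
        # Return the same estimate for every policy, whatever its features.
        return np.full(n_policies, self.mean_)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted claims."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up claim amounts, purely for illustration (most policies have no claim).
train_claims = np.array([0.0, 0.0, 0.0, 1200.0])

model = MeanModel().fit(train_claims)
preds = model.predict(len(train_claims))
train_rmse = rmse(train_claims, preds)

print(model.mean_, train_rmse)
```

On the data it was fitted on, the RMSE of this model is just the (population) standard deviation of the claims. On the leaderboard data the score will differ, since the mean comes from the training set.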
Where are the baselines?
We have 2 baselines:
Mean model. This can be found in both of the colaboratory notebooks (for R and for Python) as well as on the zip template gitlab repository. All should be linked on the submission page (just doub.
Logistic regression baselines. Both of these are linked in the colaboratory notebooks above. I am linking them separately here as well for R and for Python.
Yes, that one is the mean model. It is basically what you should get if you take the template notebooks, add your API key, and then submit as is.