I’d like to understand better what is in the test data.
- Is it a single row for year 5 for the same id_policy values that are in the training set?
- Is it a single row for year 4 for new id_policy values that are not in the training set?
- Is it a single row for year 5 for new id_policy values that are not in the training set?
- Is it a bunch of rows (years 1-4 with claim_amount and year 5 without) for new id_policy values that are not in the training set, where you have to provide a premium for year 5?
The test data is a single row per policy for year 5, with a mixture of id_policy values: some that are in the training data and some that are not.
The idea is that, like a real insurance company, the incoming portfolio for a new year will be a mixture of new customers and those for whom you have a history.
I hope this has clarified things a little more.
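A minimal sketch of how that mixture could be handled inside model code, assuming each row is a dict with an `id_policy` key. The sample IDs and the `split_by_history` helper are hypothetical, made up for illustration; only `id_policy`, `year` and `claim_amount` come from this thread.

```python
# Hypothetical miniature data: two policies with history, one test row per policy.
train_rows = [
    {"id_policy": "PL001", "year": 1, "claim_amount": 0.0},
    {"id_policy": "PL001", "year": 2, "claim_amount": 250.0},
    {"id_policy": "PL002", "year": 1, "claim_amount": 0.0},
]
test_rows = [
    {"id_policy": "PL001", "year": 5},  # returning customer
    {"id_policy": "PL999", "year": 5},  # new customer, no history
]


def split_by_history(train_rows, test_rows):
    """Separate test rows into those with and without training history."""
    known_ids = {row["id_policy"] for row in train_rows}
    returning = [r for r in test_rows if r["id_policy"] in known_ids]
    new = [r for r in test_rows if r["id_policy"] not in known_ids]
    return returning, new


returning, new = split_by_history(train_rows, test_rows)
print([r["id_policy"] for r in returning])  # ['PL001']
print([r["id_policy"] for r in new])        # ['PL999']
```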
For some rows in the test data we have claim_amount for the past 4 years (from the training data). But the RMSE and weekly leaderboard datasets contain policies separate from the training set, so historical claim_amount is not available for them?
Yes, RMSE and the weekly leaderboards have separate datasets and there is no historical claim information about these contracts available to you.
To clarify what the leaderboard datasets look like, see below.
RMSE leaderboard data
The data in this leaderboard is a uniform sample similar to your training data. It contains the data of 5K policy holders tracked over 4 years (20K contracts in total). The aim is to give you a general idea of how well your model performs on claim estimation.
Weekly average profit leaderboards
These are 9 weekly leaderboards using the data of approximately 15K policy holders over 4 years (60K contracts in total). Each week the leaderboard will contain a sample of these 60K contracts equally distributed among the 9 weeks.
In addition, we have made it so that the year of the policies in question generally increases with each weekly leaderboard. So, for example, the first weekly leaderboard this Saturday will contain mostly contracts with year = 1 in the data, while the last weekly leaderboard in late February will contain mostly contracts with year = 4. The final test data will contain information about 100K policy holders, all with year = 5, as noted previously in this thread.
I will update the overview page with some of this information shortly. Please let me know if this doesn’t answer your question.
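As a quick sanity check of the sizes quoted above, here is the arithmetic written out. The per-week figure assumes the 60K contracts are split exactly evenly across the 9 weeks with no overlap, which is one reading of "equally distributed" rather than something stated outright. The variable names are illustrative.

```python
YEARS_TRACKED = 4

rmse_policy_holders = 5_000
weekly_policy_holders = 15_000  # "approximately 15K" in the post
n_weekly_leaderboards = 9

rmse_contracts = rmse_policy_holders * YEARS_TRACKED
weekly_contracts = weekly_policy_holders * YEARS_TRACKED

# Assumption: an even, non-overlapping split across the weeks.
contracts_per_week = weekly_contracts / n_weekly_leaderboards

print(rmse_contracts)             # 20000
print(weekly_contracts)           # 60000
print(round(contracts_per_week))  # 6667
```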
I haven’t seen the test dataset, where is it?
In reality, an insurance company usually does not know what the incoming year’s portfolio will look like. They price policies, based on their model, as and when they come in during the new year. This is done without advance knowledge of the make-up of the whole portfolio.
To simulate this, we don’t provide you with the test dataset, nor do we provide you with the leaderboard datasets. This is one of the reasons that in this competition you are asked to provide your model code.
We use your model code on the test and leaderboard datasets to generate:
- Premium prices that are used in the profit leaderboard and final competition evaluation. This uses your
- Claim value estimates that are used in the RMSE leaderboard. This uses your
Please let me know if this doesn’t answer your question.
thanks @alfarzan ,
Thanks for the reply. This is what I initially understood, but @alexander_penkin had me confused because of "For some rows in test data we have claim_amount for past 4 years".
I guess his point was that:
- In the test data (year 5) you’ll have information about what happened in the first 4 years, thanks to the training data and id_policy.
- In the RMSE leaderboard, no id_policy matches the training data since it only covers 4 years, so you cannot use this information.
=> you can use past_claims in profit_leaderboard but not in rmse_leaderboard…
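One way to sketch that idea: build a lookup of past claims from the training data keyed by `id_policy`, and fall back to a default when a policy has no history (new customers in the test data, and every contract in the leaderboard datasets). The helper names, the sample rows and the choice of a mean as the feature are assumptions for illustration, not part of the competition code.

```python
from collections import defaultdict


def build_claim_history(train_rows):
    """Map each id_policy to the mean of its observed claim_amount values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in train_rows:
        totals[row["id_policy"]] += row["claim_amount"]
        counts[row["id_policy"]] += 1
    return {pid: totals[pid] / counts[pid] for pid in totals}


def past_mean_claim(id_policy, history, default=None):
    """Return the historical mean claim, or `default` for unseen policies."""
    return history.get(id_policy, default)


# Hypothetical training rows for a single policy over years 1-4.
sample_train = [
    {"id_policy": "PL001", "year": 1, "claim_amount": 0.0},
    {"id_policy": "PL001", "year": 2, "claim_amount": 400.0},
    {"id_policy": "PL001", "year": 3, "claim_amount": 0.0},
    {"id_policy": "PL001", "year": 4, "claim_amount": 0.0},
]
history = build_claim_history(sample_train)

print(past_mean_claim("PL001", history))  # 100.0
print(past_mean_claim("PL999", history))  # None (no history available)
```

Returning `None` rather than 0 for unseen policies keeps "no history" distinguishable from "history with no claims".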
Are we allowed to know the proportion of test rows that are id_policy values that are in the training set? I thought I read in one of the other threads that it was 60:40 in one direction or the other, but I can’t find that now and it’s not on the front page.
Yes, it’s not explicitly written. The final test set has 100K policy IDs with year = 5. Your training data consists of 60K policies with years 1 - 4. The 5th year for all of these policies appears in the test.
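Spelled out, those numbers give the split asked about. A small sketch, assuming every one of the 60K training policies appears exactly once among the 100K test policies (the variable names are illustrative):

```python
test_policies = 100_000
train_policies = 60_000

# All training policies reappear in the test set with year = 5.
with_history = train_policies
without_history = test_policies - with_history

share_with_history = with_history / test_policies
share_without_history = without_history / test_policies

print(share_with_history)     # 0.6
print(share_without_history)  # 0.4
```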
In the dataset section of the overview page this is described in detail, if I’m correct?