What's in the test data?

simon_coulombe · December 21, 2020, 5:58pm

Hi,
I’d like to understand better what is in the test data.

- Is it a single row for year 5 for the same “id_policy” values that are in the training set?
- Is it a single row for year 4 for new id_policy values that are not in the training set?
- Is it a single row for year 5 for new id_policy values that are not in the training set?
- Is it a bunch of rows (years 1-4 with claim_amount and year 5 without) for new id_policy values that are not in the training set, for which you have to provide a premium for year 5?

thanks

alfarzan · December 21, 2020, 6:12pm

Hi @simon_coulombe

The test data contains a single row per policy, for year 5, with a mixture of id_policy values that are in the training data and values that are not.

The idea is that, as at a real insurance company, the incoming portfolio for a new year will be a mixture of new customers and those for whom you have a history.

I hope this has clarified things a little more.

simon_coulombe · December 21, 2020, 6:12pm

perfect, thanks

alexander_penkin · December 22, 2020, 5:05pm

Hi,
For some rows in the test data we have claim_amount for the past 4 years (from the training data). But do the RMSE and weekly leaderboard datasets contain separate policies from the training set, so that historical claim_amount is not available for them?

alfarzan · December 22, 2020, 6:53pm

Hi @alexander_penkin

Yes, RMSE and the weekly leaderboards have separate datasets and there is no historical claim information about these contracts available to you.

To clarify what the leaderboard datasets look like, see below.

RMSE leaderboard data

The data in this leaderboard is a uniform sample similar to your training data. It contains the data of 5K policy holders tracked over 4 years (20K contracts in total). The aim is to give you a general idea of how well your model performs on claim estimation.

Weekly average profit leaderboards

These are 9 weekly leaderboards using the data of approximately 15K policy holders over 4 years (60K contracts in total). Each weekly leaderboard will contain a sample of these 60K contracts, with the contracts distributed equally among the 9 weeks.

In addition, we have made it so that the year of the policies in question generally increases with each weekly leaderboard. So, for example, the first weekly leaderboard this Saturday will contain mostly contracts with year = 1, while the last weekly leaderboard in late February will contain mostly contracts with year = 4. The final test data will contain information about 100K policy holders, all with year = 5, as noted previously in this thread.

I will update the overview page with some of this information shortly. Please let me know if this doesn’t answer your question.

simon_coulombe · December 23, 2020, 8:15pm

I haven’t seen the test dataset, where is it?

alfarzan · December 23, 2020, 8:32pm

Hi @simon_coulombe

In reality, an insurance company usually does not know what the incoming year’s portfolio will look like. It prices policies, based on its model, as and when they come in during the new year. This is done without advance knowledge of the make-up of the whole portfolio.

To simulate this, we don’t provide you with the test dataset, nor do we provide you with the leaderboard datasets. This is one of the reasons that in this competition you are asked to provide your model code.

We use your model code on the test and leaderboard datasets to generate:

Premium prices that are used in the profit leaderboard and final competition evaluation. This uses your predict_premium function.
Claim value estimates that are used in the RMSE leaderboard. This uses your predict_expected_claim function.
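
As a rough illustration of that split, here is a minimal sketch of the two entry points. Only the function names predict_premium and predict_expected_claim come from this thread. The (model, rows) signature, the dict-based row format, the base_claim field, the policy ids and the 20% loading are all assumptions made for the example, not the competition's actual template.

```python
from typing import Dict, List

# Hypothetical "model": a single average claim cost fitted elsewhere.
Model = Dict[str, float]
Row = Dict[str, object]


def predict_expected_claim(model: Model, rows: List[Row]) -> List[float]:
    """Return one expected claim amount per contract (RMSE leaderboard)."""
    # Assumption: a constant prediction; a real model would use row features.
    return [model["base_claim"] for _ in rows]


def predict_premium(model: Model, rows: List[Row], loading: float = 0.20) -> List[float]:
    """Return one premium per contract (profit leaderboards, final evaluation)."""
    # Assumption: premium = expected claim plus a flat loading.
    return [claim * (1.0 + loading) for claim in predict_expected_claim(model, rows)]


# Made-up year-5 rows: one policy seen in training, one new policy.
model = {"base_claim": 100.0}
rows = [{"id_policy": "known_1", "year": 5}, {"id_policy": "new_1", "year": 5}]
claims = predict_expected_claim(model, rows)
premiums = predict_premium(model, rows)
```

Deriving the premium from the expected claim is only one possible design; the two functions could just as well be independent.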

Please let me know if this doesn’t answer your question.

simon_coulombe · December 23, 2020, 9:25pm

Thanks @alfarzan,

Thanks for the reply. This is what I initially understood, but @alexander_penkin had me confused with "For some rows in test data we have claim_amount for past 4 years".

clembrule · December 24, 2020, 5:37pm

I guess his point was that:

In the test data (year 5), you’ll have information about what happened in the first 4 years, thanks to the training data and id_policy.
In the RMSE leaderboard, no id_policy will match the training set since it only covers 4 years, so you cannot use this information.

=> you can use past_claims in profit_leaderboard but not in rmse_leaderboard…
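
A minimal sketch of that idea, assuming rows are plain dicts with id_policy and claim_amount keys. The helper names build_claim_history and past_claims_mean, the policy ids and the claim values are all made up for illustration. Policies that never appeared in training get None, so the model has to handle "no history" explicitly.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Toy training rows (years 1-4 with claim_amount); values are made up.
train_rows = [
    {"id_policy": "P1", "year": 1, "claim_amount": 0.0},
    {"id_policy": "P1", "year": 2, "claim_amount": 500.0},
    {"id_policy": "P2", "year": 1, "claim_amount": 0.0},
]


def build_claim_history(rows: List[dict]) -> Dict[str, List[float]]:
    """Map each id_policy to the list of its past claim_amount values."""
    history = defaultdict(list)
    for row in rows:
        history[row["id_policy"]].append(row["claim_amount"])
    return dict(history)


def past_claims_mean(history: Dict[str, List[float]], id_policy: str) -> Optional[float]:
    """Mean past claim for a known policy, or None for a new policy."""
    claims = history.get(id_policy)
    if not claims:
        return None
    return sum(claims) / len(claims)


history = build_claim_history(train_rows)
test_rows = [{"id_policy": "P1", "year": 5}, {"id_policy": "P9", "year": 5}]
features = [past_claims_mean(history, r["id_policy"]) for r in test_rows]
```

Since the test data is never handed out, a lookup like this would presumably have to be saved with the model and applied inside the prediction functions; that is an assumption about the workflow, not something confirmed in this thread.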

edwin_graham · February 9, 2021, 10:57am

Are we allowed to know the proportion of test rows that are id_policy values that are in the training set? I thought I read in one of the other threads that it was 60:40 in one direction or the other, but I can’t find that now and it’s not on the front page.

alfarzan · February 9, 2021, 4:54pm

Hi @edwin_graham

Yes, it’s not explicitly written. The final test set has 100K policy IDs with year = 5. Your training data consists of 60K policies with years 1 - 4. The 5th year for all of these policies appears in the test set.

This is described in detail in the dataset section on the overview page, if I’m correct?
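
The counts quoted above can be turned into a proportion with a short sketch. It assumes one year-5 row per test policy and that every training policy reappears in the test set, as stated in this thread. The known_share helper and the toy ids are hypothetical.

```python
from typing import Iterable


def known_share(train_ids: Iterable[str], test_ids: Iterable[str]) -> float:
    """Fraction of test policies whose id_policy also appears in training."""
    train_set = set(train_ids)
    test_list = list(test_ids)
    known = sum(1 for pid in test_list if pid in train_set)
    return known / len(test_list)


# Counts quoted in the thread: 60K training policies, 100K test policies.
n_train_policies = 60_000
n_test_policies = 100_000
share_from_counts = n_train_policies / n_test_policies

# Same ratio on a toy example: 3 known ids out of 5 test ids.
toy_share = known_share(["a", "b", "c"], ["a", "b", "c", "d", "e"])
```

Under those assumptions, about 60% of the test policies have a claims history in the training data and about 40% are new.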