Final data question

jocelyn · March 1, 2021, 2:47pm

Hi,
Question on the composition of the final dataset.
It says there are around 100K policies in the 5th year, some from the training data and some new.
Looking at the numbers, Training: 60K, RMSE: 5K, 10 Weekly: 30K.
I’m wondering if the data is the 5th year of all of those policies? So the “new to you” would be from the fact that we didn’t actually ever see the data for the policies in RMSE and 10 weekly sets.

If I’m understanding correctly, the pol_sit_duration would be at least 5, because in the training data pol_sit_duration is never smaller than year?

alfarzan · March 1, 2021, 4:06pm

Yes, I think that is correct. Because all policies in the data have a 5-year history, pol_sit_duration will be at least 5 in year 5.
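If you want to confirm the "pol_sit_duration is never smaller than year" observation on your own copy of the training data, a minimal sketch of the check could look like the following. The rows are made up for illustration and the `policy` key and the helper name are hypothetical; only `year` and `pol_sit_duration` come from the discussion above.

```python
def rows_below_duration_floor(rows):
    """Return the rows where pol_sit_duration is smaller than year."""
    return [r for r in rows if r["pol_sit_duration"] < r["year"]]


# Made-up rows for illustration only, NOT taken from the competition data.
toy_rows = [
    {"policy": "A", "year": 1, "pol_sit_duration": 1},
    {"policy": "A", "year": 4, "pol_sit_duration": 4},
    {"policy": "B", "year": 3, "pol_sit_duration": 7},
]

violations = rows_below_duration_floor(toy_rows)
print(len(violations))  # 0 for these toy rows
```

An empty result on the real training rows would support the reasoning that year 5 implies a duration of at least 5.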

About the distribution of policies between the leaderboards, training and test, it is as follows:

(1) Training: 60K policies x 4 years ~ 240K rows
(2) MSE leaderboard: 5K policies x 4 years ~ 20K rows
(3) Profit leaderboard (weeks 1 - 5): 15K policies x 4 years ~ 60K rows
(4) Profit leaderboard (weeks 6 - 10): 30K policies x 4 years ~ 120K rows
(5) Final evaluation leaderboard: 100K policies x 1 year ~ 100K rows

There will be some overlap between the policies in (3) and (4) so you’ll see the numbers add up to slightly more than 500K rows and 100K policies. But this is the rough structure.
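As a quick sanity check on the arithmetic, here is a small sketch that sums the figures from the list. The dictionary keys are just labels chosen for this example, and the totals are naive sums that count the overlapping profit leaderboard policies twice.

```python
# (policies, years) per split, copied from the list above.
splits = {
    "training": (60_000, 4),
    "mse_leaderboard": (5_000, 4),
    "profit_weeks_1_5": (15_000, 4),
    "profit_weeks_6_10": (30_000, 4),
    "final_evaluation": (100_000, 1),
}

# Rows per split = policies x years.
rows = {name: policies * years for name, (policies, years) in splits.items()}
total_rows = sum(rows.values())

# Policies across the training and leaderboard splits, counted naively,
# so any policy shared between the two profit leaderboards is counted twice.
pre_final_policies = sum(
    policies for name, (policies, _) in splits.items() if name != "final_evaluation"
)

print(total_rows)          # 540000
print(pre_final_policies)  # 110000
```

Both naive totals come out above 500K rows and 100K policies, which is consistent with the overlap described above.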

The data from the 5th year will include all of the 60K training policies, as well as all of the policies that you’ve seen in the leaderboards.

I hope that clarifies things. If not, we can continue.