Question about data

tim_robin · December 19, 2020, 6:26pm

The training data has 228216 rows and 57054 unique id_policy values, i.e. exactly 4 rows per policy. The premade functions’ docstrings all say “Each row is a different contract. This data has not been processed.” But if their inputs are in the same format as the training data, this is false, right?

Also, given the training data, it makes a lot of sense to look at the 4-year evolution of each contract by grouping rows together by id_policy. However, based on the description of the final test data, we will only have 100k rows of 5th-year data, right?

The main point of my question is to understand whether I can group by id_policy and treat those groups as 4-step time series, i.e. use prior rows with the same id_policy to help predict the current target variable, or whether I should only use features from the specific row when predicting the target variable.

alfarzan · December 19, 2020, 6:48pm

Hi @tim_robin

Each row represents a single policy over 1 year. However, each policy appears in the training data 4 times, once for each of the 4 years it is tracked.

The docstring

Regarding the docstring, each row is a different contract in the sense that in your final predictions you are providing an estimate for the claim, and offering a premium price, for one row of the data. That’s why it is worded in that way. One row is one year, and that is one contract. It can be renewed, which is why it repeats over multiple years.

The test data

You are right that in the final test data there will be around 100K contracts coming in for the 5th year. However, the idea is that, like an insurance company, some of these contracts you will have seen before in the training data, while others may be contracts that you have never seen before.

Time series or not

Regarding your question as to whether you should treat the 4 years as a time series or treat each year independently based on its features… well, that’s up to you.

In practice you will have to be able to handle cases where you don’t have a history for the contract. So your model must be flexible enough to price something without a history.
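As a rough illustration of that flexibility, here is a minimal sketch that groups rows by id_policy, builds features from prior years only, and falls back to neutral values when a contract has no history. Only id_policy comes from this thread; the `year` and `claim_amount` column names, the helper name, and the feature names are assumptions made for the example.

```python
from collections import defaultdict


def add_history_features(rows):
    """Attach prior-year claim features to each row, grouped by id_policy.

    `rows` is a list of dicts. `year` and `claim_amount` are hypothetical
    column names used only for this sketch.
    """
    by_policy = defaultdict(list)
    for row in rows:
        by_policy[row["id_policy"]].append(row)

    enriched_rows = []
    for policy_rows in by_policy.values():
        # Walk each policy in chronological order so that a row only
        # ever sees claims from earlier years.
        policy_rows.sort(key=lambda r: r["year"])
        prior_claims = []
        for row in policy_rows:
            enriched = dict(row)
            enriched["n_prior_years"] = len(prior_claims)
            enriched["has_history"] = bool(prior_claims)
            # Neutral fallback (0.0) when the contract has no history yet.
            enriched["mean_prior_claim"] = (
                sum(prior_claims) / len(prior_claims) if prior_claims else 0.0
            )
            enriched_rows.append(enriched)
            prior_claims.append(row["claim_amount"])
    return enriched_rows


rows = [
    {"id_policy": "PL1", "year": 1, "claim_amount": 0.0},
    {"id_policy": "PL1", "year": 2, "claim_amount": 100.0},
    {"id_policy": "PL1", "year": 3, "claim_amount": 50.0},
    {"id_policy": "PL2", "year": 1, "claim_amount": 20.0},
]
features = add_history_features(rows)
```

The `has_history` flag lets a downstream model learn to treat the fallback value differently from a genuine zero-claim history.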

In the real world companies use a mixture of both approaches so both are entirely valid and sensible.

Has this clarified things?

tim_robin · December 19, 2020, 6:56pm

Thank you for the quick response. I do have one concern. In real life, a flexible approach that takes advantage of a contract’s prior history when it is available, and does its best when the contract is new, does seem like the best approach. My concern is this: when my model is being tested, how will it be able to retrieve the previously seen historical data on contracts it has seen before? In the real world this is simple, but in the artificial setting of this competition, how would that happen?

alfarzan · December 19, 2020, 6:59pm

That’s an interesting point. I think it can happen in much the same way as it does in the real world: with a lookup table.

For example, you could have an object returned by your fit_model function that contains a historical lookup table. That way you can check whether you’ve seen a contract’s history before you price it.
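A minimal sketch of that idea, under stated assumptions: the fit_model signature below (a list of dicts) is hypothetical and not the competition’s actual interface, `PricingModel`, `expected_claim`, and the `claim_amount` column are illustrative names, and the 50/50 blend is an arbitrary placeholder rather than a recommended weighting.

```python
from collections import defaultdict


class PricingModel:
    """Hypothetical fitted object that carries a per-policy claim history."""

    def __init__(self, base_claim, history):
        self.base_claim = base_claim  # portfolio-wide average claim
        self.history = history        # id_policy -> list of past claims

    def expected_claim(self, id_policy):
        past = self.history.get(id_policy)
        if not past:
            # Contract never seen in training: price without a history.
            return self.base_claim
        policy_mean = sum(past) / len(past)
        # Arbitrary 50/50 blend of portfolio average and policy history.
        return 0.5 * self.base_claim + 0.5 * policy_mean


def fit_model(rows):
    """Build the lookup table from training rows (a list of dicts)."""
    history = defaultdict(list)
    for row in rows:
        history[row["id_policy"]].append(row["claim_amount"])
    all_claims = [row["claim_amount"] for row in rows]
    base_claim = sum(all_claims) / len(all_claims)
    # Store a plain dict so lookups of unseen ids do not create entries.
    return PricingModel(base_claim, dict(history))


training_rows = [
    {"id_policy": "PL1", "claim_amount": 0.0},
    {"id_policy": "PL1", "claim_amount": 100.0},
    {"id_policy": "PL2", "claim_amount": 20.0},
]
model = fit_model(training_rows)
seen_estimate = model.expected_claim("PL1")
unseen_estimate = model.expected_claim("PL_NEW")
```

Because the history lives inside the object that fit_model returns, it travels with the saved model, so nothing has to be reloaded from the raw training file at prediction time.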

Though I’m sure the community will be able to find all sorts of more clever ways as well.

tim_robin · December 19, 2020, 7:05pm

OK, cool. I had figured we wouldn’t be allowed to include the training data as part of our submission. Thanks for the help.