I put this competition on my possible project list and am only getting around to data exploration now.
The training dataset includes only policies with four years of claims data, and from the description these policies are renewed for a fifth year. The evaluation dataset, however, seems to include policies that might not be renewed the following year (unless I’m misreading the description). Does that not bias the training sample in a potentially important way, given that the evaluation datasets are not subject to the same constraint? There are many reasons why a policy might not be renewed, but one is that the driver was involved in a “very bad crash”. Is there a danger of selection bias, where “very bad crashes” are under-represented in the training data but more likely to appear in the evaluation data?
Or am I misunderstanding the description?
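
To make the concern concrete, here is a minimal simulation sketch. Every number in it is made up for illustration and none of it comes from the competition data: `p_severe` (yearly chance of a severe crash) and `p_drop_after_severe` (chance that a policy with a severe crash is not renewed) are hypothetical parameters, and the renewal rule is my own assumption about how such a filter might work.

```python
import random


def simulate(n_policies=100_000, p_severe=0.02, p_drop_after_severe=0.5,
             n_years=4, seed=0):
    """Compare severe-crash prevalence in all policies vs. renewed-only policies.

    All parameters are hypothetical; they only illustrate the mechanism.
    """
    rng = random.Random(seed)
    n_severe_all = 0
    n_renewed = 0
    n_severe_renewed = 0
    for _ in range(n_policies):
        # Did this policy have at least one severe crash during the observed years?
        had_severe = any(rng.random() < p_severe for _ in range(n_years))
        # Assumed rule: only policies with a severe crash can be dropped.
        dropped = had_severe and rng.random() < p_drop_after_severe
        n_severe_all += had_severe
        if not dropped:
            n_renewed += 1
            n_severe_renewed += had_severe
    share_all = n_severe_all / n_policies
    share_renewed = n_severe_renewed / n_renewed
    return share_all, share_renewed


share_all, share_renewed = simulate()
print(f"share with a severe crash, all policies:     {share_all:.4f}")
print(f"share with a severe crash, renewed policies: {share_renewed:.4f}")
```

Under these assumed numbers, the renewed-only subset (standing in for the training data) shows a noticeably lower share of severe crashes than the full population (standing in for the evaluation data). Whether anything like this applies to the actual competition depends on how the datasets were really built.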