It’s (almost) over! Sharing approaches

Same :crying_cat_face:

1 Like

I’m actually a bit miffed, tbh. I get that it’s an oversight on our part given the ambiguous wording, but I’ve got a few “NCD history” features embedded in my pre-processing. Why would we spend 10 weeks with one data structure (requiring the pre-processing code to calculate these features on the fly), only to have to refactor it for the final submission …
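For context, these features are computed on the fly in my pre-processing, roughly like this minimal pandas sketch (with hypothetical column names like policy_id, year, and ncd_level, not the real ones):

```python
import pandas as pd

def add_ncd_history(X: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical column names: policy_id, year, ncd_level.
    X = X.sort_values(["policy_id", "year"]).copy()
    # A drop in the no-claims discount level usually signals a recent claim.
    X["ncd_change"] = X.groupby("policy_id")["ncd_level"].diff().fillna(0)
    return X
```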

Luckily I’ve got a “claim history” data frame as part of my final “model”, which was added last minute and gives some sizeable loadings (over and above my NCD change history features), so I’ll have some mitigation from that.

I understand that this was not as clear as it could have been. Can I ask how exactly you and @michael_bordeleau (and potentially others) were expecting the final dataset to look?

We could make a few exceptions and make it work :muscle:

1 Like

I’m guessing the admins maybe didn’t foresee the use of prior history features? This is obviously super common in actual insurance rating plans, but I can’t really think of another reason. My prior year variables were also quite predictive, but knowing that I would only have them for a little over half the final policies, I thought about having two sets of models:

  1. For the 57K in our training set, use the best models which have the prior year features, and
  2. For the other policies, use a subpar set of models trained without using any of the prior year features.

Ultimately, probably from running out of steam, I decided to just use the subpar models for all policies. I did something similar to @simon_coulombe and saved a simple list of policies with the number of years they had claims in our training data (see the sketch below). I ended up doing some kind of out-there feature engineering, and with the new set of variables got pretty close to the accuracy of the models using the prior year features.
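The lookup amounts to something like this minimal sketch, assuming hypothetical file and column names (policy_id, year, claim_amount) rather than the competition’s actual ones:

```python
import pandas as pd

# Training data for years 1-4 (hypothetical file and column names).
train = pd.read_csv("training_data.csv")

# One row per policy: the number of distinct years with at least one claim.
claim_years = (
    train.loc[train["claim_amount"] > 0]
    .groupby("policy_id")["year"]
    .nunique()
    .rename("n_claim_years")
    .reset_index()
)

# Save it alongside the model so it ships inside the submission zip.
claim_years.to_csv("claim_history_lookup.csv", index=False)
```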
I definitely think a different process probably makes more sense for the final evaluation, and I have my gripes with the preprocessing function, but I don’t think the explanation of the final dataset was ambiguous: “The final test dataset, where the final evaluation takes place, includes 100K policies for the 5th year (100K rows).”

3 Likes

I’m in the same place as you. Seems like we had similar thinking throughout this competition.

Everything was framed in a way to use all the data available.

When you submit a model, your model makes predictions for:

  1. Year 1 with access to data from year 1.
  2. Year 2 with access to data from years 1 - 2.
  3. Year 3 with access to data from years 1 - 3.
  4. Year 4 with access to data from years 1 - 4.

This is calling for us to develop features that look into the history of a client.

And so were the weekly leaderboards (unless I misunderstood those as well).

1 Like

I feel like that wouldn’t be fair to the rest of us, since I changed my models assuming we would just have year 5.

3 Likes

I expected the data to be structured in the same fashion as the RMSE leaderboard.

Therefore, to quote year 5, we would have access to years 1 to 5 in terms of underwriting variables.

When you submit a model, your model makes predictions for:

  1. Year 1 with access to data from year 1.
  2. Year 2 with access to data from years 1 - 2.
  3. Year 3 with access to data from years 1 - 3.
  4. Year 4 with access to data from years 1 - 4.

would continue with:

  5. Year 5 with access to data from years 1 - 5, on the policies that have this information.

For new business, I understand that there would not be any info.

1 Like

Hmm, ok, we’ll look into this and see the scale of the issue. If many people had this issue then we will definitely take action, and if any model fails because of this we’ll get in touch :+1:t2:

2 Likes

Well, as we saw from the weekly profit leaderboards, a more accurate model might not be that helpful anyways :laughing:. Since year 5 is completely new data, I also wondered if the prior year features would correctly extrapolate/generalize. I guess I probably wouldn’t change my submission at this point regardless. Looking forward to seeing the results!

The confusion is a shame. In my case, I understood from the first moment that the predictions with access to past information were only for the RMSE leaderboard. In my opinion, the instructions are very clear that the final leaderboard uses only information from year 5 and that the model should work policy by policy, without past information.

In fact, I took a similar approach to other participants by saving some of the past information in my model. It would seem very unfair to me if the participants who were confused could not take part after all the work they have put in, but equally they should not end up with an advantage over those who understood the methodology of the final leaderboard.

I would propose that the confused participants be allowed to adapt their models to receive only information from year 5. But they should not be able to change their prices or create new variables… just the minimum changes needed to participate, without any additional benefit.

Finally, I want to add that this is the best competition I have ever participated in. Congratulations to the organizers for the great work.

2 Likes

I decided to price very high, because whenever I lowered the price my profit was negative. That made me feel sad, because I think our solutions (markets) are not the social optimum.

My quantitative models are probably too conservative, so there is not much difference (compared to other modeling methods) between people who made at least one claim and other people.

2 Likes

Yes, I was expecting the same format.

I see what @lolatu2 is trying to say, but I don’t buy the argument that including the years 1-4 rows is advantageous to michael, myself, or others in our position. Even if there is only year 5 data, prior year information can still be passed into your model; you just chose not to. I could have created a dataframe containing all the prior year features as part of my “model” (see the sketch below); I just chose not to handle the prior year features this way, because that would have meant refactoring my code (and it hadn’t even occurred to me that this might be necessary, since the competition so far always gave access to prior years for all leaderboards).
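Concretely, something like this would have worked, reusing the hypothetical lookup-table names from the earlier sketch (the function and column names here are assumptions, not the actual submission interface):

```python
import pandas as pd

# Prior year features built from training, shipped inside the submission zip.
claim_history = pd.read_csv("claim_history_lookup.csv")

def predict_expected_claim(model, X_raw: pd.DataFrame):
    # Left-join the saved prior year features onto the incoming year 5 rows;
    # policies never seen in training come back as NaN and are treated as
    # new business with no claim history.
    X = X_raw.merge(claim_history, on="policy_id", how="left")
    X["n_claim_years"] = X["n_claim_years"].fillna(0)
    return model.predict(X)
```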

2 Likes

Ok, thanks.

I tried my best to make the model foolproof so that it doesn’t crash. So I doubt it will crash…

Ultimately, the fix I would need is just to retrain the model without the few features derived from past data, such as the movement of the NCD through time.

And I agree with RHG: we should not be able to change the price nor create any new variables.
(Easy to validate, as the staff has all our code.)

1 Like

I think the other intent on the admins’ side with the final test data is to simulate “new” business, where we don’t have access to the prior year data.

The final test dataset, where the final evaluation takes place, includes 100K policies for the 5th year (100K rows). To simulate a real insurance company, your training data will contain the history for some of these policies, while others will be entirely new to you.

1 Like

Hmmm
I really don’t want Ali to be stuck having to verify thousands of lines of code to make sure nobody sneaked new variables in after we discussed our approaches in this very forum.

I think the best approach is to let you sink.

Just kidding. Feeding years 1-4 of the training data along with year 5 of the whole dataset and calling it a day seems like the fairest approach. Anyone who needed that data could have just saved the whole training set in their zip file submission anyway.

I’ll take my bribe in bitcoin.

5 Likes

Exactly. Many people will have already saved this information in their model (through an extra lookup table), so it makes no difference to them, only to those of us who were expecting the data to be there so we could compute those features “on the fly”.

Yea, I think that makes sense. So just run the test set through including the ~240K training rows that we got, then filter the predictions to year 5 (roughly as sketched below). That way, there are still the 40% of policies that your models wouldn’t get prior year history for, which I think is in line with the original final evaluation intent. I just wouldn’t feed years 1-4 through for all the policies, since that would actually cause others who understood the evaluation correctly to want to change their models. (I’ll take bitcoin too.)
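In code terms, roughly this, with hypothetical file names and a hypothetical entry point for running a submission:

```python
import pandas as pd

# Hypothetical file names for the two data sets.
train_years_1_to_4 = pd.read_csv("training_data.csv")   # years 1-4, ~240K rows
test_year_5 = pd.read_csv("final_test_data.csv")        # year 5, 100K rows

# Prepend the training rows so on-the-fly history features can still be
# computed, then keep only the year 5 predictions for scoring.
combined = pd.concat([train_years_1_to_4, test_year_5], ignore_index=True)
predictions = run_submission(combined)  # hypothetical entry point
year5_predictions = predictions[(combined["year"] == 5).to_numpy()]
```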

4 Likes

Oh yeah, just to be clear, the only sensible solution I see regarding the 5-year-history issue would be to include the training data when a model is being run. Though, as @lolatu2 mentions, the 5-year history would only exist for the policies in the training data.

Also, when code fails we won’t be sending your models back to be corrected; we will work with you to debug and fix any issues, so that we don’t have to do an in-depth manual code inspection afterwards. Otherwise I will have to plan for a few more years at Imperial :older_man:

5 Likes

That’s clever. Didn’t think about that. I could have used the prior version of my model for the ~40K without history and my new model including prior claims for the ~60K with history.

Instead, I bundled the ~40K new policies with the no-claims bunch (which is not so bad, because in year 1 everybody is assumed to have no prior claims).

Anyway, fingers crossed :crossed_fingers:t2:

1 Like

I agree, otherwise I’d like to have access to that data as well haha.

1 Like