Hello all, I hope you’re still having fun! As the deadline approaches, I thought I’d share some final tips that have helped me or that might improve your score. If you have extra tips to share, or think some of these need correcting, please do share in the comments! Without further ado, let’s jump in:
- Spend some time refactoring. It’s amazing how much we miss when jumping about in a messy notebook titled sub_7_catboost_2.2.ipynb. Make a new notebook if you need to, put code into functions if you’re using it a lot, add lots of comments, and generally spend some time making things neat and tidy. This will help you test final changes and see what you might have missed, and it will also make life much more pleasant for anyone wanting to actually implement your solution!
- Work on your local CV. The leaderboard tells us something, but with 10 submissions a day and an unseen test set you might well hit a case where a lucky model does better than it should on the LB. Local testing helps avoid this kind of overfitting. Don’t just rely on the validation set - try to implement some sort of cross-validation to evaluate your models on multiple subsets of the data (see the sketch below).
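A minimal cross-validation sketch - X, y and make_model() are hypothetical placeholders for your own setup, and you’ll want to swap in the competition metric:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score  # placeholder - use the competition metric

# Evaluate on 5 different train/validation splits; stratification keeps
# the class balance similar across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, valid_idx in skf.split(X, y):
    model = make_model()  # hypothetical factory returning a fresh, untrained model
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    preds = model.predict_proba(X.iloc[valid_idx])[:, 1]  # assuming binary classification
    scores.append(roc_auc_score(y.iloc[valid_idx], preds))
print(f"CV: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```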
- Take advantage of the validation set - it’s only a few hundred extra samples, but when training for your final submission there is no harm in throwing those in with the training data:

```python
# df = the training data, val = the provided validation set
merged = pd.concat([df, val], ignore_index=True)
```

It might just get you that extra 0.001.
- Ensembles for the win. It’s sad, but the reality is that ensembles tend to win these things. Remember that for an ensemble to work it needs to be made up of different models - 10 replicas trained on the same data won’t be ideal. Find ways to mix up your training data (maybe taking a subsample each time) or to use different types of models or different hyperparameters - there’s a sketch of the subsampling idea below.
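A minimal sketch of the subsampling idea - df, features, target, test and make_model() are hypothetical stand-ins for your own setup:

```python
import numpy as np

# Train copies of the same model on different random subsamples, then
# average their predictions (a simple bagging-style ensemble).
ensemble_preds = []
for seed in range(5):
    sample = df.sample(frac=0.8, random_state=seed)  # a different 80% slice each time
    model = make_model(random_state=seed)            # hypothetical model factory
    model.fit(sample[features], sample[target])
    ensemble_preds.append(model.predict_proba(test[features])[:, 1])  # binary case
final_preds = np.mean(ensemble_preds, axis=0)
```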
- mean([model1_preds, model2_preds…]) might not be ideal - if you do have different models in your ensemble, give the predictions of the ones that do better in your local CV a higher weight. This way you are still taking the different predictions into account, but you’re relying more on the better models. Diversity without sacrificing accuracy. You can fine-tune the weightings, but I tend to do something more hand-crafted, with model_1_preds * 0.7 + model_2_preds * 0.15 + …
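In code, that hand-crafted weighting looks something like this (the weights and prediction arrays are placeholders - pick your own weights based on local CV scores):

```python
import numpy as np

# Weights favour the model with the best local CV score; keeping them
# summing to 1 leaves the combined predictions on the same scale.
weights = [0.7, 0.15, 0.15]
all_preds = [model_1_preds, model_2_preds, model_3_preds]
final_preds = np.average(all_preds, axis=0, weights=weights)
```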
- Class balance is important - as we’ve seen in other discussions, dealing with the class (im)balance is a massive part of this contest - get that wrong and it doesn’t really matter how good your model is. Check out my notebook on the subject or play around yourself (one common starting point is sketched below) - and if anyone has tips here please share!
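One common starting point (not necessarily what the notebook does - I’m assuming a scikit-learn-style classifier here) is to weight classes inversely to their frequency, which many libraries support out of the box:

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on the minority class cost more during training.
clf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)  # X_train, y_train are placeholders for your data
```

Most boosting libraries have an equivalent knob (e.g. scale_pos_weight in XGBoost).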
- fillna(0) - use with caution. I know at the start dealing with missing values can be a drag, but before your final submission think hard about whether there might be better options, like the ones sketched below.
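A couple of alternatives worth trying (the "age" column is just a hypothetical example):

```python
# The fact that a value is missing can itself be informative,
# so record it as a feature before imputing.
df["age_was_missing"] = df["age"].isna().astype(int)

# Then impute with a per-column statistic rather than a blanket 0.
df["age"] = df["age"].fillna(df["age"].median())
```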
That’s all I can think of for now. Does anyone have recommendations for must-have features or feature engineering techniques?
Good luck to all - J
EDIT/PS: Totally thought there was only a week left, so these are not-so-final suggestions I guess!