FastAI Starter Notebook and Class Balance

I made a quick starter notebook for anyone wanting to play with FastAI tabular:

Running the notebook gets a log loss of ~0.9 (so not great) BUT that improves to 0.66 if you drop 80% of the ‘normal’ training examples to get a more balanced dataset. This undersampling approach has its downsides, so I’m curious if anyone has suggestions on other ways to work around the class imbalance.
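
For anyone who wants to try the undersampling step outside the notebook, here is a minimal sketch with pandas. It is not the notebook's exact code: the helper name `undersample_class`, the column names, and the toy data are all made up for illustration, and only the "drop 80% of the 'normal' rows" idea comes from the post above.

```python
import numpy as np
import pandas as pd

def undersample_class(df, label_col, majority_label, drop_frac=0.8, seed=0):
    """Randomly drop `drop_frac` of the rows whose label equals `majority_label`."""
    majority = df[df[label_col] == majority_label]
    rest = df[df[label_col] != majority_label]
    # Keep only (1 - drop_frac) of the majority rows, chosen at random.
    kept = majority.sample(frac=1.0 - drop_frac, random_state=seed)
    # Shuffle so the kept majority rows are not grouped together.
    combined = pd.concat([kept, rest]).sample(frac=1.0, random_state=seed)
    return combined.reset_index(drop=True)

# Toy data standing in for the real training set (hypothetical columns and labels).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": ["normal"] * 900 + ["rare_a"] * 60 + ["rare_b"] * 40,
})

balanced = undersample_class(toy, "label", "normal", drop_frac=0.8)
print(balanced["label"].value_counts())
```

Only the training rows should be undersampled; the validation rows should keep the original class mix so the log loss stays comparable.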

If you find the above tip or the notebook helpful please let me know :slight_smile: Also happy to answer any questions on this approach - leave them in this thread :slight_smile:


Excellent notebook, thank you for sharing!

One point that gives me pause is the size of the test dataset (only 362 observations), which means significant variability in the metric. Some of the log loss values in your notebook are so close that they could be considered equivalent, with no guarantee that their order will stay the same on the leaderboard. I experienced this myself with a model that decreased the log loss on the test dataset (without ever having seen it) but increased it on the leaderboard … :slight_smile:
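
To get a rough feel for how much a log loss computed on 362 rows can move around, one option is to bootstrap the evaluation set. The sketch below is only an illustration: the labels and predicted probabilities are randomly simulated, the three-class setup is an assumption, and the function names are my own. With real predictions you would pass your own `y_true` and `probs` arrays instead.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log probability assigned to the true class."""
    probs = np.clip(probs, eps, 1.0)
    return float(-np.mean(np.log(probs[np.arange(len(y_true)), y_true])))

def bootstrap_log_loss(y_true, probs, n_boot=1000, seed=0):
    """Recompute the log loss on rows resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample row indices
        scores[i] = multiclass_log_loss(y_true[idx], probs[idx])
    return scores

# Simulated stand-in for a 362-row evaluation set (assumed: 3 classes).
rng = np.random.default_rng(42)
n_obs, n_classes = 362, 3
y_true = rng.integers(0, n_classes, size=n_obs)
probs = rng.dirichlet(np.ones(n_classes), size=n_obs)  # each row sums to 1

scores = bootstrap_log_loss(y_true, probs)
low, high = np.percentile(scores, [2.5, 97.5])
print(f"log loss: {multiclass_log_loss(y_true, probs):.3f}")
print(f"95% bootstrap interval: [{low:.3f}, {high:.3f}]")
```

If the intervals of two models overlap heavily, their ranking on a set this small probably should not be trusted.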

My feeling is that the current top 10 are the ones that, as @no_name_no_data said, slightly overfit the leaderboard dataset.

I am curious to see how this will evolve, and I hope some participants will find tricks that reduce log loss in general.