Hi, all,
The most difficult problem in this competition is class imbalance.
The pre- and post-Alzheimer classes are tiny compared with the Normal class.
Many people tackle this by downsampling the Normal class, like @moto does.
That works, but exploiting the rest of the Normal samples should be even better:
import pandas as pd
from sklearn.model_selection import KFold

# One fold per bag, so every Normal sample ends up in exactly one bag.
kf = KFold(n_splits=(num_total_neg // num_neg))
for _, idx in kf.split(df_neg):
    bagging_neg = df_neg.iloc[idx]  # .iloc, not [], to select rows by position
    df_samples = pd.concat([df_pos, bagging_neg])
    # ... train a model on df_samples ...
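For completeness, here is a minimal sketch of the whole bagging loop as I understand it. The bagged_predict function, the RandomForestClassifier, and the feature_cols / label names are my own placeholders, not @moto's actual setup: each bag trains its own model, and the test predictions are averaged at the end.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def bagged_predict(df_pos, df_neg, df_test, feature_cols, label="label"):
    # One bag per positive-set-sized chunk of negatives (one common choice);
    # KFold needs at least 2 splits.
    n_bags = max(len(df_neg) // len(df_pos), 2)
    kf = KFold(n_splits=n_bags, shuffle=True, random_state=0)
    probs = None
    for _, idx in kf.split(df_neg):
        # Each bag = all positives + one disjoint chunk of negatives.
        df_samples = pd.concat([df_pos, df_neg.iloc[idx]])
        model = RandomForestClassifier(random_state=0)  # placeholder model
        model.fit(df_samples[feature_cols], df_samples[label])
        p = model.predict_proba(df_test[feature_cols])
        probs = p if probs is None else probs + p
    return probs / n_bags  # average the per-bag class probabilities

Because the folds are disjoint, every Normal sample contributes to exactly one model, so no data is thrown away as it would be with plain downsampling.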
@moto’s wonderful notebook improves from LB 0.616 to LB 0.610 with this bagging (though I haven’t compared it against seed averaging).
I hope it helps.