#3 solution to Learning to Smell

alarih · February 17, 2021, 1:21pm

This was one of my favorite challenges so far, because the problem formulation is very simple and it attempts to get insight into one of our primal but neglected senses. My solution was far behind the top 2 competitors, so I feel like I was missing some crucial ingredient, and I am looking forward to learning about their approach.

The core of my approach is a neural net on fingerprints.

Data: union of various fingerprints extracted with rdkit from the SMILES in the train set

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys
mol = Chem.MolFromSmiles(smiles)

fp0 = MACCSkeys.GenMACCSKeys(mol) # MACCS keys
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, 256) # Morgan fingerprints (radius 2, 256 bits)
fp2 = Chem.RDKFingerprint(mol) # RDKit topological fingerprint
# one boolean per SMARTS pattern: True if the molecule contains a match
fp3 = [len(mol.GetSubstructMatch(Chem.MolFromSmarts(smarts))) > 0 for smarts in smarts_inteligands] # smarts_inteligands has about 305 SMARTS patterns

Preprocessing: drop constant and duplicate fingerprints
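
A minimal sketch of this preprocessing step, assuming that "fingerprints" here means individual fingerprint bits (feature columns) and that the features are stored as a list of equal-length 0/1 rows. The function name `drop_constant_and_duplicate_columns` is hypothetical; the post does not show the actual code.

```python
def drop_constant_and_duplicate_columns(rows):
    """Return (filtered_rows, kept_column_indices) for a list of equal-length rows."""
    columns = list(zip(*rows))  # transpose: one tuple per fingerprint bit
    seen = set()
    keep = []
    for idx, col in enumerate(columns):
        if len(set(col)) <= 1:  # constant column carries no information
            continue
        if col in seen:  # exact duplicate of an earlier column
            continue
        seen.add(col)
        keep.append(idx)
    filtered = [[row[i] for i in keep] for row in rows]
    return filtered, keep


rows = [
    [1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
]
filtered, keep = drop_constant_and_duplicate_columns(rows)
```

The returned `keep` indices would need to be applied unchanged to the test-set features so that train and test columns stay aligned.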

Model:

from torch import nn
hidden_size = 512
dropout = .3
output_size = 75  # one logit per smell
# input_size is the number of fingerprint bits left after preprocessing
model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(inplace=True),
    nn.Dropout(dropout),
    nn.BatchNorm1d(hidden_size),
    nn.Linear(hidden_size, hidden_size),
    nn.ReLU(inplace=True),
    nn.Dropout(dropout),
    nn.BatchNorm1d(hidden_size),
    nn.Linear(hidden_size, output_size),
)

Training was done over 5 folds, each one for 25 epochs with nn.BCEWithLogitsLoss. The model tried to predict probabilities of 75 smells.
The last step was to come up with 5 prediction sequences starting from individual smell probabilities. For this I sampled smells using their predicted probabilities and found the sequence with the best Jaccard score. Then I found the next sequence with the best incremental Jaccard score, and so on.
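
One way to read this last step, sketched under several assumptions that the post does not confirm: each smell is sampled as an independent Bernoulli draw from its predicted probability, the distinct sampled sets double as the candidate sequences, and "incremental" means the gain in the best Jaccard score achieved so far, averaged over the samples. All function names and the `n_samples` value are illustrative.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_top_k(probs, k=5, n_samples=500, seed=0):
    """Greedily pick up to k label sets that maximise the sampled best-of-k Jaccard."""
    rng = random.Random(seed)
    # Assumption: smells are sampled independently from their predicted probabilities.
    samples = [
        frozenset(i for i, p in enumerate(probs) if rng.random() < p)
        for _ in range(n_samples)
    ]
    # Assumption: the candidate pool is the set of distinct sampled label sets.
    candidates = list(dict.fromkeys(samples))
    best_so_far = [0.0] * len(samples)  # best Jaccard reached for each sample
    chosen = []

    def incremental_gain(candidate):
        # Total improvement over the sequences that were already chosen.
        return sum(
            max(jaccard(candidate, s) - b, 0.0)
            for s, b in zip(samples, best_so_far)
        )

    for _ in range(min(k, len(candidates))):
        pick = max((c for c in candidates if c not in chosen), key=incremental_gain)
        chosen.append(pick)
        best_so_far = [max(b, jaccard(pick, s)) for s, b in zip(samples, best_so_far)]
    return [sorted(c) for c in chosen]


# Toy example with 4 smells, two of them likely.
result = greedy_top_k([0.9, 0.8, 0.05, 0.02], k=3, n_samples=200, seed=1)
```

The first pick is the set with the highest average Jaccard over the samples; each later pick is rewarded only for samples that the earlier picks covered poorly, which pushes the 5 sequences to differ from each other.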
Bells and whistles. Some of the things that made small improvements:

label smoothing
weighting labels for training
weighting fingerprints based on their estimated importance
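
The post does not say which form of label smoothing was used. One common variant for multi-label targets, shown here only as a sketch, pulls each 0/1 target toward 0.5 by a small `eps` before computing binary cross-entropy on the logits. The value `eps=0.1` and the names `smooth_labels` and `bce_with_logits` are assumptions; the pure-Python loss mirrors what `nn.BCEWithLogitsLoss` computes with its default mean reduction.

```python
import math


def smooth_labels(targets, eps=0.1):
    """Move hard 0/1 targets toward 0.5: 0 -> eps/2, 1 -> 1 - eps/2."""
    return [t * (1.0 - eps) + 0.5 * eps for t in targets]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


hard = [0.0, 1.0, 0.0]
soft = smooth_labels(hard, eps=0.1)
loss = bce_with_logits([-2.0, 3.0, 0.5], soft)
```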

Things that didn’t work:

PCA on features and on labels
UMAP on features and on labels
pretraining on 109 labels
continuous version of IoU loss instead of BCE for training
various learning rate schedulers
dropping fingerprints with high correlation to others
trying another dropout/learning rate
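
For reference, a continuous IoU loss like the one mentioned in the list above is often written as a "soft Jaccard": set intersection and union are replaced by products and sums of predicted probabilities. The exact formulation tried in this solution is not given, so the sketch below is an assumption, as are the name `soft_iou_loss` and the smoothing constant `eps`.

```python
def soft_iou_loss(probs, targets, eps=1e-7):
    """1 - soft Jaccard between predicted probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    union = sum(p + t - p * t for p, t in zip(probs, targets))
    # eps keeps the ratio defined when both prediction and target are empty.
    return 1.0 - (intersection + eps) / (union + eps)


perfect = soft_iou_loss([1.0, 0.0, 1.0], [1, 0, 1])
half = soft_iou_loss([1.0, 1.0], [1, 0])
```

With hard 0/1 predictions this reduces to one minus the ordinary Jaccard score, so it optimizes something close to the competition metric, although according to the list above it did not beat BCE here.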

rejulien · February 17, 2021, 10:25pm

Thank you for sharing! Your step 5 is particularly interesting, as I spent a lot of time trying to come up with a meaningful way to propose 5 different predictions.

I am too afraid of a leaderboard shake-up to share my approach at the moment: the test set seems small, as I observed large differences between my local CV and LB results so far…

alarih · February 18, 2021, 4:05am

Thanks. Yeah, that step gave a small boost.
I had 2 ways to get validation scores:

out of fold score for individual models: around .39
average prediction of fold models on the held out set: around .41
LB: around .31

It’s possible that the distribution of the test smells was different from what we had in the train set, or that the molecules were structurally different, hence the discrepancy between LB and validation. In previous rounds, results were much closer.