Training and test set compound overlapping (9%)

Tobi · September 10, 2020, 8:39am

Based on a topology analysis and an independent InChIKey analysis, around nine percent (9%) of the compounds in train.csv and test.csv overlap. That usually should not happen, because it can lead to overfitting and/or distorted results. Stereoisomers (R/S, E/Z) were allowed and kept; only exact overlaps were counted as not allowed (the exact same stereoisomer in both the training and test set, or compounds with undefined stereo information).

While an overlap of roughly 9% is not much, it will probably lead to overly optimistic results. (Example URL: https://stats.stackexchange.com/questions/220378/consequences-of-overlap-between-training-validation-and-test-data)

Original (non-unique) training cases: 4316
Original (non-unique) test cases:     1079

COMPOUNDS (unique)            InChIKey  Topology
TRAIN (unique)                    3504      3521
TEST (unique)                     1032      1030
Sum unique (Train and Test)       4536      4551
COMBINED unique                   4117      4150
DIFFERENCE (OVERLAP)               419       401
Percent Overlap [%]               9.24      8.81
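
The figures in the table follow from simple set arithmetic. Here is a minimal sketch, assuming every compound has already been reduced to one identifier string (for example an InChIKey); the function name `overlap_stats` and the toy keys are illustrative and not part of the original analysis:

```python
def overlap_stats(train_keys, test_keys):
    """Return unique counts, overlap and percent overlap for two key collections."""
    train_unique = set(train_keys)
    test_unique = set(test_keys)
    # "Sum unique" in the table: unique train plus unique test compounds.
    sum_unique = len(train_unique) + len(test_unique)
    # "COMBINED unique": unique compounds after merging both sets.
    combined_unique = len(train_unique | test_unique)
    # The difference is the number of compounds present in both sets.
    overlap = sum_unique - combined_unique
    # The percentage is taken relative to "Sum unique", as in the table.
    percent = 100.0 * overlap / sum_unique if sum_unique else 0.0
    return {
        "train_unique": len(train_unique),
        "test_unique": len(test_unique),
        "sum_unique": sum_unique,
        "combined_unique": combined_unique,
        "overlap": overlap,
        "percent_overlap": percent,
    }


# Toy identifiers (hypothetical); real input would be one key per CSV row.
print(overlap_stats(["A", "B", "B", "C"], ["C", "D"]))
```

With the InChIKey column of the table, the same arithmetic gives 4536 - 4117 = 419 and 100 * 419 / 4536, which rounds to 9.24.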

So while the numbers from the InChIKey-based and topology-based tests do not match exactly, they give similar results. The SMILES keys provided in test.csv and train.csv are “unique strings”, but they are not truly “canonical SMILES”. A simple solution would be to add InChIKeys as an additional column directly to train.csv and test.csv, so the overlap is obvious to everybody participating. InChIKeys can be easily generated with OpenBabel.
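
Adding such a column could look roughly like the sketch below. It assumes the SMILES column is named `SMILES` (the actual header in train.csv is not stated here), and it takes the key generator as a parameter `key_fn`; in practice that would be a call into a cheminformatics toolkit such as OpenBabel, which is replaced by a small lookup table here so the example runs on its own:

```python
import csv
import io


def add_key_column(csv_text, key_fn, smiles_column="SMILES", key_column="InChIKey"):
    """Return CSV text with one extra column computed from the SMILES column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [key_column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # key_fn is supplied by the caller; it maps a SMILES string to a key.
        row[key_column] = key_fn(row[smiles_column])
        writer.writerow(row)
    return out.getvalue()


# Stand-in for a real InChIKey generator (assumption: a toolkit call goes here).
KNOWN_KEYS = {"CC=CC": "IAQRGUVFOMOMEM-UHFFFAOYSA-N"}

# Hypothetical two-column file; the "label" column is made up for the demo.
demo = "SMILES,label\nCC=CC,x\n"
print(add_key_column(demo, KNOWN_KEYS.get))
```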

Related also to this: Nineteen percent (19%) of structures in the training set are duplicates

guillaumegodin · September 12, 2020, 7:24pm

Dear @tobi, I repeat my comments from Nineteen percent (19%) of structures in the training set are duplicates and add this little remark:

Topology takes into account only atom connectivity, not bond order (single, double, aromatic, triple bonds). Also, be careful: InChIKey is not relevant for this comparison. Please use RDKit canonical isomeric SMILES output; you will see that those molecules are different.

Of course, we may have errors in the dataset, but no more than 1-3% at most.

Cheers,

Guillaume

guillaumegodin · September 16, 2020, 5:55am

Dear @tobi,

After a double check, we did not find any duplicate SMILES/molecules using our internal script based on the RDKit toolkit.

Cheers,

Guillaume

Tobi · September 17, 2020, 7:00am

RDKit has problems with chirality; filtering on canonical SMILES is not good enough. There are 150 unresolved issues related to chirality in RDKit (https://github.com/rdkit/rdkit/search?q=Chiral&type=issues). It works, of course, when used properly. Also, using an “internal” script based on RDKit does not prove anything.

Filtering by InChI and InChIKey with OpenBabel (InChI does encode stereochemistry, see https://www.inchi-trust.org/technical-faq-2/#8.2), as described in section “3.10 Remove duplicate molecules” (https://readthedocs.org/projects/open-babel/downloads/pdf/latest/), will reveal many duplicates in the test and training sets and also in the overlap between them. Asserting that “InchiKey is not relevant for this comparison” is just what it is, an assertion. See the references above. Here is a tiny example showing that R/S and Z/E stereoisomers, as well as structures without stereo information, are encoded correctly by the InChIKey. That is not the case for the SMILES codes provided in this competition: they look different, but actually encode the same molecule.

C\C=C\C          InChIKey=IAQRGUVFOMOMEM-ONEGZZNKSA-N (trans)
C\C=C/C          InChIKey=IAQRGUVFOMOMEM-ARJAWSKDSA-N (cis)
CC=CC            InChIKey=IAQRGUVFOMOMEM-UHFFFAOYSA-N (no stereo)

C[C@@](F)(Cl)Br  InChIKey=IUEOVIJFFFDZTG-REOHCLBHSA-N (R)
C[C@](F)(Cl)Br   InChIKey=IUEOVIJFFFDZTG-UWTATZPHSA-N (S)
CC(F)(Cl)Br      InChIKey=IUEOVIJFFFDZTG-UHFFFAOYSA-N (no stereo)
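
In a standard InChIKey, the first 14-character block is a hash of the connectivity, and the second block covers the remaining layers such as stereochemistry. That makes it possible to separate exact duplicates (identical full key) from stereoisomers (same first block, different second block). A small sketch using only the six keys listed above; the helper names are illustrative:

```python
from collections import defaultdict


def split_inchikey(key):
    """Split an InChIKey into (connectivity block, second block, protonation flag)."""
    if key.startswith("InChIKey="):
        key = key[len("InChIKey="):]
    skeleton, layers, protonation = key.split("-")
    return skeleton, layers, protonation


def group_by_skeleton(keys):
    """Map each connectivity block to the set of distinct full keys sharing it."""
    groups = defaultdict(set)
    for key in keys:
        skeleton, _, _ = split_inchikey(key)
        groups[skeleton].add(key)
    return dict(groups)


keys = [
    "IAQRGUVFOMOMEM-ONEGZZNKSA-N",  # trans
    "IAQRGUVFOMOMEM-ARJAWSKDSA-N",  # cis
    "IAQRGUVFOMOMEM-UHFFFAOYSA-N",  # no stereo
    "IUEOVIJFFFDZTG-REOHCLBHSA-N",  # R
    "IUEOVIJFFFDZTG-UWTATZPHSA-N",  # S
    "IUEOVIJFFFDZTG-UHFFFAOYSA-N",  # no stereo
]
for skeleton, members in group_by_skeleton(keys).items():
    print(skeleton, len(members), "distinct stereo variants")
```

Comparing full keys therefore keeps the stereoisomers apart, while comparing only the first block would merge them.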

I have used four to five independent tools (including fingerprints) and they all come to a similar conclusion: this dataset has issues when considering R/S and E/Z stereochemistry, because the provided SMILES are not canonical (canonical <> unique) and they are overlapping. That does not relate to the provided properties (smell). It only takes into account the SMILES codes themselves, including chirality and E/Z isomers.

The numbers below come from the OpenBabel toolkit (http://openbabel.org/wiki/Main_Page), following section “3.10 Remove duplicate molecules” in this PDF (https://readthedocs.org/projects/open-babel/downloads/pdf/latest/).

OpenBabel:
training set SMILES: 4316 --> 3504 unique with stereo (delta=812)
test set SMILES:     1079 --> 1032 unique with stereo (delta=47)
train+test unique:   4536 --> 4117 unique with stereo (delta=419)

Basically the test set is not unique, the training set is not unique and the remaining unique compounds when combined in total (to calculate the train/test overlap) are also not unique.
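
The counting behind these deltas can be sketched as follows. This is only an outline under stated assumptions: the canonical key would normally come from a toolkit (OpenBabel or RDKit), which is replaced here by a small hand-written lookup table, and `dedupe` is an illustrative name:

```python
def dedupe(records, key=lambda record: record):
    """Keep the first record per key; return (unique_records, number_removed)."""
    records = list(records)
    seen = set()
    unique = []
    for record in records:
        k = key(record)
        if k not in seen:
            seen.add(k)
            unique.append(record)
    return unique, len(records) - len(unique)


# Stand-in for a toolkit call. "C/C=C/C" and "C\C=C\C" are two spellings of
# trans-2-butene, so they share one key; "C\C=C/C" is the cis isomer.
TOY_KEYS = {
    "C/C=C/C": "IAQRGUVFOMOMEM-ONEGZZNKSA-N",
    "C\\C=C\\C": "IAQRGUVFOMOMEM-ONEGZZNKSA-N",
    "C\\C=C/C": "IAQRGUVFOMOMEM-ARJAWSKDSA-N",
}

smiles = ["C/C=C/C", "C\\C=C\\C", "C\\C=C/C"]
unique, removed = dedupe(smiles, key=TOY_KEYS.get)
print(unique, removed)
```

The three strings are all different as text, yet only two molecules remain once they are compared by key. The reported deltas are the same kind of difference, for example 4316 - 3504 = 812 for the training set.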

Here is the function for RDKit that uses “Chem.MolToSmiles” with chirality (includeChirality=true):
http://www.mayachemtools.org/docs/scripts/html/RDKitRemoveDuplicateMolecules.html based on a larger toolkit with lots of other useful functions: http://www.mayachemtools.org/docs/scripts/html/index.html
The results are similar (delta=1; I am not sure what causes the difference, but it is in the same range), and it uses the “rdkit.Chem.rdMolHash.HashFunction.CanonicalSmiles” function. There could be other issues like protomers and tautomers, but the numbers speak for themselves.

python C:\mayachemtools\bin\RDKitRemoveDuplicateMolecules.py --ov -i train-direct.smi -o train-dups.smi
Total number of molecules: 4316
Number of valid molecules: 4316
Number of ignored molecules: 0

Total number of unique molecules: 3505
Total number of duplicate molecules: 811

RDKitRemoveDuplicateMolecules.py: Done...
Total time: 5 wallclock secs ( 2.44 process secs)

---
python C:\mayachemtools\bin\RDKitRemoveDuplicateMolecules.py --ov -i test-direct.smi -o test-dups.smi
Total number of molecules: 1079
Number of valid molecules: 1079
Number of ignored molecules: 0

Total number of unique molecules: 1032
Total number of duplicate molecules: 47

RDKitRemoveDuplicateMolecules.py: Done...
Total time: 1 wallclock secs ( 0.70 process secs)

So overall I would not trust faulty SMILES codes. I am sure that in conflicting situations, such as when the specific enantiomer is not known, multiple smells could be recorded for the same molecule. However, that will hamper the overall outcome, because the machine learning process is based on the (faulty) and overlapping SMILES codes.

PerfectDark · September 17, 2020, 8:16am

So basically the data set is a good example of a real-world data set, which is never clean and unambiguous. I mean, cleaning is part of model building, so by taking your observations into account you might be able to get an advantage over competitors who do not.

Is it really a problem to train the model with the same molecule but a different target/class? It just tells a model this molecule (in flat) can be any of the multiple options.

Having said that, I think it’s a bit useless to take stereochemistry / 3D into account anyway at the step this or similar models will be used, namely screening large libraries. Yes, you probably get a better score in this competition. But when applying the model, you end up needing to generate the 3D descriptors (meaning conformers and which ones do you use?) for your whole screening set which can be in the millions and is often flat anyway. And then when it comes to the actual synthesis you most likely will generate a mixture of stereo-isomers regardless of your specific target and you can then determine which one is the actually interesting one.

3D at this stage is premature optimization. It gets relevant when doing QSAR, Pharmacophores and docking albeit I have no clue if this is a thing in this industry as it is in Pharma.

guillaumegodin · September 17, 2020, 11:01am

Dear @Tobi,

First of all, of the 155 issues listed on the RDKit GitHub page, only 48 remain open, so please check your sources carefully before claiming anything based on them.

Plus, not all of those issues are real chirality issues, per se.

Plus, a recent chirality improvement made in July, with the integration of Schrodinger code, solved many of the remaining issues, especially in rings.

Now about chemistry. There’s no chemist (including myself) who will buy your argument that E/Z isomers and R/S enantiomers are the same molecules; again, you are making a wrong assumption!

Equally, people familiar with SMILES will not buy your argument that C\C=C\C and C\C=C/C define the same molecule! These SMILES define two different stereoisomers!

Enantiomers are different molecules and are a major challenge in chemistry and the pharmaceutical industry. Examples are well-known and thoroughly reviewed in chemistry/pharma literature: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3614593/

It’s thus correct that the SMILES and the InchiKey are different for two stereoisomers. These molecules are not duplicates but two different chemical entities.

In SMILES this difference is made by the characters “@”, “/” and “\”. For R/S enantiomers the difference is made using @ vs @@. For E/Z isomers the difference is made by having a pair of parallel slashes “/” and “/” (or “\” & “\”) for trans-isomers and opposite slashes “\” and “/” (or “/” & “\”) for cis-isomers.

In your example, C\C=C\C defines (2E)-butene and C\C=C/C defines (2Z)-butene, which are two different molecules. Please consult the field “isomeric SMILES” on PubChem if you want a source other than RDKit. The links are https://pubchem.ncbi.nlm.nih.gov/compound/trans-2-Butene and https://pubchem.ncbi.nlm.nih.gov/compound/cis-2-Butene.
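
The slash rule can be illustrated with a deliberately naive sketch. It only recognizes the simple written pattern `X/A=B/Y`, with the directional marks directly next to the two double-bond atoms, and it ignores branches, ring closures and multiple double bonds. It is not a SMILES parser; any real comparison should go through a toolkit. The function name is illustrative:

```python
import re

# One atom: a common organic-subset symbol, or a bracket atom such as [C@H].
_ATOM = r"(?:Cl|Br|[BCNOSPFI]|\[[^\]]+\])"
# A directional mark, the two double-bond atoms, and a second directional mark.
_PATTERN = re.compile(r"([/\\])" + _ATOM + r"=" + _ATOM + r"([/\\])")


def simple_double_bond_geometry(smiles):
    """Return 'trans', 'cis', or None for the first X/A=B/Y pattern found."""
    match = _PATTERN.search(smiles)
    if match is None:
        return None  # no directional marks around a double bond
    left, right = match.groups()
    # Parallel slashes mean trans, opposite slashes mean cis.
    return "trans" if left == right else "cis"


for smi in ["C\\C=C\\C", "C\\C=C/C", "CC=CC"]:
    print(smi, simple_double_bond_geometry(smi))
```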

Also, the canonical forms, e.g. as published on PubChem, frequently define the canonical SMILES for the racemic compound. Here we are really interested in the enantiomerically pure forms.

Best regards,

Guillaume

francois_berenger · October 6, 2020, 8:06am

Very interesting!
This is the first time I have encountered a dataset where stereochemistry is so important.
My submission doesn’t take into account stereochemistry. Maybe, in future rounds I will look into it.