RDKit has known problems with chirality, so filtering on canonical SMILES alone is not good enough. There are 150 unresolved issues related to chirality in the RDKit tracker (https://github.com/rdkit/rdkit/search?q=Chiral&type=issues). It works, of course, when used properly. Also, using an “internal” script based on “RDKit” does not prove anything.
Filtering on InChI and InChIKey with OpenBabel (section “3.10 Remove duplicate molecules”, https://readthedocs.org/projects/open-babel/downloads/pdf/latest/) will reveal many duplicates in the test set, in the training set, and in the overlap between them. The InChIKey does encode stereochemistry (https://www.inchi-trust.org/technical-faq-2/#8.2). Asserting that “InchiKey is not relevant for this comparison” is just that, an assertion; see the references above. Here is a tiny example showing that R/S and E/Z stereoisomers, as well as the forms without stereo information, each get their own InChIKey. That is not the case for the SMILES codes provided in this competition: many of them look different but actually encode the same molecule.
C\C=C\C InChIKey=IAQRGUVFOMOMEM-ONEGZZNKSA-N (trans)
C\C=C/C InChIKey=IAQRGUVFOMOMEM-ARJAWSKDSA-N (cis)
CC=CC InChIKey=IAQRGUVFOMOMEM-UHFFFAOYSA-N (no stereo)
C[C@@](F)(Cl)Br InChIKey=IUEOVIJFFFDZTG-REOHCLBHSA-N (R)
C[C@](F)(Cl)Br InChIKey=IUEOVIJFFFDZTG-UWTATZPHSA-N (S)
CC(F)(Cl)Br InChIKey=IUEOVIJFFFDZTG-UHFFFAOYSA-N (no stereo)
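To make the point about the six keys above concrete, here is a minimal sketch using only the Python standard library. It assumes the usual InChIKey layout of three hyphen-separated blocks (14, 10 and 1 characters), where the first block hashes the connectivity and the second block covers the remaining layers, including stereochemistry. The helper names `split_inchikey` and `group_by_skeleton` are made up for this illustration.

```python
# InChIKeys copied verbatim from the six examples above.
KEYS = {
    "C\\C=C\\C": "IAQRGUVFOMOMEM-ONEGZZNKSA-N",        # trans
    "C\\C=C/C": "IAQRGUVFOMOMEM-ARJAWSKDSA-N",         # cis
    "CC=CC": "IAQRGUVFOMOMEM-UHFFFAOYSA-N",            # no stereo
    "C[C@@](F)(Cl)Br": "IUEOVIJFFFDZTG-REOHCLBHSA-N",  # R
    "C[C@](F)(Cl)Br": "IUEOVIJFFFDZTG-UWTATZPHSA-N",   # S
    "CC(F)(Cl)Br": "IUEOVIJFFFDZTG-UHFFFAOYSA-N",      # no stereo
}


def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks."""
    parts = key.split("-")
    # Assumed layout: 14-character block, 10-character block, 1 character.
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return tuple(parts)


def group_by_skeleton(keys):
    """Map each first block to the set of second blocks seen with it."""
    groups = {}
    for key in keys.values():
        skeleton, rest, _ = split_inchikey(key)
        groups.setdefault(skeleton, set()).add(rest)
    return groups


for skeleton, rest_blocks in group_by_skeleton(KEYS).items():
    # Same first block = same connectivity; each distinct second block
    # is a different stereo form of that skeleton.
    print(skeleton, len(rest_blocks), sorted(rest_blocks))
```

The three butene entries share one first block and the three halomethane entries share another, while every stereo form has its own second block. So the full key separates stereoisomers, and the first block alone can be used to ignore stereo on purpose.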
I have used four to five independent tools (including fingerprints) and they all come to a similar conclusion: this dataset has issues when R/S and E/Z stereochemistry is considered, because the provided SMILES are not canonical (canonical <> unique) and the sets overlap. This does not relate to the provided properties (smell); it only takes into account the SMILES codes themselves, including chirality and E/Z isomers.
Results with the OpenBabel kit (http://openbabel.org/wiki/Main_Page), following section “3.10 Remove duplicate molecules” in this PDF (https://readthedocs.org/projects/open-babel/downloads/pdf/latest/):
training set SMILES: 4316 --> 3504 unique with stereo (delta=812)
test set SMILES: 1079 --> 1032 unique with stereo (delta=47)
train+test unique: 4536 --> 4117 unique with stereo (delta=419)
Basically, the test set is not unique, the training set is not unique, and the unique compounds remaining in each set, when combined (to calculate the train/test overlap), are also not unique.
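The bookkeeping behind such counts can be sketched in a few lines of plain Python. This is not the OpenBabel procedure itself, only the set arithmetic, with the identifier function left pluggable. The function name `dedup_report` and the toy lookup table `TOY_KEY` are made up for this illustration; in practice the key would come from a toolkit (InChIKey or canonical stereo SMILES).

```python
def dedup_report(train, test, key=lambda s: s):
    """Count unique entries per set and the train/test overlap under `key`."""
    train_keys = {key(s) for s in train}
    test_keys = {key(s) for s in test}
    return {
        "train_total": len(train),
        "train_unique": len(train_keys),
        "test_total": len(test),
        "test_unique": len(test_keys),
        # Sum of the two unique counts, before removing shared entries.
        "combined": len(train_keys) + len(test_keys),
        "combined_unique": len(train_keys | test_keys),
        "overlap": len(train_keys & test_keys),
    }


# Toy stand-in for a real toolkit call. Assumption: differently written
# SMILES of the same stereoisomer are mapped to that isomer's InChIKey.
TOY_KEY = {
    "C\\C=C\\C": "IAQRGUVFOMOMEM-ONEGZZNKSA-N",  # trans
    "C/C=C/C": "IAQRGUVFOMOMEM-ONEGZZNKSA-N",    # trans, written differently
    "C\\C=C/C": "IAQRGUVFOMOMEM-ARJAWSKDSA-N",   # cis
    "C/C=C\\C": "IAQRGUVFOMOMEM-ARJAWSKDSA-N",   # cis, written differently
    "CC(F)(Cl)Br": "IUEOVIJFFFDZTG-UHFFFAOYSA-N",
}

train = ["C\\C=C\\C", "C\\C=C/C", "C/C=C\\C"]
test = ["C/C=C/C", "CC(F)(Cl)Br"]

print(dedup_report(train, test))                   # raw strings as identity
print(dedup_report(train, test, key=TOY_KEY.get))  # structure-based identity
```

On the raw strings nothing looks duplicated and the overlap is zero; with a structure-based key the training set shrinks and one test molecule turns out to be in the training set as well. Note that "combined" minus "combined_unique" always equals "overlap", which is the same arithmetic as the train+test line above.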
Here is the script for RDKit that uses “Chem.MolToSmiles” with chirality (includeChirality=true):
http://www.mayachemtools.org/docs/scripts/html/RDKitRemoveDuplicateMolecules.html It is part of a larger toolkit with lots of other useful functions: http://www.mayachemtools.org/docs/scripts/html/index.html
The results are similar (delta=1 for the training set: 3505 unique versus 3504 with OpenBabel). I am not sure where that difference comes from, but it is in the same range, and the script uses the “rdkit.Chem.rdMolHash.HashFunction.CanonicalSmiles” function. There could be other issues, such as protomers and tautomers, but the numbers speak for themselves.
python C:\mayachemtools\bin\RDKitRemoveDuplicateMolecules.py --ov -i train-direct.smi -o train-dups.smi
Total number of molecules: 4316
Number of valid molecules: 4316
Number of ignored molecules: 0
Total number of unique molecules: 3505
Total number of duplicate molecules: 811
Total time: 5 wallclock secs ( 2.44 process secs)
python C:\mayachemtools\bin\RDKitRemoveDuplicateMolecules.py --ov -i test-direct.smi -o test-dups.smi
Total number of molecules: 1079
Number of valid molecules: 1079
Number of ignored molecules: 0
Total number of unique molecules: 1032
Total number of duplicate molecules: 47
Total time: 1 wallclock secs ( 0.70 process secs)
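For anyone who wants to cross-check without MayaChemTools, a rough equivalent can be written directly against RDKit. This is only a sketch of the idea, not what the script above does internally: it assumes RDKit is installed, and the helper names `canonical_smiles` and `count_unique` are made up here. It relies on `Chem.MolFromSmiles` and on `Chem.MolToSmiles` with its `isomericSmiles` switch.

```python
def canonical_smiles(smiles, use_stereo=True):
    """Return RDKit's canonical SMILES, with or without stereo information."""
    from rdkit import Chem  # imported lazily so the module loads without RDKit

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # isomericSmiles=True keeps R/S and E/Z marks in the output string.
    return Chem.MolToSmiles(mol, isomericSmiles=use_stereo)


def count_unique(smiles_list, use_stereo=True):
    """Number of distinct molecules after canonicalisation."""
    return len({canonical_smiles(s, use_stereo) for s in smiles_list})


# Two spellings of trans-but-2-ene plus one cis-but-2-ene.
EXAMPLES = ["C\\C=C\\C", "C/C=C/C", "C\\C=C/C"]

if __name__ == "__main__":
    print("with stereo:   ", count_unique(EXAMPLES, use_stereo=True))
    print("without stereo:", count_unique(EXAMPLES, use_stereo=False))
```

With stereo kept, the two trans spellings should collapse into one entry while cis stays separate; with stereo dropped, all three collapse into plain but-2-ene. Running both modes over the train and test files is a quick way to see how much of the duplication is stereo-related.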
So overall I would not trust faulty SMILES codes. I am sure that in ambiguous situations, for example when the specific enantiomer is not known, multiple smells could be recorded for the same molecule. However, that will hamper the overall outcome, because the machine learning process is based on the (faulty) and overlapping SMILES codes.