Question following the townhall meeting

Dear @guillaumegodin,

I have a question regarding one of your statements in yesterday’s townhall meeting for the Learning to Smell challenge.

You mentioned that rearranging the SMILES can improve accuracy on tasks. I have been trying to find a way to use this, but have not yet been successful. I found your contribution to RDKit for this, which works fine, but I am now stuck on how to actually use these additional SMILES. Any fingerprint-type embedding will be identical for all of the generated SMILES, so the extra SMILES add nothing when using fingerprint embeddings. I have tried multiple ways to represent SMILES without embeddings, such as char-to-int conversion with zero padding fed into LSTMs, but none predict above chance level. My background is not in chemistry, so I am likely missing something quite obvious here due to my lack of domain knowledge.
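For context, here is a minimal sketch (my own, not part of the original post) of the char-to-int conversion with zero padding I mean; the vocabulary is built from the data itself and index 0 is reserved for padding:

```python
def encode_smiles(smiles_list, max_len=None):
    """Encode SMILES strings as fixed-length integer sequences.

    Builds a character vocabulary from the data (index 0 is reserved
    for padding) and right-pads every sequence to max_len with zeros.
    """
    chars = sorted({c for s in smiles_list for c in s})
    char_to_int = {c: i + 1 for i, c in enumerate(chars)}
    if max_len is None:
        max_len = max(len(s) for s in smiles_list)
    encoded = []
    for s in smiles_list:
        seq = [char_to_int[c] for c in s]
        seq += [0] * (max_len - len(seq))  # zero padding on the right
        encoded.append(seq)
    return encoded, char_to_int

# Two SMILES of different lengths end up as sequences of equal length.
seqs, vocab = encode_smiles(["CCO", "c1ccccc1O"])
```

These padded integer sequences are what I then feed into the LSTM.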

Could you please point us toward a type of input representation that can make use of these newly generated SMILES?

Thank you in advance.

Best,

Cas van Boekholdt


Hi

You can use augmentation by adding multiple input SMILES for the same target:

Just replicate the same Y for every augmented version of the original canonical SMILES. With 5-fold augmentation you get 5 times more “data”. An untested trick is to introduce a little noise into the Y (randomly adding or removing some terms in the target) during this replication, but we have never tried this on our models.

Your data then looks like this:

Smiles 1, target 1
Smiles 1 aug 1, target 1
Smiles 1 aug 2, target 1
Smiles 1 aug 3, target 1
Smiles 1 aug 4, target 1

Smiles 2, target 2
Smiles 2 aug 1, target 2
Smiles 2 aug 2, target 2
Smiles 2 aug 3, target 2
Smiles 2 aug 4, target 2

etc…
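A minimal sketch of this replication step (my illustration, not Guillaume’s code): it assumes the randomized SMILES come from some augmenter, e.g. RDKit’s `MolToSmiles(mol, doRandom=True)`, which is stubbed out here with a hand-written cycle of ethanol SMILES so the example stays self-contained:

```python
import itertools

def augment_dataset(records, augment_fn, n_aug=4):
    """Replicate each (smiles, target) pair with n_aug randomized SMILES.

    augment_fn(smiles) should return one alternative SMILES string for
    the same molecule, e.g. RDKit's MolToSmiles(mol, doRandom=True).
    The target Y is simply copied for every augmented version.
    """
    out = []
    for smiles, target in records:
        out.append((smiles, target))  # keep the canonical form
        for _ in range(n_aug):
            out.append((augment_fn(smiles), target))  # new X, same Y
    return out

# Stand-in augmenter for illustration only; a real one would use RDKit.
variants = itertools.cycle(["OCC", "C(O)C", "C(C)O"])
aug = augment_dataset([("CCO", "fruity")], lambda s: next(variants), n_aug=4)
# One canonical + four augmented rows, all sharing the same target.
```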

By the way: nat comm paper is out https://www.nature.com/articles/s41467-020-19266-y

Best regards,

Guillaume


Thank you for the response, @guillaumegodin.

I can see how the augmentation would work in practice. However, when I create a fingerprint embedding of e.g. Smiles 1 and Smiles 1 aug 1, they are identical. So how does this replication add any value to the data? What kind of input representation preserves the difference between these augmented SMILES?

Best,
Cas


In this case you have to develop an embedding that is not identical for two “SMILES” versions of the same molecule. This is why using default fingerprints/embeddings is easy, while developing new ideas is more complex. You need to train on a very large database (ChEMBL 27, for example). In fact, TransformerCNN is a method we developed that does exactly this job.
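To see why a sequence-level model can exploit the augmentation while a graph-derived fingerprint cannot, here is a toy illustration (mine, not from the thread): two SMILES of the same molecule contain the same atoms but different character sequences, so a character-level model receives genuinely different training inputs:

```python
# Two valid SMILES for the same molecule (ethanol). Assumption: both were
# produced by randomized SMILES generation, e.g. RDKit's doRandom=True.
a, b = "CCO", "OCC"

# A fingerprint is computed from the molecular graph, so it is identical
# for a and b. A character-level sequence view is not:
seq_a = [ord(c) for c in a]
seq_b = [ord(c) for c in b]

# Same multiset of characters (same molecule), different order.
assert sorted(a) == sorted(b)
assert seq_a != seq_b  # the sequence model sees two distinct data points
```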



Best regards,
Guillaume

Hi @guillaumegodin,

In your townhall meeting presentation, at 44m09s:

Graph as-is today, is not augmentable, it’s limited to your size of the data, it’s something you will not obtain good results with it, I can tell you. If you compare the Google paper with the DREAM challenge that was 3 years ago or 4 years ago, Google wins, with graph, less than 4% accuracy. This is a very important point. They multiply by 10 the size of the data and win only 4% accuracy.

And later at 45m34s:

We’ve proven, we write in papers, that augmentation with SMILES is better than graphs.

I just want to confirm which papers you are referring to. Is the first one the October 2019 paper titled Multitask Learning On Graph Neural Networks Applied To Molecular Property Predictions, where you compare different GNN architectures: “GGRNet”, “GAIN” and “GIN”? And is the second one, where you compare the “GIN” architecture to your Transformer-CNN with SMILES augmentation method, the October 2020 paper titled Beyond Chemical 1D knowledge using Transformers? Did you choose “GIN” in your second paper to represent GNN architectures because you deemed it superior to the others from your previous paper?

Can you confirm that I have the above references and facts right, please?

I’m asking because I would really like to reproduce a comparison between your augmented-SMILES Transformer-CNN method and the most recent GNN advances, within the specific context of the task at hand: predicting organoleptic olfactive labels (from a pre-defined vocabulary/ontology/taxonomy) from a SMILES representation. And perhaps throwing some other specific descriptors of interest into the SMILES-derived embedding, just to be sure.

Thanks.