SMILES Canonicalization

contrebande · November 6, 2020, 8:24pm

If I take all the SMILES contained in both train.csv and text.csv and canonicalize them in Java using CDK (all my NLP, ETL, streaming analytics, etc. is in Java), I find that 4055 out of a total of 5397 cannot be found on The Good Scents Company website, which contains more than 30,000 aromachemicals. That doesn’t make much sense to me (I expected far fewer “not found” entries). I am using the CDK-canonicalized SMILES representation to index the TGSC data as well. I have detailed the problem in GitHub issues for the respective projects (CDK, RDKit), because I think the cause lies in the different canonicalization algorithms used by RDKit (which I assume was used for the challenge dataset) and CDK. There is also the unfortunate issue that I can’t seem to get RDKit up and running in a Java environment.
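For anyone who wants to reproduce the count, here is a minimal sketch of the matching step only. It is not my actual pipeline: the class name `SmilesMatchSketch`, its methods, and the sample strings are made up for illustration, and the canonicalizer is a placeholder that merely trims whitespace. In a real run that function would parse each SMILES and regenerate it with a single toolkit (CDK in my case), applied identically to the challenge data and the TGSC index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration; not part of CDK or RDKit.
final class SmilesMatchSketch {

    // Index the reference collection by the canonical form of each SMILES.
    static Set<String> buildIndex(List<String> referenceSmiles,
                                  UnaryOperator<String> canonicalizer) {
        Set<String> index = new HashSet<>();
        for (String smiles : referenceSmiles) {
            index.add(canonicalizer.apply(smiles));
        }
        return index;
    }

    // Collect the query SMILES whose canonical form is missing from the index.
    static List<String> notFound(List<String> querySmiles,
                                 Set<String> index,
                                 UnaryOperator<String> canonicalizer) {
        List<String> misses = new ArrayList<>();
        for (String smiles : querySmiles) {
            if (!index.contains(canonicalizer.apply(smiles))) {
                misses.add(smiles);
            }
        }
        return misses;
    }

    public static void main(String[] args) {
        // ASSUMPTION: placeholder canonicalizer that only trims whitespace.
        // Swap in a real toolkit-based canonicalizer for meaningful results.
        UnaryOperator<String> canonicalizer = String::trim;

        List<String> reference = List.of("CCO", "c1ccccc1");      // made-up sample
        List<String> query = List.of(" CCO", "OCC", "c1ccccc1 "); // made-up sample

        Set<String> index = buildIndex(reference, canonicalizer);
        List<String> misses = notFound(query, index, canonicalizer);

        // "OCC" and "CCO" both denote ethanol, but a string-level comparison
        // cannot know that, so "OCC" is reported as not found.
        System.out.println(misses.size() + " of " + query.size()
                + " not found: " + misses);
    }
}
```

With the placeholder, "OCC" is reported as missing even though it is the same molecule as "CCO". That is the kind of spurious “not found” I suspect I am seeing whenever the two sides are not canonicalized by the same algorithm.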

If anybody here can help, I’d be infinitely grateful. Thanks in advance. For the moment this blocks my plan to submit something for Round 1 using domain knowledge, but I will keep surveying the notebooks and papers lying around in the meantime…

contrebande · November 6, 2020, 9:28pm

And when I configure CDK to canonicalize more aggressively, I’m down to 399 “not founds” (there might be a lot of false positives, however). But there is hope. I will wait for an answer from both projects’ dev teams to see if I can bring that number down further. Zero (0) is what I’m aiming for.