Out of 4316 cases in the training set, only 3504 compound structures are unique. That means the training set contains 812 duplicate structures (18.81%). While that could be intentionally introduced noise, or even a decoy set, duplicates are also a quite common SMILES error and could simply be mis-annotations.
The implications are severe, however: some of these duplicates carry different flavor annotations.
Compound #288 and #1664 are exactly the same structure. This could be a canonicalization problem: if the data providers did not use canonical (unique) SMILES, the same compound can appear under several SMILES strings and collect multiple annotations. Or, as mentioned before, it is noise or a decoy set intentionally introduced by the competition providers.
Either way, duplicate structures with different flavor annotations have to be removed (garbage in --> garbage out); otherwise the trained model learns contradictory labels. Where the annotations are the same, all but one copy can be removed, so the model does not overfit by seeing the same compound twice.
See details here:
Trust, but verify: On the importance of chemical structure curation in cheminformatics and QSAR modeling research

There are many ways to deal with this; one example (duplicate removal in Python) is given here:
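As a minimal sketch of the cleanup logic described above (the compound IDs, SMILES, and flavor labels below are made-up toy data; in practice the SMILES should first be canonicalized, e.g. with RDKit's Chem.MolToSmiles, so that equivalent structures compare equal):

```python
from collections import defaultdict

# Toy records: (compound_id, canonical_smiles, flavor) -- hypothetical data,
# not from the actual training set.
records = [
    (288,  "CCO", "sweet"),
    (1664, "CCO", "sweet"),   # exact duplicate: keep only one copy
    (301,  "CCC", "bitter"),
    (955,  "CCC", "sweet"),   # conflicting annotations: drop the whole group
    (412,  "CCN", "fishy"),
]

# Group all entries by their (canonical) SMILES string.
groups = defaultdict(list)
for cid, smiles, flavor in records:
    groups[smiles].append((cid, flavor))

cleaned = []
for smiles, entries in groups.items():
    flavors = {flavor for _, flavor in entries}
    if len(flavors) == 1:
        # Annotations agree: keep a single representative compound.
        cleaned.append((entries[0][0], smiles, entries[0][1]))
    # else: contradictory labels -> remove all copies (garbage in, garbage out)

print(cleaned)
```

With the toy data above, the duplicate pair #288/#1664 collapses to one entry and the contradictory #301/#955 pair is dropped entirely.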
I have to include a picture below, because the web editor here actively destroys SMILES codes:
Training set structural duplicates (excerpt):