In the training dataset, there are several compounds with repeated values in the “sentence”. Take, for example, C=CCS, which has the following sentence: “alliaceous,cooked,alliaceous,alliaceous,roasted,burnt,meat”
What is the interpretation of the repetition of “alliaceous” in this sentence? Is this an error, or is there some actual meaning to the repetition?
@robert_allaway: Thanks for pointing it out. Odor words are repeated in some of the sentences because many low-frequency odor words were replaced with their parent categories in the first version of the dataset. Hence, in this case, all the repetitions can safely be treated as a single instance of the odor word in the same sentence.
In the example above, the sentence for C=CCS can be treated simply as “alliaceous,cooked,roasted,burnt,meat”.
In future rounds of the competition, we will release the data without the low-frequency odor word replacement.
In any case, given that each sentence represents a set of odor words, the representation does not affect the problem or the computation of the evaluation metric. We will, however, soon update the current dataset to remove the above-mentioned repetitions.
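To illustrate why the repetitions are harmless under a set interpretation, here is a minimal sketch. The helper name `as_set` is hypothetical and not part of any challenge tooling; it only assumes that a sentence is a comma-separated string of odor words.

```python
def as_set(sentence: str) -> set:
    """Split a comma-separated odor sentence into a set of words."""
    return {word.strip() for word in sentence.split(",") if word.strip()}


raw = "alliaceous,cooked,alliaceous,alliaceous,roasted,burnt,meat"
deduplicated = "alliaceous,cooked,roasted,burnt,meat"

# Both strings describe the same set, so any set-based metric
# treats them identically.
print(as_set(raw) == as_set(deduplicated))
```

Since a set discards duplicates by construction, the repeated “alliaceous” entries collapse into one element.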
@robert_allaway, thanks for raising this question. As Mohanty said, we decided to increase the number of mid-frequency terms by replacing low-frequency terms. Without this step, we would first have to work in a description space on the order of 1,000 terms, and second, we would be left with very high-frequency terms that are interesting but too general. We are trying to simplify the task and reduce data disparity.
The simplest description of your given molecule
Remember that we only focus on the first 3 terms when evaluating the model. The order of terms in a description matters a great deal: the earlier a term appears, the more important it is.
Does that mean that, for C=CCS, the prediction ‘alliaceous, cooked, roasted’ would get a higher score than ‘cooked, roasted, alliaceous’?
As far as I know, the Jaccard index does not depend on order.
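A quick sketch of that point, using the plain textbook Jaccard index (intersection size over union size). The function name `jaccard` and the way the score is computed here are assumptions for illustration; the challenge's actual scoring code may differ, for example in how it handles the top 3 terms or averages over compounds.

```python
def jaccard(predicted, actual) -> float:
    """Plain Jaccard index between two collections of odor words."""
    a, b = set(predicted), set(actual)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


truth = ["alliaceous", "cooked", "roasted", "burnt", "meat"]
prediction_1 = ["alliaceous", "cooked", "roasted"]
prediction_2 = ["cooked", "roasted", "alliaceous"]

# Same words in a different order: the two scores are identical.
print(jaccard(prediction_1, truth), jaccard(prediction_2, truth))
```

Both predictions share 3 words with the 5-word ground truth, so each scores 3/5 regardless of ordering.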
Today, we use the Jaccard metric for the first round. We will see what the scores are at the end of this round, and based on that we may use an order-aware metric. In the meantime, my comment was more about the fact that the first occurrence of a term wins in terms of priority. So ‘alliaceous,cooked,alliaceous,alliaceous,roasted,burnt,meat’ needs to be converted to ‘alliaceous,cooked,roasted,burnt,meat’ and not to ‘cooked,alliaceous,roasted,burnt,meat’.
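For anyone cleaning the data themselves, the "first occurrence wins" rule can be sketched as follows. The function name `dedupe_keep_first` is hypothetical; the sketch relies on `dict.fromkeys` preserving insertion order, which is guaranteed from Python 3.7 onward.

```python
def dedupe_keep_first(sentence: str) -> str:
    """Remove repeated odor words, keeping each word at its first position."""
    words = [word.strip() for word in sentence.split(",") if word.strip()]
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return ",".join(dict.fromkeys(words))


raw = "alliaceous,cooked,alliaceous,alliaceous,roasted,burnt,meat"
cleaned = dedupe_keep_first(raw)
print(cleaned)  # alliaceous,cooked,roasted,burnt,meat

# The first 3 terms, which are the ones the evaluation focuses on.
top_three = cleaned.split(",")[:3]
print(top_three)  # ['alliaceous', 'cooked', 'roasted']
```

Converting to a plain `set` and back would also remove the duplicates, but it would not preserve the original order, so it is not suitable if an order-aware metric is introduced later.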
@young @guillaumegodin: Also confirming that the datasets in the resources section have been updated to remove the duplicate items.
Thanks @mohanty and @guillaumegodin for clarifying and for updating the dataset!