I’ve noticed that some of this data can actually be found on the internet, e.g. data from http://www.thegoodscentscompany.com/
There are lots of such sources; “goodscentscompany.com” is only one of them. But note that, for the same molecule, the labels there differ from the labels in this challenge.
@hengck23 is right: the labels are different because the sources are different. We decided to use a top-5 metric in order to take this variation between dataset sources into account, and also to make it possible to translate from one dataset to another. The question of how to merge datasets is a crucial point in AI.
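To make the top-5 idea concrete, here is a rough sketch of how such a score could be computed. It is only an illustration, not the official evaluation code: it assumes each submission gives up to five candidate label sets per molecule and that the best Jaccard overlap with the reference labels is kept. The function names `jaccard` and `top5_score` and the example labels are made up for this post.

```python
def jaccard(predicted, reference):
    """Jaccard overlap between two collections of odor labels."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # two empty label sets are treated as identical
    return len(predicted & reference) / len(predicted | reference)


def top5_score(candidates, reference):
    """Best Jaccard overlap among the first five candidate label sets.

    Assumption: only the best of the five candidates counts, so a model
    is not penalised when one source words the same smell differently.
    """
    return max((jaccard(c, reference) for c in candidates[:5]), default=0.0)


# Hypothetical example: three candidate descriptions for one molecule.
candidates = [["fruity", "sweet"], ["woody"], ["fruity", "green", "apple"]]
reference = ["fruity", "apple", "green"]
score = top5_score(candidates, reference)
```

Scoring the best of several candidates is what gives some tolerance to label variation: a prediction that matches one plausible description of the molecule still gets credit.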
We spent around a month mapping and cleaning this dataset (reinforcing the descriptor classes with fewer occurrences). The descriptions you have are a more stable “version” of the original dataset in terms of statistical support. We used ontology tricks to do this job. I think it is also worth spending some time looking at the ontologies available on the internet.
In language and emotion there is no one Truth.
So if you want more data, you can try to find it in papers and books. You can also look at the Flavour domain; the dataset we provide is oriented towards the Fragrance domain. But what we are looking for is new architectures too, so I don’t think increasing the dataset will help there. Evaluating molecules is very expensive, and you need to train your panelists to get a “relevant” description sentence.