"AssertionError: Unknown smell words provided in the predictions file"

I got this error for my submission, but the file seems to be OK.
My checking script:

import pandas as pd

def check(predictions, vocabulary):
    for prediction in predictions:
        for single_sent in prediction.split(";"):
            for single_odor in single_sent.split(","):
                if single_odor not in vocabulary:
                    print("WTF " + single_odor)
                    return False
    return True

vocabulary = set([x.strip() for x in open('vocabulary.txt')])
submit_df = pd.read_csv('my_submit.csv')
check(submit_df.PREDICTIONS.values, vocabulary)
# True

Is there a way to get more details about the root cause of the error?

# df has the columns 'SMILES', 'SENTENCE', 'PREDICTIONS'
# LABEL_COL is the word list read from vocabulary.txt
def check_df_error(df):
    predict = df['PREDICTIONS'].values
    word = set()
    for p in predict:
        p = p.replace(';',',').split(',')
        p = set(p)
        word = word.union(p)

    label = set(LABEL_COL)
    diff = label.symmetric_difference(word)
    # print('diff', diff)  # would also include vocabulary words not used in any prediction
    print('diff:', diff - label.intersection(diff))  # predicted words not found in LABEL_COL at all

I suspect you have a '' entry (i.e. an empty string).
Note that in your code, if split returns an empty list, it will not do any checking at all! This can happen if you have a bug or forget to insert ',' or ';' in your prediction.
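To make the empty-string theory easy to test locally, here is a small sketch of how Python's `str.split` behaves with an explicit separator. As far as I know it never returns an empty list in that case; an empty input or a stray separator produces `''` entries instead, and those would fail a vocabulary lookup. The helper name `split_prediction` is made up for illustration, and the `;` / `,` layout is assumed from the snippets in this thread.

```python
def split_prediction(prediction):
    """Split a prediction string into sentences (';') and words (',')."""
    return [sentence.split(",") for sentence in prediction.split(";")]

print(split_prediction(""))               # [['']]
print(split_prediction("fruity,,woody"))  # [['fruity', '', 'woody']]
print(split_prediction("fruity,woody;"))  # [['fruity', 'woody'], ['']]
```

So an empty string would show up as an unknown word in a check like the one above rather than being skipped.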


I’m getting the same error from the following dummy submission:

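import csv
import pandas as pd

output = pd.read_csv("../data/raw/test.csv", sep=",")
output["PREDICTIONS"] = ";".join([",".join(["fruity"]*3) for i in range(5)])
output.to_csv("submission.csv",  # placeholder path; the original call line is missing from the post
    header=True, index=False,
    sep=",", quoting=csv.QUOTE_MINIMAL)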

The header and a random line of the csv file:


I’m failing to see what the problem is… maybe an encoding issue?
Could the error message give more insight into which smell words are unknown?

I did a binary search to debug the problem and ran out of submissions for today, but my current hypothesis is that this error is raised when there are duplicates in a sentence.

Try submitting a dummy like “fruity;green;herbal”. I think it will pass the check.

ok i tried:

";".join(["fruity,woody,herbal" for i in range(5)])

and it worked. So yeah, you can’t have duplicates in the 3 odors group
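In case it helps others check before spending a submission, here is a rough local validator that reports empty, unknown, and duplicated words per sentence. It is only a sketch based on what this thread found, not the organizers' evaluator; the function name `find_problems` and the toy vocabulary are invented for the example, and the format (sentences separated by `;`, words by `,`) is assumed from the posts above.

```python
def find_problems(prediction, vocabulary):
    """Return (sentence_index, word, reason) tuples for one prediction string."""
    problems = []
    for i, sentence in enumerate(prediction.split(";")):
        seen = set()  # words already met in this sentence
        for word in sentence.split(","):
            if word == "":
                problems.append((i, word, "empty"))
            elif word not in vocabulary:
                problems.append((i, word, "unknown"))
            elif word in seen:
                problems.append((i, word, "duplicate"))
            seen.add(word)
    return problems

vocab = {"fruity", "woody", "herbal"}  # toy vocabulary for illustration only
print(find_problems("fruity,fruity,fruity;fruity,woody,herbal", vocab))
# [(0, 'fruity', 'duplicate'), (0, 'fruity', 'duplicate')]
```

Running it over every row of the PREDICTIONS column and printing the non-empty results should point at the offending sentence directly.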

Thanks for pointing it out!

@mohanty, I believe this should be fixed or highlighted on the description page. Since the Jaccard index is calculated over sets, I expected the scoring function to work fine with duplicates.

@guillaumegodin: I think @mtrofimov's point was more about having a helpful error message so people are not left confused.

We are debugging this particular issue, and will get back on this thread with an update soon.


@cedric_bouysset @mtrofimov: Confirming that we have updated the evaluator to be more lenient towards repeated words in the same sentence. The bug was only in the sentence validation. During score computation, the list of words was always cast to a set, which ensured that duplicate occurrences of the same word would not affect the final score.
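For anyone wondering why duplicates cannot change the score once the words are cast to a set, here is a minimal sketch of a Jaccard index over word sets. This is not the evaluator's actual code; the function name `jaccard` and the handling of two empty sets are assumptions made for the example.

```python
def jaccard(predicted_words, true_words):
    """Jaccard index |A & B| / |A | B| computed over sets of words."""
    a, b = set(predicted_words), set(true_words)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)

truth = ["fruity", "green"]
print(jaccard(["fruity", "fruity", "fruity"], truth))  # 0.5
print(jaccard(["fruity"], truth))                      # 0.5
```

Both calls give the same value because the repeated word collapses to a single set element before the intersection and union are taken.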