I got this error for my submission, but the file itself seems to be OK.
My checking script:
def check(predictions, vocabulary):
    for prediction in predictions:
        for single_sent in prediction.split(";"):
            for single_odor in single_sent.split(","):
                if single_odor not in vocabulary:
                    print("WTF " + single_odor)
                    return False
    return True
import pandas as pd

vocabulary = set([x.strip() for x in open('vocabulary.txt')])
submit_df = pd.read_csv('my_submit.csv')
check(submit_df.PREDICTIONS.values, vocabulary)
# True
Is there a way to get more details about the root cause of the error?
# df has the columns 'SMILES', 'SENTENCE', 'PREDICTIONS'
# LABEL_COL is the word list read from vocabulary.txt
def check_df_error(df):
    predict = df['PREDICTIONS'].values
    word = set()
    for p in predict:
        p = p.replace(';', ',').split(',')
        p = set(p)
        word = word.union(p)
    label = set(LABEL_COL)
    diff = label.symmetric_difference(word)
    #print('diff', diff)  # words not used in any prediction at all
    print('diff:', diff - label.intersection(diff))  # words not found in LABEL_COL at all
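As a side note, `diff - label.intersection(diff)` keeps exactly the words that appear in predictions but not in the label vocabulary, which is the same as the simpler `word - label`. A quick sketch with toy sets (the values are illustrative, not from the actual challenge data):

```python
label = {"fruity", "green", "herbal"}   # toy vocabulary
word = {"fruity", "grassy"}             # toy set of predicted words

diff = label.symmetric_difference(word)
unknown = diff - label.intersection(diff)
print(unknown)        # words not in the vocabulary
print(word - label)   # same result, shorter form
```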
I suspect you have a '' entry (i.e. an empty string).
Note that `str.split` with an explicit separator never returns an empty list: an empty or malformed prediction produces empty-string tokens instead. This can happen if you have a bug, or forget to insert ',' or ';' in your prediction.
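A quick sketch (plain Python, no challenge-specific code) of how empty-string tokens can sneak into a prediction:

```python
# str.split with an explicit separator never returns an empty list:
# an empty string or a trailing/doubled separator yields '' tokens.
print("".split(","))               # ['']
print("fruity;;green".split(";"))  # ['fruity', '', 'green']
print("fruity,green,".split(","))  # ['fruity', 'green', '']
```

So a vocabulary check should flag empty tokens explicitly, since `''` will not be in the vocabulary.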
I’m failing to see what the problem is… maybe an encoding issue?
Could we get some more insight in the error message, e.g. which smell words are unknown?
I did a binary search to debug the problem and ran out of submissions for today, but my current hypothesis is that this error is raised when there are duplicates in a sentence.
Try submitting a dummy like “fruity;green;herbal” - I think it will pass the check.
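If that hypothesis is right, a small helper can find the offending rows before you spend a submission. This is a sketch, assuming the same ';' (sentence) and ',' (word) format as the scripts above; the function name is made up:

```python
def find_duplicate_sentences(predictions):
    """Return (row_index, sentence) pairs where a sentence repeats a word."""
    bad = []
    for i, prediction in enumerate(predictions):
        for sentence in prediction.split(";"):
            words = sentence.split(",")
            if len(words) != len(set(words)):
                bad.append((i, sentence))
    return bad

print(find_duplicate_sentences(["fruity,green;herbal", "fruity,fruity;green"]))
# [(1, 'fruity,fruity')]
```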
@mohanty, I believe this should be fixed or highlighted on the description page. Since the Jaccard index is calculated over sets, I expected the scoring function to handle duplicates correctly.
@cedric_bouysset @mtrofimov: Confirming that we have updated the evaluator to be more forgiving of repeated words in the same sentence. The bug was only in the sentence validation. During score computation, the list of words was always cast to a set, which ensured that duplicate occurrences of the same word would not affect the final score.
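For reference, a minimal sketch of why the set cast makes duplicates harmless for the Jaccard index (this is an illustration, not the evaluator's actual code):

```python
def jaccard(predicted_words, true_words):
    """Jaccard index over sets: |A & B| / |A | B|."""
    a, b = set(predicted_words), set(true_words)
    return len(a & b) / len(a | b)

truth = ["fruity", "green", "herbal"]
print(jaccard(["fruity", "green"], truth))            # 2/3
print(jaccard(["fruity", "fruity", "green"], truth))  # still 2/3
```

Casting to a set collapses repeats before the intersection and union are computed, so only sentence validation could ever reject duplicates.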