Questions about evaluation metric

How would the score be calculated when the ground truth has fewer than 3 descriptions or more than 3 descriptions?
Should our prediction always have 3 kinds of smell in each sentence?


Agreed, that’s an important question. To test it, you could upload a file in which one instance has fewer than 3 smells in a sentence and see whether it is graded.

It’s relevant because if the model predicts only 1 or 2 labels with relatively high probability, you wouldn’t want a 3rd, unlikely one added, since any additional unmatched item lowers the similarity score.
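
As a rough illustration of that effect, here is a minimal sketch that assumes the score behaves like a Jaccard similarity over sets of labels. That is only an assumption: the thread does not say which metric the organizers actually use, and the labels ("sweet", "fruity", "woody") and the `jaccard` helper are made up for the example.

```python
def jaccard(predicted, truth):
    """Jaccard similarity between two collections of labels (assumed metric)."""
    p, t = set(predicted), set(truth)
    if not p and not t:
        # Two empty label sets are treated as identical.
        return 1.0
    return len(p & t) / len(p | t)


# Hypothetical ground truth with only two smells.
truth = ["sweet", "fruity"]

# Predicting exactly the two confident labels.
two_labels = jaccard(["sweet", "fruity"], truth)

# Padding the prediction with a third, unmatched label.
three_labels = jaccard(["sweet", "fruity", "woody"], truth)

print(two_labels, three_labels)
```

Under this assumed metric, the padded prediction scores lower than the two-label one, because the extra label enlarges the union without adding to the intersection. Whether the real scorer behaves the same way would still need to be checked with a test submission.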

Currently, in round one, we are asked to predict 1 to 3 tags per sentence.
We can predict up to 5 sentences.