Dear Organizers,
We have observed many instances where the automatic evaluation script, local_evaluation.py, is not stable when processing short answers. These answers, despite being correct, are often shorter than the provided “ground truth” responses.
For example, in the v.0.1.2 validation set:
- Interaction ID: 00663475-7bf0-4c70-bba5-80bd9425082d
- Query: When did this artist release his first studio album?
- Ground Truth: Chuck Berry released his first studio album, After School Session, in 1957.
- Agent Response: 1957
- local_evaluation.py: {'accuracy': False}
We consider the agent’s response of “1957” to be correct, yet local_evaluation.py frequently marks it as incorrect. This is because the system prompt reads:
"You are an expert evaluator for question answering systems. "
"Your task is to determine if a prediction correctly answers a question based on the ground truth.\n\n"
"Rules:\n"
"1. The prediction is correct if it captures all the key information from the ground truth.\n"
"2. The prediction is correct even if phrased differently as long as the meaning is the same.\n"
"3. The prediction is incorrect if it contains incorrect information or is missing essential details.\n"
"Output a JSON object with a single field 'accuracy' whose value is true or false."
We are wondering how such cases will be judged in the online automatic evaluation, and more specifically how the organizers will assess them. Personally, I would consider “1957” a perfectly correct answer (score = 1), though it could also be treated as an acceptable answer (score = 0.5).
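For context, below is a minimal sketch of how we understand the judging step to behave. The client setup, model name, user-prompt format, and judge() helper are our own illustration (assuming an OpenAI-style chat client), not the actual code of local_evaluation.py; it is only meant to show why a terse but correct answer is easily marked false under Rule 1.

import json
from openai import OpenAI  # assumed OpenAI-style client, not necessarily what the script uses

SYSTEM_PROMPT = (
    "You are an expert evaluator for question answering systems. "
    "Your task is to determine if a prediction correctly answers a question based on the ground truth.\n\n"
    "Rules:\n"
    "1. The prediction is correct if it captures all the key information from the ground truth.\n"
    "2. The prediction is correct even if phrased differently as long as the meaning is the same.\n"
    "3. The prediction is incorrect if it contains incorrect information or is missing essential details.\n"
    "Output a JSON object with a single field 'accuracy' whose value is true or false."
)

def judge(question: str, ground_truth: str, prediction: str) -> bool:
    # Assemble a judge request and parse the JSON verdict (illustrative only).
    client = OpenAI()
    user_prompt = (
        f"Question: {question}\n"
        f"Ground truth: {ground_truth}\n"
        f"Prediction: {prediction}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["accuracy"]

# "1957" omits the artist and album title present in the ground truth, so
# Rule 1 ("captures all the key information") steers the judge toward
# {'accuracy': False}, even though the answer directly addresses the query.
print(judge(
    "When did this artist release his first studio album?",
    "Chuck Berry released his first studio album, After School Session, in 1957.",
    "1957",
))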
We’ve noted this is a common occurrence because the ground truth in the current data combines the answer and a brief reason in a single field (ans_full), which differs from last year’s format, where both short and full answers were provided.
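For reference, here is a small illustration of the difference we mean. Only the ans_full field name is taken from the current data; the other keys, including those shown for last year’s format, are placeholders rather than the actual field names.

# Illustrative records only: "ans_full" is the field present in the current data;
# every other key below is a placeholder.
current_record = {
    "query": "When did this artist release his first studio album?",
    "ans_full": "Chuck Berry released his first studio album, After School Session, in 1957.",
}
last_year_record = {
    "query": "When did this artist release his first studio album?",
    "short_answer": "1957",  # placeholder key name
    "full_answer": "Chuck Berry released his first studio album, After School Session, in 1957.",  # placeholder key name
}

With only ans_full available, the judge has no short reference answer to compare a terse prediction against.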
Thank you for your clarification on this matter.
Sincerely,
Fanyou