Is the evaluation method for the leaderboard scores the same as in local_evaluation.py?
Are the prompts and LLM used for evaluation also the same?
I’m wondering because the scores I get locally don’t match the ones on the leaderboard.
The leaderboard evaluation follows a similar process to local_evaluation.py. However, the exact prompts used to evaluate the answers are different. Therefore, some differences between local evaluation and the leaderboard scores should be expected.
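For intuition, the judge step looks roughly like the sketch below (a minimal sketch assuming an OpenAI-style client; the prompt wording and model name are placeholders rather than the leaderboard's actual configuration, which is exactly why borderline answers can be graded differently locally and on the leaderboard):

```python
# Minimal sketch of an LLM-as-judge check (OpenAI-style client assumed).
# JUDGE_PROMPT and the model name are illustrative placeholders; the
# leaderboard judge uses different prompt wording, so the same answer
# can receive a different verdict than in a local run.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Answer with exactly one word: correct or incorrect."
)


def judge(question: str, ground_truth: str, prediction: str) -> bool:
    """Return True if the judge model grades the prediction as correct."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the official judge model may differ
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                ground_truth=ground_truth,
                prediction=prediction,
            ),
        }],
        temperature=0.0,
    )
    verdict = response.choices[0].message.content.strip().lower()
    return verdict.startswith("correct")
```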
Thanks!
May I ask why some submissions seem to have been submitted correctly but are not graded?
For example:
“AIcrowd | Single-source Augmentation | Submissions #283034”
“AIcrowd | Single-source Augmentation | Submissions #283028” (not mine)
This should be a transient bug. We will trigger re-evaluations for them, and they should be fine after that.
I am also having a similar issue with my submission.
I get a similar number of “I don’t know” responses and exact correct matches as in my local evaluation, but I get zero accuracy from the judge. I suspect the judge API might have errored on my submission; if the problem were on my side, I wouldn’t be getting the same missing and exact-match counts.
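For context, this is roughly how I am tallying results locally (a minimal sketch; the `llm_judge` callable and the field names are placeholders assumed from a local_evaluation.py-style flow, not the exact implementation):

```python
def score_predictions(examples, llm_judge):
    """Tally exact matches, missing answers ("I don't know"), and
    judge-verified correct answers, then compute accuracy.

    `examples` is assumed to be a list of dicts with "ground_truth" and
    "prediction" keys; `llm_judge` is any callable returning True/False.
    """
    exact = missing = judged_correct = 0
    for ex in examples:
        pred = ex["prediction"].strip().lower()
        truth = ex["ground_truth"].strip().lower()
        if pred == "i don't know":
            missing += 1
        elif pred == truth:
            exact += 1
        elif llm_judge(ex["ground_truth"], ex["prediction"]):
            judged_correct += 1
    accuracy = (exact + judged_correct) / len(examples)
    return {
        "exact": exact,
        "missing": missing,
        "judged_correct": judged_correct,
        "accuracy": accuracy,
    }
```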
Yes, it seems that a bug has recently appeared.
It looks like no one has been able to get a “correct” other than by exact match.
Even when submitting the exact same commit as a submission that achieved “correct” on April 29, I can’t get the same score.
@yilun_jin8
Did you make any changes to the evaluation metric since then?
We will investigate this and come back to you as soon as possible.
@Camaro @aerdem4 We have found the cause of the error and it’s been fixed. We have triggered re-evaluations on the recent submissions, and you will see correct scores after the re-evaluations are done.
Thanks, have you completed the process? It seems that some submissions have not been re-evaluated yet.