Is the evaluation method for the leaderboard scores the same as in local_evaluation.py
?
Are the prompts and LLM used for evaluation also the same?
I’m wondering because the scores I get locally don’t match the ones on the leaderboard.
The evaluation of the leaderboard follows a similar process as local_evaluation.py
. However, the exact prompts used to evaluate the answers are different.
Therefore, some difference between local evaluations and the leaderboard should be expected.
Thanks!
May I ask why some submissions seems correctly submitted but not graded?
For example:
“AIcrowd | Single-source Augmentation | Submissions #283034”
“AIcrowd | Single-source Augmentation | Submissions #283028” (not mine)
This should be a transient bug. We will trigger re-evaluations to them, and they should be fine after that.
I am also having a similar issue with my submission.
I get similar number of “I don’t know” responses and exact correct matches as my local evaluation. But I get zero accuracy from the judge. I feel like the judge api might have errors for my submission. If it was from my side, I wouldn’t be able to have the same missing and exact correct match counts.
Yes, it seems that a bug has recently appeared.
It looks like no one has been able to get a “correct” other than exact match.
Even when submitting the exact same commit as a submission that achieved “correct” on April 29, I can’t get the same score.
@yilun_jin8
Did you make any changes to the evaluation metric since then?
We will investigate this and come back to you as soon as possible.
@Camaro @aerdem4 We have found the cause of the error and it’s been fixed. We have triggered re-evaluations on the recent submissions, and you will see correct scores after the re-evaluations are done.
Thanks, have you completed the process? It seems that some submissions have not been re-evaluated yet.
@Camaro @aerdem4 From our end, we have finished all re-evaluations. Please give us your submission IDs if you think that they haven’t been updated.
I’ve confirmed that now it’s updated. Thanks!
@yilun_jin8
Another question. What if all the top 10 teams in phase 2 will use submissions with prompt injection, like on the current leaderboard? I think the current auto-evaluation method is too weak against prompt injection and can be easily exploited.
I believe manual evaluators would consider such answer invalid(wrong), but since only 10 teams are selected for manual evaluation, there’s a risk that none of the top submissions are meaningful.
We acknowledge the concern, and are actively discussing this with Meta.
We hope to have a response to this question soon.
Thanks for the quick action!
In my humble opinion, modifying the evaluation prompt is not a solution.
You just need to declare that a prompt injection solution will be eliminated before selecting the top 10 teams.
I agree.
And having being closely involved in the final due-diligence of many competitions on AIcrowd, I assure you, that we do reserve the right to disqualify submissions that are clearly trying to exploit a loophole, or are not aligned with the spirit of the competition.
We will check in with the Meta team, on how best to include this officially in the Rules of the competition, if its not already addressed.