Evaluation Method of Leaderboard

Is the evaluation method for the leaderboard scores the same as in local_evaluation.py?
Are the prompts and LLM used for evaluation also the same?
I’m wondering because the scores I get locally don’t match the ones on the leaderboard.

The leaderboard evaluation follows a similar process to local_evaluation.py. However, the exact prompts used to judge the answers are different.

Therefore, some differences between local evaluation results and the leaderboard should be expected.
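For anyone comparing the two, here is a rough sketch of how an LLM-as-judge loop like this typically looks. The prompt template, the `llm_client` interface, and the exact matching rules below are illustrative assumptions, not the actual evaluation code; the point is that the judge's verdict depends directly on the prompt, so a different prompt on the leaderboard can legitimately shift scores even for identical predictions.

```python
# Minimal sketch of an LLM-as-judge evaluation step. The prompt template,
# the llm_client interface, and the matching rules are illustrative
# assumptions, not the actual leaderboard or local_evaluation.py code.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Ground truth: {ground_truth}\n"
    "Prediction: {prediction}\n"
    "Reply with exactly one word: correct or incorrect."
)

def judge_answer(llm_client, question, ground_truth, prediction):
    """Label one prediction as 'correct', 'incorrect', or 'missing'."""
    pred = prediction.strip().lower()
    if pred == ground_truth.strip().lower():
        return "correct"  # exact match, no judge call needed
    if "i don't know" in pred:
        return "missing"  # "I don't know" answers are counted separately
    # Everything else is decided by the judge LLM, so the verdict depends
    # directly on JUDGE_PROMPT; a different prompt on the server side can
    # produce different verdicts for the same predictions.
    verdict = llm_client.complete(
        JUDGE_PROMPT.format(
            question=question, ground_truth=ground_truth, prediction=prediction
        )
    ).strip().lower()
    return "correct" if verdict.startswith("correct") else "incorrect"
```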

Thanks!
May I ask why some submissions seem to have been submitted correctly but were not graded?

For example:
AIcrowd | Single-source Augmentation | Submissions #283034
AIcrowd | Single-source Augmentation | Submissions #283028 (not mine)

This should be a transient bug. We will trigger re-evaluations for them, and they should be fine after that.

I am also having a similar issue with my submission.

I get a similar number of “I don’t know” responses and exact correct matches as in my local evaluation, but I get zero accuracy from the judge. I feel like the judge API might be failing on my submission. If the issue were on my side, I wouldn’t get the same missing and exact-match counts.
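For context, the counts I’m comparing are roughly what a tally like the one below would produce from the per-example labels. The penalized score in this sketch is my own assumption about how hallucinations might be weighted, not necessarily the official leaderboard formula.

```python
from collections import Counter

def summarize(labels):
    """Aggregate per-example labels ('correct' / 'incorrect' / 'missing').

    The penalized score (correct minus incorrect) is only an assumption
    about how hallucinations might be weighted; refer to the official
    metric for the exact leaderboard formula.
    """
    counts = Counter(labels)
    n = max(len(labels), 1)
    return {
        "accuracy": counts["correct"] / n,
        "missing_rate": counts["missing"] / n,
        "hallucination_rate": counts["incorrect"] / n,
        "score": (counts["correct"] - counts["incorrect"]) / n,
    }
```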

Yes, it seems that a bug has recently appeared.
It looks like no one has been able to get a “correct” other than an exact match.
Even when I submit the exact same commit as a submission that scored “correct” on April 29, I can’t get the same score.

@yilun_jin8
Did you make any changes to the evaluation metric since then?

We will investigate this and come back to you as soon as possible.

@Camaro @aerdem4 We have found the cause of the error and it’s been fixed. We have triggered re-evaluations on the recent submissions, and you will see correct scores after the re-evaluations are done.

Thanks, have you completed the process? It seems that some submissions have not been re-evaluated yet.
