Questions about the leaderboard

According to the current leaderboard (taking Track 1 as an example):

  1. The leaderboard appears to be sorted by accuracy rather than truthfulness, which contradicts the documentation:

  2. The calculation of truthfulness does not seem correct. A hallucination incurs a negative score, so truthfulness should be negative for a baseline with a high hallucination rate. Running local_evaluation from the starter kit gives a reasonable result, shown below:


    But I don’t know how the truthfulness on the leaderboard is calculated, and the value looks strange.
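For reference, here is a minimal sketch of how I would expect the truthfulness score to be computed. The function and score names are my own assumptions, not the starter kit's; only the hallucination penalty and the 0.5 for "acceptable" are mentioned in this thread.

```python
from typing import Iterable

# Assumed per-answer scores; adjust if the official evaluator differs.
SCORE_MAP = {
    "correct": 1.0,
    "acceptable": 0.5,
    "missing": 0.0,
    "hallucination": -1.0,
}

def truthfulness_score(ratings: Iterable[str]) -> float:
    """Mean per-answer score; a high hallucination rate pushes it below zero."""
    ratings = list(ratings)
    if not ratings:
        return 0.0
    return sum(SCORE_MAP[r] for r in ratings) / len(ratings)

# A baseline that hallucinates on most questions should come out negative,
# which is why a positive leaderboard value for such a baseline looks odd.
example = ["hallucination"] * 7 + ["correct"] * 2 + ["missing"]
print(truthfulness_score(example))  # -0.5
```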

+1 to your questions.
I have additional related questions for the organizers:

  • Is the prompt defined in local_evaluation.py the one used for the leaderboard's auto-evaluation?
  • If not, can you share the prompt used for the leaderboard?
  • If so, how does it define “acceptable” (score: 0.5)?