According to the current leaderboard (take track-1 as an example):
- It seems that the leaderboard is sorted by `accuracy` instead of `truthfulness`, which contradicts the documentation.
- The calculation of `truthfulness` does not seem to be correct. A hallucination incurs a negative score, so `truthfulness` should be negative for a baseline with a high hallucination rate (a sketch of this scoring is at the end of this post). Running `local_evaluation` in the start kit gives a reasonable result, shown below:
But I don't know how the `truthfulness` on the leaderboard is calculated; it seems strange.
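For reference, here is a minimal sketch of the scoring I would expect, assuming the rule described above: correct = +1, missing = 0, hallucination = -1, with `truthfulness` being the mean of these per-sample scores while `accuracy` only counts correct answers. The function name and the exact +1/0/-1 values are my assumptions, not the official evaluator:

```python
# Minimal sketch of the scoring I would expect, based on the documented
# rule that hallucinations are penalized (assumed values, not the
# official evaluator): correct = +1, missing = 0, hallucination = -1.

def score(n_correct: int, n_missing: int, n_hallucination: int) -> dict:
    """Return accuracy and truthfulness for the given answer counts."""
    total = n_correct + n_missing + n_hallucination
    # accuracy ignores the hallucination penalty entirely
    accuracy = n_correct / total
    # truthfulness averages +1 / 0 / -1 per sample, so it goes negative
    # once hallucinations outnumber correct answers
    truthfulness = (n_correct - n_hallucination) / total
    return {"accuracy": accuracy, "truthfulness": truthfulness}

# Example: a baseline that always answers but hallucinates often.
print(score(n_correct=30, n_missing=0, n_hallucination=70))
# {'accuracy': 0.3, 'truthfulness': -0.4}  <- negative, as expected
```

Under this scheme a high-hallucination baseline can still rank well by `accuracy`, which is why sorting the leaderboard by `accuracy` instead of `truthfulness` matters.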