I would like to draw the organizers' attention to the fact that Llama 3 has a larger vocabulary (128K) than Llama 2 (32K). The rules should therefore clearly define which tokenizer is used to truncate the response (previously the code used the Llama 2 tokenizer).
Best
Fanyou
@wufanyou: The starter kit already includes the tokenizer we use on the evaluator to limit the maximum number of tokens in the response.
@aicrowd_team Yes, I understand that the code already includes this tokenizer. But Llama 3 has a different vocabulary size (128K vs. 32K). For the same output text, the Llama 3 tokenizer will in some cases produce fewer tokens than the Llama 2 tokenizer, so more of the text would fit under the same token limit. In terms of model performance, Llama 3 is better (according to its report), and I expect people will use it. So I suggest replacing the current tokenizer used for truncating predictions with Llama 3’s.
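To illustrate the point, here is a minimal sketch of token-based truncation. It is not the evaluator's code: `truncate_to_max_tokens`, the two toy tokenizers, and the limit of 3 are all hypothetical stand-ins, chosen only to show that the same limit keeps a different amount of text depending on how coarsely the tokenizer splits it.

```python
from typing import Callable, List


def truncate_to_max_tokens(
    text: str,
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    max_tokens: int,
) -> str:
    """Keep only the first `max_tokens` tokens of `text` (hypothetical helper)."""
    tokens = encode(text)
    return decode(tokens[:max_tokens])


# Toy stand-in for a tokenizer with a larger vocabulary: one token per word.
def coarse_encode(text: str) -> List[str]:
    return text.split(" ")


def coarse_decode(tokens: List[str]) -> str:
    return " ".join(tokens)


# Toy stand-in for a tokenizer with a smaller vocabulary: one token per character.
def fine_encode(text: str) -> List[str]:
    return list(text)


def fine_decode(tokens: List[str]) -> str:
    return "".join(tokens)


response = "the answer is forty two"
limit = 3  # illustrative value, not the challenge's actual limit

coarse_result = truncate_to_max_tokens(response, coarse_encode, coarse_decode, limit)
fine_result = truncate_to_max_tokens(response, fine_encode, fine_decode, limit)

print(repr(coarse_result))  # 'the answer is'
print(repr(fine_result))    # 'the'
```

With the same limit, the coarser tokenizer keeps three words while the finer one keeps three characters, which is why the choice of truncation tokenizer should be stated in the rules.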
Hi wufanyou,
We use the same tokenizer for both Llama 2 and Llama 3 models so that the comparison is fair.
Thanks,
The CRAG Team