Hi organizers,
My team and I are currently looking into the Pokemon Red benchmark and noticed two discrepancies we’d like to clarify.
Regarding score normalization: completing just one task results in a score of 1.0. Could you confirm whether this is the intended behavior or whether the normalization is incorrect?
Additionally, regarding the total number of tasks: the ORAK benchmark paper lists 12 tasks, and the evaluation code appears to allow a maximum of 12 points, but the md file and the challenge rules list only 7 tasks. Could you confirm the exact number of tasks for this competition? If it is 7, could you specify which 7 tasks are included so we can align our testing?
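For reference, here is a minimal sketch of the normalization we expected. It assumes every task is worth one point and that the score is simply the number of completed tasks divided by the total. The function name `normalized_score` is our own and is not taken from the evaluation code.

```python
from fractions import Fraction


def normalized_score(completed_tasks: int, total_tasks: int) -> Fraction:
    """Fraction of tasks completed, ASSUMING each task is worth one point.

    This is our assumption about the intended scoring, not the
    benchmark's actual implementation.
    """
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    if not 0 <= completed_tasks <= total_tasks:
        raise ValueError("completed_tasks must be between 0 and total_tasks")
    return Fraction(completed_tasks, total_tasks)


if __name__ == "__main__":
    # Compare the two task counts we have seen (7 in the rules, 12 in the paper).
    for total in (7, 12):
        score = normalized_score(1, total)
        print(f"1 of {total} tasks -> {score} ({float(score):.3f})")
```

Under this assumption, one completed task would score 1/7 or 1/12 depending on the task count, and 1.0 would only be reached once every task is completed.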