Inquiry about Pokemon Red Benchmark Scoring and Task List

cheong_wei_xun · December 22, 2025, 2:33pm

Hi organizers,

My team and I are currently looking into the Pokemon Red benchmark and noticed some discrepancies we would like to clarify.
Regarding score normalization: completing just one task results in a score of 1.0. Could you confirm whether this is the intended behavior, or whether the normalization is incorrect?

Additionally, regarding the total number of tasks: the ORAK benchmark paper lists 12 tasks, and the evaluation code appears to allow a maximum of 12 points, but the md file and the challenge rules list only 7 tasks. Could you confirm the exact number of tasks for this competition? If it is 7, could you specify which 7 tasks are included so we can align our testing?
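
To make the expectation concrete, here is a minimal sketch of the normalization we assumed: completed tasks divided by the total task count. The function name `normalized_score` and its parameters are hypothetical and are not taken from the evaluation code; the totals 7 and 12 are just the two candidate counts mentioned above.

```python
def normalized_score(completed_tasks: int, total_tasks: int) -> float:
    """Return the fraction of tasks completed, clamped to [0.0, 1.0].

    Hypothetical helper for illustration only; the actual evaluation
    code may compute the score differently.
    """
    if total_tasks <= 0:
        raise ValueError("total_tasks must be positive")
    return max(0.0, min(1.0, completed_tasks / total_tasks))


if __name__ == "__main__":
    # One completed task under each candidate task count from this post.
    for total in (7, 12):
        print(f"1 of {total} tasks: {normalized_score(1, total):.3f}")
```

Under this assumption, one completed task would score well below 1.0 with either task count, which is why the observed 1.0 looked wrong to us.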

aicrowd_team · December 26, 2025, 10:44pm

Hey, thanks for pointing this out. This has now been fixed.