Regarding the scoring criteria for these four games, is the evaluation methodology based on the Orak benchmark paper?
2. Remote Mode Episode Configuration
When running the games in remote mode, how many episodes are executed for each game? Is it always three episodes per game? For StarCraft II specifically, does it run four rounds (or episodes) instead? And for games like 2048, is the final score the average over three rounds (or episodes)?
I have a few questions about the credit and scoring calculation.
Currently, the total score seems to be computed by multiplying the raw game score directly by its weight. If that’s the case, does it mean the score does not need to be normalized to a 0–100 scale, using the conversion described in the Orak benchmark paper, before the weight is applied? Or is a different calculation method being used?
Regarding the baseline for claiming compute credits: in Track 1, for example, the baseline for 2048 is listed as 0.5 points. If my 2048 submission shows a score of 4 points, have I already exceeded the baseline? Or should the score be normalized first, e.g. 4 / 20000 = 0.0002, in which case it would not exceed the baseline?
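To make the readings I am asking about concrete, here is a minimal sketch of the comparison. The denominator 20000 is only my assumption for the 2048 normalization, and the helper `normalize` is hypothetical; neither comes from the official evaluation code.

```python
# Sketch of the possible ways to compare a 2048 score against the baseline.
# ASSUMED_MAX and normalize() are my own assumptions, not the official method.

def normalize(raw_score: float, max_score: float) -> float:
    """Scale a raw score to a 0-100 range, given an assumed maximum."""
    return 100.0 * raw_score / max_score

RAW_SCORE = 4.0        # score shown for my 2048 submission
BASELINE = 0.5         # Track 1 baseline listed for 2048
ASSUMED_MAX = 20000.0  # assumption: normalization denominator for 2048

# Reading A: the displayed score is compared to the baseline as-is.
beats_raw = RAW_SCORE > BASELINE

# Reading B: the score is first divided by the assumed maximum.
fraction = RAW_SCORE / ASSUMED_MAX
beats_fraction = fraction > BASELINE

# Reading B on a 0-100 scale instead of a 0-1 fraction.
percent = normalize(RAW_SCORE, ASSUMED_MAX)
beats_percent = percent > BASELINE

print(beats_raw, beats_fraction, beats_percent)
```

Under reading A the submission is above the baseline; under either form of reading B it is not, which is why I would like to know which comparison is actually used.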
If the scoring does follow the conversion formula from the Orak benchmark paper, could someone clarify the exact score normalization formula for Super Mario, with concrete numbers showing how the normalized score is calculated? And for StarCraft, is it correct that four evaluation runs (as in the benchmark paper) are performed and then aggregated? Or, for this challenge, is the game evaluated over three rounds, with each round contributing one third (33.33%) of StarCraft’s final score?
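For the aggregation part, this is the kind of calculation I have in mind: a plain equal-weight average over the rounds, which is the same as each round contributing one third when there are three rounds. The per-round values below are made-up placeholders, and `aggregate` is a hypothetical helper, not the challenge's actual code.

```python
# Sketch of equal-weight aggregation over evaluation rounds.
# The scores are placeholders; aggregate() is a hypothetical helper.
from statistics import mean

def aggregate(run_scores):
    """Return the equal-weight average of per-round scores."""
    return mean(run_scores)

three_rounds = [30.0, 60.0, 90.0]  # placeholder per-round scores

# Averaging three rounds ...
avg = aggregate(three_rounds)

# ... is the same as giving each round a 1/3 share of the final score.
weighted = sum(score / len(three_rounds) for score in three_rounds)

print(avg, weighted)
```

If four runs are used instead, the same function would give each run a one-quarter share, so the only thing I need confirmed is the number of runs and whether the average is taken before or after normalization.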