Generalization scores seem very similar to sample-efficiency scores, and for some teams are even higher, which seems unlikely given that generalization is strictly harder than sample efficiency. Does anybody know how the generalization scores were calculated?
During training, the agent is exposed to 200 game levels and 8M environment timesteps. Rollouts are then performed for 1000 episodes (each episode on a different game level), with no cap on the number of distinct game levels. This was done on all 16 public + 4 private environments.
We evaluated every submission three times as described above and the final reported scores were the best of the three evaluations.
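The protocol above can be sketched roughly as follows. This is a hypothetical illustration, not the actual evaluation code: `rollout_score` is a stand-in for running the submitted agent on one sampled level, and the environment names, seeds, and score scale are all placeholders.

```python
import random

NUM_EPISODES = 1000   # rollout episodes per evaluation (from the answer above)
NUM_EVALS = 3         # each submission is evaluated three times
ENVIRONMENTS = [f"env_{i}" for i in range(20)]  # stand-ins for 16 public + 4 private envs

def rollout_score(env_name, rng):
    # Hypothetical stand-in for running the submitted agent on one episode
    # (i.e. one freshly sampled game level) and returning its episode return.
    return rng.random()

def evaluate(env_name, seed):
    """Mean episode return over NUM_EPISODES rollouts, each on a newly
    sampled level (no cap on the number of distinct levels)."""
    rng = random.Random(seed)
    returns = [rollout_score(env_name, rng) for _ in range(NUM_EPISODES)]
    return sum(returns) / len(returns)

def final_score(env_name):
    """Best of NUM_EVALS independent evaluations, as reported."""
    return max(evaluate(env_name, seed) for seed in range(NUM_EVALS))

scores = {env: final_score(env) for env in ENVIRONMENTS}
```

Note that taking the best of three evaluations introduces a small upward bias relative to a single evaluation, which could partly explain generalization scores that look surprisingly close to sample-efficiency scores.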