@dipam_chakraborty You are correct that the current competition metrics do somewhat favor sample efficiency over generalization. The decision to restrict training to 8M timesteps when measuring generalization is largely a practical one. If compute were unrestricted, we would prefer to train agents until performance converges when measuring generalization. Since we must restrict compute, this does indeed lead to an unfortunate coupling of sample efficiency and generalization.
You mention that “Procgen benchmark was made to address the generalization gap.” This is true, but only partially. Yes, the Procgen benchmark allows us to explicitly measure the generalization gap, by training on a fixed set of levels and evaluating on held-out levels. But the extensive use of procedural generation also makes these environments well-suited to evaluating sample efficiency. Improving sample efficiency is a more interesting problem when the agent is constantly facing new experiences, rather than repeatedly encountering the same ones. So, in addition to what you mentioned, we also created the Procgen benchmark with this in mind – in these environments, the need to generalize is baked into the sample efficiency problem.
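To make the train/test split concrete: in the procgen library the set of training levels is controlled through env options such as `num_levels` and `start_level`, and the gap is the difference between normalized performance on seen versus held-out levels. The sketch below is purely illustrative – the helper functions, return values, and raw-return range are hypothetical stand-ins, not part of the benchmark's code.

```python
# Hypothetical sketch of measuring a generalization gap. In procgen, the
# training set of levels is selected via env kwargs (e.g. num_levels=200);
# test performance is then evaluated on the full level distribution.
# The functions and numbers here are illustrative, not the official code.

def normalized_score(mean_return, r_min, r_max):
    """Normalize a raw episode return into [0, 1] using a per-game range."""
    return (mean_return - r_min) / (r_max - r_min)

def generalization_gap(train_return, test_return, r_min, r_max):
    """Gap between normalized performance on training levels (seen
    during training) and held-out test levels (unseen)."""
    train = normalized_score(train_return, r_min, r_max)
    test = normalized_score(test_return, r_min, r_max)
    return train - test

# Illustrative numbers: a game with raw-return range [0, 10], where the
# agent scores 8.5 on its 200 training levels but 6.0 on unseen levels.
gap = generalization_gap(train_return=8.5, test_return=6.0,
                         r_min=0.0, r_max=10.0)
print(round(gap, 2))  # 0.25
```

A smaller gap at a fixed training budget indicates better generalization; the competition's coupling arises because that budget (8M timesteps) also bounds how sample-efficient the agent must be.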
Since even the sample efficiency metric provides an implicit evaluation of generalization, we find it reasonable, albeit not ideal, for the competition to slightly favor this metric.