Competition metric seems to favor sample efficiency over generalization gap

As far as I understand, the Procgen benchmark was made to address the generalization gap. If mean normalized reward over a small number of timesteps is used as the metric, is it not more likely to favor sample efficiency than generalization? If current algorithms cannot achieve the maximum training reward in 8M timesteps, how do we disentangle sample efficiency from the generalization gap?
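For concreteness, the mean normalized reward metric min-max scales each environment's raw return before averaging across environments; a minimal sketch of that computation (the per-environment bounds and returns below are illustrative placeholders, not the published Procgen values):

```python
def normalized_return(raw, r_min, r_max):
    """Min-max scale a raw episode return into [0, 1]."""
    return (raw - r_min) / (r_max - r_min)

# Illustrative (min, max) return bounds per environment -- placeholders only.
bounds = {"coinrun": (0.0, 10.0), "starpilot": (1.5, 35.0)}
raw_returns = {"coinrun": 7.0, "starpilot": 20.0}

# Mean normalized reward across environments.
mean_norm = sum(
    normalized_return(raw_returns[env], *bounds[env]) for env in bounds
) / len(bounds)
```

With such a metric, the score reported at 8M timesteps reflects both how quickly the agent learns and how well it transfers, which is exactly the coupling the question raises.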

I understand there must have been practical considerations behind selecting the 8M timestep threshold. I would like to know the community’s and the organizers’ thoughts on this.


The following is from the point of view of the competition. I agree it might be difficult to disentangle a method’s benefits in sample efficiency from those in generalization under this limit.

From a practical point of view, it levels the playing field (somewhat) by restricting the amount of training data. In the Unity Obstacle Tower challenge, for example, one team trained for billions of timesteps, which was simply out of reach for competitors without access to a 50+ core, multi-GPU setup. Teams with that hardware can still use it for faster evaluation of ideas/models/hyperparameters, though.

This also encourages ideas other than “just brute-force it” with compute. It is easier to try out ideas when you know the limit of how long you should train :slight_smile:


@dipam_chakraborty You are correct that the current competition metrics do somewhat favor sample efficiency over generalization. The decision to restrict training to 8M timesteps when measuring generalization is largely a practical one. If compute were not restricted, we would prefer to train agents for as much time as is required for performance to converge (when measuring generalization). Since we must restrict compute, this does indeed lead to an unfortunate coupling of sample efficiency and generalization.

You mention that “Procgen benchmark was made to address the generalization gap.” This is true, but only partially. Yes, the Procgen benchmark allows us to explicitly measure the generalization gap. But the extensive use of procedural generation also makes the environments very well-suited to evaluating sample efficiency. Improving sample efficiency is a more interesting problem when the agent is constantly facing new experiences, rather than repeatedly encountering the same ones. So, in addition to what you mentioned, we also created the Procgen benchmark with this in mind – in these environments, the need to generalize is baked into the sample efficiency problem.
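The point about procedural generation can be made concrete: when an agent trains on a finite set of levels and is evaluated on held-out ones, the generalization gap is simply the difference between mean return on training levels and mean return on unseen levels. A minimal sketch (the returns below are made-up numbers, not Procgen results):

```python
# Episode returns on levels seen during training (made-up values).
train_returns = [8.2, 7.9, 8.5]
# Episode returns on unseen, held-out levels (made-up values).
test_returns = [6.1, 5.8, 6.4]

mean_train = sum(train_returns) / len(train_returns)
mean_test = sum(test_returns) / len(test_returns)

# A positive gap means the agent overfits to its training levels.
generalization_gap = mean_train - mean_test
```

With unrestricted procedural generation, nearly every episode is a fresh level, so "train" performance already behaves like the held-out term above.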

Since even the sample efficiency metric provides an implicit evaluation of generalization, we find it reasonable, albeit not ideal, for the competition to slightly favor this metric.


Thank you @Miffyli and @kcobbe for the insightful and helpful explanations. Looking forward to the competition.