So I was a little bored and decided to see how well I could play the procgen games myself.
Setup:
`python -m procgen.interactive --distribution-mode easy --vision agent --env-name coinrun`
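(Side note: if you'd rather script the reward bookkeeping than read it off the interactive window, procgen also registers regular gym environments. A rough sketch of what that loop could look like, with random actions standing in for the human player and the older 4-tuple gym step API assumed:)

```python
# Rough sketch: logging per-episode rewards through procgen's gym registration
# (random actions stand in for actual human play; assumes the older gym API
# where step() returns a 4-tuple).
import gym
import numpy as np

env = gym.make("procgen:procgen-coinrun-v0", distribution_mode="easy")

episode_returns = []
for _ in range(100):
    env.reset()
    done, ep_return = False, 0.0
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        ep_return += reward
    episode_returns.append(ep_return)

print("mean reward over 100 episodes:", np.mean(episode_returns))
```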
First I played each game for 5-10 episodes to figure out what the keys do, how each game works, etc.
Then I played each game 100 times and logged the rewards. Here are the results:
| Environment | Mean reward | Mean normalized reward |
|---|---|---|
| bigfish | 29.40 | 0.728 |
| bossfight | 10.15 | 0.772 |
| caveflyer | 11.69 | 0.964 |
| chaser | 11.23 | 0.859 |
| climber | 12.34 | 0.975 |
| coinrun | 9.80 | 0.960 |
| dodgeball | 18.36 | 0.963 |
| fruitbot | 25.15 | 0.786 |
| heist | 10.00 | 1.000 |
| jumper | 9.20 | 0.911 |
| leaper | 9.90 | 0.988 |
| maze | 10.00 | 1.000 |
| miner | 12.27 | 0.937 |
| ninja | 8.60 | 0.785 |
| plunder | 29.46 | 0.979 |
| starpilot | 33.15 | 0.498 |
The mean normalized score over all games was 0.882. My per-episode scores stayed relatively constant throughout the 100 episodes, i.e. I didn’t improve much while playing.
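For anyone wondering, “normalized” here is the usual Procgen normalization, (R - R_min) / (R_max - R_min) with the per-game easy-mode constants. A quick sanity-check sketch of how the table reduces to the 0.882 figure (the coinrun and bigfish constants are the ones I believe the paper lists; the list of normalized scores is just copied from the table above):

```python
# Sanity check on the aggregate: normalized score = (R - R_min) / (R_max - R_min),
# using the per-game easy-mode constants from the Procgen paper
# (e.g. coinrun uses (5, 10) and bigfish uses (1, 40), if I'm reading it right).
def normalize(r, r_min, r_max):
    return (r - r_min) / (r_max - r_min)

assert round(normalize(9.80, 5, 10), 3) == 0.960   # coinrun row above
assert round(normalize(29.40, 1, 40), 3) == 0.728  # bigfish row above

# Normalized scores copied straight from the table, in the same order.
normalized = [0.728, 0.772, 0.964, 0.859, 0.975, 0.960, 0.963, 0.786,
              1.000, 0.911, 0.988, 1.000, 0.937, 0.785, 0.979, 0.498]
print(sum(normalized) / len(normalized))  # ~0.882
```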
I’m not sure how useful this result is as a “human benchmark” though: I could easily reach a score of ~1.000 given enough time to think on each frame. Also, human visual reaction time is ~250 ms, which at 15 fps puts us roughly 4 frames behind on our actions (0.25 s × 15 frames/s ≈ 3.75 frames); that can matter in games like starpilot, chaser and some others.