Solution Summary (8th) and Thoughts

Hey everyone! I wanted to share my solution, and to hear what everyone else did, before I forget the details. My solution isn’t anything special: mostly a tuned baseline with a few tricks. I also want to thank the organizers from AICrowd and OpenAI, and AWS for the credits! I certainly learnt a lot from this competition.

Algorithm

I used PPO for training. I also tried IMPALA and APEX, but PPO outperformed both.

Environment

I used a modified framestack wrapper that takes the first and last of every 4 frames. This let me capture temporal information with half the data of a full stack, and it worked quite a bit better than both no framestack and the full 4-frame stack.
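A minimal sketch of such a wrapper (the class name and the duck-typed env interface are my own for illustration; the real wrapper would presumably subclass gym.Wrapper):

```python
from collections import deque

import numpy as np


class FirstLastFrameStack:
    """Keep a rolling window of `size` frames, but only stack the oldest
    and newest of them: similar temporal span to a full stack, half the data."""

    def __init__(self, env, size=4):
        self.env = env  # any object with reset()/step() returning numpy frames
        self.size = size
        self.frames = deque(maxlen=size)

    def reset(self):
        frame = self.env.reset()
        # Fill the window with the initial frame so the shape is valid immediately.
        for _ in range(self.size):
            self.frames.append(frame)
        return self._observation()

    def step(self, action):
        frame, reward, done, info = self.env.step(action)
        self.frames.append(frame)
        return self._observation(), reward, done, info

    def _observation(self):
        # Channel-stack only the oldest and newest frame in the window.
        return np.concatenate([self.frames[0], self.frames[-1]], axis=-1)
```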

I tried data augmentations across the framestack, but this did not help PPO/IMPALA performance. It did slightly improve APEX performance, though APEX still performed worse than PPO. (Augmentations tested: none, all, flip, rotate, crop-resize, translate, pad-crop.)

Model

I’d say my biggest improvement to performance came from adjusting the model. Increasing the model width to [32, 64, 64] channels drastically improved performance. Pooling the final layer also helped, since beforehand most of the model’s parameters were in the penultimate hidden layer. I tried a few other network variations (self-attention, SE blocks, a pretrained resnet/mobilenet, depthwise convs, deeper and wider nets), but strangely the performance was always slightly worse. The weirdest result was replacing the maxpool layer with a stride-2 conv, which completely destroyed training; if anyone else saw this and knows why, please let me know. A good learning rate schedule also helped here.
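To see why pooling the final layer matters, here is a back-of-the-envelope parameter count comparing flatten vs. global-average-pool before the dense layer (the feature-map and hidden sizes here are my assumptions for illustration, not the exact model above):

```python
# Assumed shapes: 64x64 Procgen frames, three 2x downsamples -> 8x8x64
# feature map, followed by a 256-unit hidden layer.
final_h, final_w, final_c = 8, 8, 64
hidden = 256

# Flattening feeds h*w*c units into the dense layer...
flatten_params = final_h * final_w * final_c * hidden

# ...while global average pooling feeds only c units into it.
pooled_params = final_c * hidden

print(flatten_params, pooled_params, flatten_params // pooled_params)
```

With these assumed sizes, the flattened dense layer alone holds over a million weights, while pooling cuts that factor of 64 out of the parameter budget.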

Reward

I tried reward shaping (np.sign(r)*(np.sqrt(np.abs(r)+1)-1)+0.001*r) and it seemed to help locally with PPO but on larger tests, performance didn’t improve.
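Written out as a function, the shaping above is a signed square-root squash plus a small linear term that preserves the sign and ordering of rewards:

```python
import numpy as np


def shape_reward(r):
    """Signed square-root squash of the reward, plus a small linear term
    so large rewards still dominate (the formula tried above)."""
    return np.sign(r) * (np.sqrt(np.abs(r) + 1.0) - 1.0) + 0.001 * r
```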

I tried adding an intrinsic reward signal (Random Network Distillation) to help exploration and speed up training, but performance remained approximately the same. However, I didn’t get the original RND setup working, where you ignore episode boundaries so that exploration carries across episodes; the authors say that was important, so it might have been the issue.

Other parameters

I ran a few grid searches and tested out a few parameter combinations, resulting in the following:

gamma: 0.99
kl_coeff: 0.2
kl_target: 0.01
lambda: 0.9
rollout_fragment_length: 200
train_batch_size: 8192
sgd_minibatch_size: 1024
num_sgd_iter: 3
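For concreteness, these could be dropped into an RLlib-style config dict (a sketch, not my exact training script; the keys follow the older ray[rllib] releases used in this competition and may differ in newer versions):

```python
# Hyperparameters from the grid searches above, in RLlib PPO config form.
ppo_config = {
    "gamma": 0.99,                    # discount factor
    "kl_coeff": 0.2,                  # initial KL penalty coefficient
    "kl_target": 0.01,                # target KL divergence per update
    "lambda": 0.9,                    # GAE lambda
    "rollout_fragment_length": 200,   # steps collected per worker fragment
    "train_batch_size": 8192,         # samples per training iteration
    "sgd_minibatch_size": 1024,       # minibatch size for SGD epochs
    "num_sgd_iter": 3,                # SGD passes over each train batch
}
```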

Things that didn’t work

  • weight decay
  • different network architectures
  • sticky actions
  • image augmentations
  • intrinsic reward
  • network normalisations

Biggest improvements (descending)

  • Model width and pooling
  • Modified framestack
  • Good hparams

Problems

  • For a while I tried to upgrade ray to the latest version but things kept breaking and my submissions all failed.
  • Pytorch performance was terrible compared to tensorflow, so I couldn’t use pytorch; I think this is related to the ray version.
  • It took a while to get used to the sagemaker pipeline, but I eventually got a script working that could deploy models from my local computer.

Questions

  • Did anyone use a recurrent net to good effect?
  • Did anyone use a custom/different algorithm? I saw some trying PPG.
  • Did anyone get data augmentation working?
  • Did anyone use a pretrained network?
  • What led to the biggest improvements in performance for you?
  • My toughest environment was plunder (score ~10). Did anyone get good performance on this? What was your toughest env?

Thank you for sharing! I will write up a small bit about what I did as well. I find it so interesting that you scored so well with your solution: I implemented everything you did, but was unable to improve my score much. I implemented a deeper version of impala (talked about briefly in the coinrun paper) with channels of [32, 64, 64, 64, 64] and [32, 64, 128, 128, 128], with Fixup Initialization (https://arxiv.org/abs/1901.09321). I think I may have gone overboard tinkering with model architectures, but I was convinced by this network since I did so well in the warm up round (and semi-decently in round 1) with it. I did try shallower networks for a few submissions, but got worse scores every time. I used pytorch, and I’m now thinking that may have had some effect on performance that I was missing. I wish I had implemented it in tensorflow as well.

I tried impala, apex and rainbow as well, but found PPO worked best and was the most consistent.

I implemented the improvements from this paper: https://arxiv.org/pdf/2005.12729.pdf. Namely normalized rewards, orthogonal initialization and learning rate annealing. I played with all sorts of different LR schedules and entropy coefficient schedules. Lots and lots of hyperparameter tuning. I also tried a number of different reward scaling schemes, including log scales and intrinsic bonuses (histogram-based curiosity, bonuses for collecting rewards quickly, penalties for dying). None of these paid off when evaluated across all environments, even though some individual environments showed improvement.
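For anyone curious, orthogonal initialization can be sketched in plain numpy via a QR decomposition (a generic implementation for illustration, not the exact code from the paper or my submission):

```python
import numpy as np


def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Return a (rows, cols) weight matrix with orthonormal rows or columns,
    scaled by `gain`, via QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(0) if rng is None else rng
    big, small = max(rows, cols), min(rows, cols)
    a = rng.standard_normal((big, small))
    q, r = np.linalg.qr(a)        # q: (big, small), orthonormal columns
    q = q * np.sign(np.diag(r))   # sign fix for a uniform distribution
    if rows < cols:
        q = q.T                   # wide matrices get orthonormal rows
    return gain * q
```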

I played with all sorts of framestacks. All combinations of 4, 3, 2 and skipped frames. I also found good success with a frame difference stack. I think this helps a lot more in some environments than others.
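A frame difference stack along the lines of what I mean (a hypothetical minimal version; the exact layout varied between my experiments):

```python
import numpy as np


def frame_difference_stack(frames):
    """Concatenate the newest frame with the differences between consecutive
    frames, emphasising motion rather than repeating static pixels."""
    # Cast to a signed type so differences of uint8 frames don't wrap around.
    frames = [f.astype(np.int16) for f in frames]
    diffs = [frames[i + 1] - frames[i] for i in range(len(frames) - 1)]
    return np.concatenate([frames[-1]] + diffs, axis=-1)
```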

I also did some clever environment wrapping. I replicated sequential_levels and saw great success in some environments and terrible performance in others; I wasn’t able to identify why. I also played with some “checkpointing” ideas. These didn’t work in the end, but there are some cool ideas there for sure.

I implemented a generalized version of action reduction. It worked really well for the warm up round and round 1, when there were fewer edge cases to consider, but my solution ended up being brittle across all environments (especially the private environments). I think there’s a lot of potential for future research with action reduction especially; it’s very promising and showed a drastic performance increase.
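In spirit, action reduction is just a thin wrapper exposing a whitelisted subset of the full action space (a hypothetical minimal version; my actual generalized implementation handled more edge cases):

```python
class ActionReduction:
    """Expose only `allowed_actions` to the agent, mapping the reduced
    action index back to the full action space on step()."""

    def __init__(self, env, allowed_actions):
        self.env = env  # any object with reset()/step()
        self.allowed_actions = list(allowed_actions)
        self.n_actions = len(self.allowed_actions)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        # `action` indexes into the reduced set, not the full action space.
        return self.env.step(self.allowed_actions[action])
```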

And lastly, I implemented some ensembling techniques that I’m writing a paper on.

I did explore augmentations as well. Crops, scaling and color jitter. I think this is a technique that would have helped the generalization track and not so much on sample efficiency. For the extra computation time, it did not seem worth it to add. But for generalization, it may be super important. Who knows. I really think the generalization should have been given some more thought to this competition. It feels like an afterthought, when it really should have been a primary goal for this round.

The hardest part of this competition was optimizing for so many different environments. There was a lot of back and forth where I would zero in on my weaknesses and focus on plunder or bigfish, and could get upwards of 18 for plunder and 25 for bigfish, but then miner and starpilot would suffer. This happened so many times. I think implementing a method for testing on all environments locally would have been huge, instead of the manual approach I was using.

Notably, I did struggle with the private environments, hovercraft and safezone. My guess is these were the biggest detractors to my score with returns of ~2.3 and ~1.7 respectively. I’m really hoping these environments are released so that I can see what’s going on.

While I ended up in 11th place, I think I ended up in 1st for the most submissions (sorry for using up your compute budget AICrowd) :slight_smile:

Can’t wait to read more about everyone’s solutions. It’s been real fun to follow everyone’s progress and to have a platform to try so many ideas. I think procgen is really an amazing competition environment. Looking forward to the next one.
