How to find subtle implementation details

A very fun (and also annoying) part of deep learning is that subtle implementation differences are very easy to miss. For example, there are differences in default values between Torch and TensorFlow, or between RLlib and OpenAI Baselines … some of these differences actually helped boost performance significantly in Round 1.

Is there any good way to find these subtle details effectively, apart from thoroughly reading code and writing small unit tests?

Good question. I think reading code, reading research papers, and experimenting is the only way. But with your post here, I'm left wondering whether I missed something in the Torch/TF implementation differences, since you ended Round 1 with such a good score!

Haha, sorry if I left you hanging there. These are the differences between Torch and TF that I found can make a difference when used in the right way: Xavier vs. Kaiming initialization, zero vs. non-zero bias initialization, and the epsilon parameter in Adam. Interestingly, I only noticed the differences after moving from Torch to TensorFlow, because Ray still uses placeholders for TensorFlow, which made it a pain to work with; but my Torch code was slower than TensorFlow, so I had to give up some compute time.
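To make those three knobs concrete, here is a minimal PyTorch sketch (layer sizes and hyperparameters are illustrative, not the actual Round 1 configuration) that switches to Glorot/Xavier weights, zero biases, and a larger Adam epsilon:

```python
import torch
import torch.nn as nn

def tf_style_init(m):
    """Re-initialize Conv2d/Linear layers the way tf.keras does by default."""
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)  # Keras default: glorot_uniform
        if m.bias is not None:
            nn.init.zeros_(m.bias)         # Keras default: zero bias
                                           # (PyTorch draws biases from a uniform range)

# Toy network, assuming 64x64x3 Procgen frames and 15 discrete actions.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 31 * 31, 15),
)
model.apply(tf_style_init)

# Adam epsilon: PyTorch defaults to 1e-8, tf.keras to 1e-7, and baselines-style
# PPO implementations commonly use 1e-5 -- a small number with a surprisingly large effect.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, eps=1e-5)
```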

We have a comparison between the IMPALA baseline provided in the starter kit and OpenAI's Procgen baseline here.


Hello @jyotish

Yeah, you guys did a really awesome job of matching the baseline, though you left out some details (intentionally? :sweat_smile:). I was actually surprised by how many small one-line optimizations are hidden in the OpenAI Baselines version … though not all of them helped in my case.

This one, I believe, is a bug, but I haven't gotten any response about it yet. Deep neural nets are so awesome that everything works fine even with this bug present.

Hello everyone, I observe a significant performance difference between PyTorch and TF under the same experimental settings.
Following @ttom, @dipam_chakraborty and @jyotish, I tried to reproduce the 8th-place solution from Solution Summary (8th) and Thoughts with PyTorch. However, with the same settings, i.e. the default PPO algorithm, NN model, weight init, hyperparameters, and epsilon parameter in Adam, TF significantly outperforms PyTorch on Procgen; e.g. TF scores 20 more than PyTorch on starpilot. What's more, TF runs about 15 minutes faster than PyTorch when training for 8M environment timesteps. I tried different binaries of ray (0.8.6, 0.8.7) and versions of PyTorch (1.4.0, 1.5.0, 1.6.0, 1.7.0), but got the same results.
I also notice that TF and PyTorch have different padding strategies; see https://stackoverflow.com/questions/61422046/resnet-model-of-pytorch-and-tensorflow-give-different-results-when-stride-2 (a sketch of replicating TF's "SAME" padding in PyTorch follows below).
Are there any suggestions for this problem? Did anyone get high scores in the competition with PyTorch?
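For reference, here is a rough sketch of how one might replicate TF's "SAME" padding for a stride-2 convolution in PyTorch (the helper name is made up). TF pads the extra row/column on the bottom/right, while `nn.Conv2d(padding=1)` pads symmetrically on all sides:

```python
import math
import torch
import torch.nn.functional as F

def conv2d_tf_same(x, weight, bias=None, stride=2):
    # Pad so the output size is ceil(input / stride), putting any odd
    # row/column on the bottom/right, as TF's "SAME" padding does.
    kh, kw = weight.shape[-2:]
    ih, iw = x.shape[-2:]
    pad_h = max((math.ceil(ih / stride) - 1) * stride + kh - ih, 0)
    pad_w = max((math.ceil(iw / stride) - 1) * stride + kw - iw, 0)
    x = F.pad(x, (pad_w // 2, pad_w - pad_w // 2,   # left, right
                  pad_h // 2, pad_h - pad_h // 2))  # top, bottom
    return F.conv2d(x, weight, bias, stride=stride)

# e.g. a 64x64 Procgen frame through a 3x3 stride-2 conv -> 32x32, as in TF
x = torch.randn(1, 3, 64, 64)
w = torch.randn(16, 3, 3, 3)
print(conv2d_tf_same(x, w).shape)  # torch.Size([1, 16, 32, 32])
```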

Hi @lars12llt

Our full code is in PyTorch. I wrote entirely custom code in PyTorch for this competition, as I was completely unfamiliar with RLlib and wanted fine-grained control over the entire codebase. My implementation works by subclassing TorchPolicy in RLlib and writing the full training code in the learn_on_batch function. This admittedly removes RLlib's distributed-learning benefits, but it allowed me to get speed and scores with PyTorch comparable to TF. Sorry I haven't released the code yet; I will be doing that soon.
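For anyone curious what that looks like structurally, here is a very rough sketch (not the released competition code) of a TorchPolicy subclass whose whole PPO-style update lives in learn_on_batch. The batch keys and attributes (self.model, self.dist_class, self.device, the PPO config keys) follow RLlib 0.8.x / PPO conventions and are assumptions here; a real update would also loop over minibatches and epochs:

```python
import torch
from ray.rllib.policy.torch_policy import TorchPolicy


class HandRolledPPOTorchPolicy(TorchPolicy):
    """Sketch only: the whole update happens in learn_on_batch, bypassing
    RLlib's built-in loss/optimizer plumbing (and its distributed SGD)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Own optimizer, with a baselines-style Adam epsilon.
        self._my_optimizer = torch.optim.Adam(
            self.model.parameters(), lr=self.config["lr"], eps=1e-5)

    def learn_on_batch(self, samples):
        # Batch keys follow RLlib's PPO postprocessing conventions.
        obs = torch.as_tensor(samples["obs"], device=self.device).float()
        actions = torch.as_tensor(samples["actions"], device=self.device)
        old_logp = torch.as_tensor(samples["action_logp"], device=self.device)
        adv = torch.as_tensor(samples["advantages"], device=self.device)
        vtarg = torch.as_tensor(samples["value_targets"], device=self.device)

        # Forward pass and clipped surrogate objective.
        logits, _ = self.model({"obs": obs}, [], None)
        dist = self.dist_class(logits, self.model)
        ratio = torch.exp(dist.logp(actions) - old_logp)

        clip = self.config["clip_param"]
        policy_loss = -torch.min(ratio * adv,
                                 torch.clamp(ratio, 1 - clip, 1 + clip) * adv)
        value_loss = (self.model.value_function() - vtarg).pow(2)
        entropy = dist.entropy()

        loss = (policy_loss
                + self.config["vf_loss_coeff"] * value_loss
                - self.config["entropy_coeff"] * entropy).mean()

        self._my_optimizer.zero_grad()
        loss.backward()
        self._my_optimizer.step()
        return {"learner_stats": {"total_loss": loss.item()}}
```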


@lars12llt just for some extra data: I used PyTorch and RLlib's PPO. There seems to be a significant gap between where I ended up (11th place) and everyone else in the top 10. I did a lot of tuning too. My hunch is that PyTorch is the culprit.

@tim_whitaker

I agree. I also tried a lot of tuning in PyTorch, including BOHB and PBT search, but it did not help much. Maybe we should rewrite the full training code in RLlib as @dipam_chakraborty did, or switch to TensorFlow 1.x.


It could be the weight initialization: PyTorch uses a he_uniform variant by default, while TensorFlow uses glorot_uniform. Using TensorFlow with glorot_uniform I get a score of 42 on starpilot, while using TensorFlow with he_uniform I get 19.
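In tf.keras that comparison is literally a one-argument change (layer shapes here are illustrative):

```python
import tensorflow as tf

# glorot_uniform is the Keras default; he_uniform approximates PyTorch's
# default kaiming-uniform init for the comparison described above.
conv_glorot = tf.keras.layers.Conv2D(16, 3, kernel_initializer="glorot_uniform")
conv_he = tf.keras.layers.Conv2D(16, 3, kernel_initializer="he_uniform")
```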