The PyTorch IMPALA baseline fails but then continues to run. It runs at a throughput below 100, which occasionally even goes negative, even though the dashboard log shows the model finished training in about an hour at a reasonable pace. I've also been seeing low throughput with my own PyTorch models. Has anyone else experienced similar issues?
Do you mean low throughput on the evaluation server, or locally as well?
In any case, @jyotish: do you want to share the throughput spreadsheet you prepared to tune num_workers etc. for the baselines?
It might be helpful for participants to get a rough baseline of what throughput to expect in different scenarios.
The “fails but then continues” is because we re-evaluated your submission as it failed the first time due to an internal glitch.
The following is the throughput vs. Ray worker configuration on an evaluation node (1 P100 GPU and 8 vCPUs) for the IMPALA baseline (TensorFlow version).
I just want to confirm that the PyTorch IMPALA baseline model does seem to be bugged. No changes, same model included in the starter kit, with `use_pytorch` set to true in the YAML config. Extremely low throughput.
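For context, this is the kind of experiment YAML being discussed. A minimal sketch using RLlib's config keys; the experiment name and values here are illustrative assumptions, not copied from the starter kit:

```yaml
# Hypothetical sketch of an RLlib experiment config; the experiment name
# and values are illustrative, not the starter kit's actual file.
impala-baseline:
    run: IMPALA
    config:
        use_pytorch: true   # switch from the TF model to the PyTorch model
        num_workers: 7      # rollout workers; tune to the node's vCPU count
        num_gpus: 1         # the evaluation node has a single P100
```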
Here is something I see in the metrics logs:

```
(pid=102) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
```
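For anyone hitting the same warning: it typically means a read-only NumPy array (for example, an observation buffer the environment shares) is being passed to `torch.from_numpy`, which would create a tensor sharing memory with the read-only array. Copying the array first avoids the warning. A minimal sketch, NumPy only, with the `torch.from_numpy` call left as a comment so the snippet runs without PyTorch installed:

```python
import numpy as np

# Simulate a read-only array, such as a shared observation buffer.
obs = np.arange(4, dtype=np.float32)
obs.setflags(write=False)

# torch.from_numpy(obs) would emit the "non-writeable" UserWarning here,
# since the tensor would alias the read-only buffer. Copying first gives
# PyTorch a writeable buffer of its own:
safe_obs = obs.copy()

assert safe_obs.flags.writeable          # the copy is writeable
assert np.array_equal(safe_obs, obs)     # contents are unchanged
```

Whether this explains the low throughput is a separate question, but silencing the warning this way at least rules it out.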