The PyTorch IMPALA baseline fails but then continues to run. Its reported throughput stays below 100 and occasionally even goes negative, yet the dashboard log shows the model finished training in about an hour at a reasonable pace. I've also been seeing low throughput with my own PyTorch models. Has anyone else been experiencing similar issues?
Do you mean low throughput on the evaluation server, or locally as well?
In any case, @jyotish: do you want to share the throughput spreadsheet you prepared for tuning num_workers etc. for the baselines?
It might be helpful for participants to get a rough baseline of what throughput to expect in different scenarios.
Hello @gregory_eales,
The “fails but then continues” behaviour is because we re-evaluated your submission: it failed the first time due to an internal glitch.
Below is the throughput vs. Ray worker configuration on an evaluation node (1 P100 GPU, 8 vCPUs) for the IMPALA baseline (TensorFlow version).
| throughput | workers | envs_per_worker | cpus_per_worker |
|---|---|---|---|
| 757.76 | 2 | 2 | 1 |
| 923.45 | 4 | 2 | 1 |
| 993.13 | 6 | 2 | 1 |
| 1006.23 | 5 | 2 | 1 |
| 1107.86 | 7 | 2 | 1 |
| 1109.08 | 2 | 4 | 1 |
| 1362.10 | 4 | 4 | 1 |
| 1409.11 | 5 | 4 | 1 |
| 1457.70 | 6 | 4 | 1 |
| 1460.45 | 2 | 8 | 1 |
| 1534.67 | 7 | 4 | 1 |
| 1613.77 | 2 | 12 | 1 |
| 1732.68 | 4 | 8 | 1 |
| 1735.01 | 5 | 8 | 1 |
| 1756.76 | 6 | 8 | 1 |
| 1803.12 | 2 | 20 | 1 |
| 1811.49 | 7 | 8 | 1 |
| 1824.60 | 5 | 12 | 1 |
| 1827.74 | 4 | 12 | 1 |
| 1831.15 | 2 | 16 | 1 |
| 2035.54 | 4 | 16 | 1 |
| 2106.67 | 4 | 20 | 1 |
| 2108.47 | 5 | 16 | 1 |
| 2128.37 | 6 | 12 | 1 |
| 2206.31 | 6 | 16 | 1 |
| 2218.84 | 7 | 12 | 1 |
| 2224.17 | 7 | 16 | 1 |
| 2243.45 | 5 | 20 | 1 |
| 2291.23 | 6 | 20 | 1 |
| 2329.46 | 7 | 20 | 1 |
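For anyone who wants to try these settings themselves, here is a minimal sketch of how the knobs above map onto an RLlib IMPALA run. This is not the starter kit's exact config file; the env name, stop condition, and other hyperparameters are placeholders you should replace with your own.

```python
import ray
from ray import tune

ray.init()

tune.run(
    "IMPALA",
    stop={"timesteps_total": 8_000_000},      # placeholder stop condition
    config={
        "env": "procgen:procgen-coinrun-v0",  # placeholder env id, adjust to yours
        "num_gpus": 1,                        # the evaluation node has a single P100
        "num_workers": 6,                     # "workers" column in the table above
        "num_envs_per_worker": 20,            # "envs_per_worker" column
        "num_cpus_per_worker": 1,             # "cpus_per_worker" column
        # "use_pytorch": True,                # uncomment for the PyTorch implementation
    },
)
```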
I just want to confirm that the PyTorch IMPALA baseline does seem bugged: no changes, the same model included in the starter kit, `use_pytorch` set to `True` in the YAML config, and the throughput is extremely low.
Here is something I see in the metrics logs:
(pid=102) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
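For context, that warning just means a read-only NumPy array was passed to `torch.from_numpy` somewhere in the pipeline. If you hit it in your own preprocessing code, copying the array first makes it go away; a minimal standalone illustration (not a patch to the baseline):

```python
import numpy as np
import torch

obs = np.zeros((64, 64, 3), dtype=np.uint8)
obs.setflags(write=False)          # simulate a non-writeable observation array

t1 = torch.from_numpy(obs)         # emits the UserWarning shown above
t2 = torch.from_numpy(obs.copy())  # copying first avoids the warning entirely
```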
I am also experiencing the same issues. Any idea if this will be fixed for PyTorch?
@jyotish Downgrading to 1.3.x works locally, thanks.
How do you make sure it installs 1.3.x when submitting? I added `torch==1.3.1` to `requirements.txt` (since `RUN pip install -r requirements.txt --no-cache-dir` is called in the Dockerfile), but `print(torch.__version__)` in the IMPALA agent still gave me 1.5.x.
Hello @mtrazzi,
You need to set `"docker_build": true` in your `aicrowd.json` file. Without this, we will use the default image and there will not be any pip installs from your `requirements.txt`.
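Once `docker_build` is enabled, a quick sanity check near the top of your agent can catch the case where the pin did not actually take effect; a small sketch, not part of the starter kit:

```python
import torch

# Fail fast if the evaluation image was built without the pinned requirement.
assert torch.__version__.startswith("1.3."), (
    "Expected torch 1.3.x, got %s" % torch.__version__
)
```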