Impala Pytorch Baseline Bug(s)

The PyTorch IMPALA baseline fails but then continues to run. Its throughput stays below 100 and occasionally even goes negative, even though the dashboard log shows the model finished training in about an hour at a reasonable pace. I've also been seeing low throughput with my own PyTorch models. Has anyone else experienced similar issues?

Do you mean low throughput on the evaluation server, or locally as well?

In any case, @jyotish: do you want to share the throughput spreadsheet you prepared to tune `num_workers` etc. for the baselines?
It might help participants get a rough idea of what throughput to expect in different scenarios.

Hello @gregory_eales,

The “fails but then continues” behavior is because we re-evaluated your submission after it failed the first time due to an internal glitch.

Below is throughput vs. Ray worker configuration on an evaluation node (1 P100 GPU, 8 vCPUs) for the IMPALA baseline (TensorFlow version).

| Throughput | workers | envs_per_worker | cpus_per_worker |
|-----------:|--------:|----------------:|----------------:|
| 757.8 | 2 | 2 | 1 |
| 923.4 | 4 | 2 | 1 |
| 993.1 | 6 | 2 | 1 |
| 1006.2 | 5 | 2 | 1 |
| 1107.9 | 7 | 2 | 1 |
| 1109.1 | 2 | 4 | 1 |
| 1362.1 | 4 | 4 | 1 |
| 1409.1 | 5 | 4 | 1 |
| 1457.7 | 6 | 4 | 1 |
| 1460.4 | 2 | 8 | 1 |
| 1534.7 | 7 | 4 | 1 |
| 1613.8 | 2 | 12 | 1 |
| 1732.7 | 4 | 8 | 1 |
| 1735.0 | 5 | 8 | 1 |
| 1756.8 | 6 | 8 | 1 |
| 1803.1 | 2 | 20 | 1 |
| 1811.5 | 7 | 8 | 1 |
| 1824.6 | 5 | 12 | 1 |
| 1827.7 | 4 | 12 | 1 |
| 1831.1 | 2 | 16 | 1 |
| 2035.5 | 4 | 16 | 1 |
| 2106.7 | 4 | 20 | 1 |
| 2108.5 | 5 | 16 | 1 |
| 2128.4 | 6 | 12 | 1 |
| 2206.3 | 6 | 16 | 1 |
| 2218.8 | 7 | 12 | 1 |
| 2224.2 | 7 | 16 | 1 |
| 2243.4 | 5 | 20 | 1 |
| 2291.2 | 6 | 20 | 1 |
| 2329.5 | 7 | 20 | 1 |
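The sweep suggests throughput grows mostly with the total number of parallel environments (workers × envs_per_worker). A minimal sketch of how the best-performing row might be expressed as an RLlib-style IMPALA config dict (keys follow RLlib's older dict-config API; treat the exact schema as an assumption and check against your RLlib version):

```python
# Illustrative RLlib-style IMPALA config matching the best row of the sweep
# above (7 workers x 20 envs per worker on a 1-GPU, 8-vCPU node).
config = {
    "num_workers": 7,           # rollout workers collecting experience
    "num_envs_per_worker": 20,  # vectorized envs per rollout worker
    "num_cpus_per_worker": 1,   # CPU budget per worker
    "num_gpus": 1,              # learner GPU (the eval node has one P100)
}

# Total parallel environments feeding the learner
total_envs = config["num_workers"] * config["num_envs_per_worker"]
print(total_envs)  # 140
```

Note that with 7 workers plus the learner process, the configuration already saturates the 8 vCPUs on the evaluation node.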

I just want to confirm that the PyTorch IMPALA baseline model does seem bugged. No changes, same model included in the starter kit; `use_pytorch` is `True` in the YAML config. Extremely low throughput.
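For reference, the switch in the experiment YAML looks roughly like this (only the `use_pytorch` key is confirmed above; the nesting is illustrative):

```yaml
# Experiment config fragment (illustrative nesting)
config:
  use_pytorch: True   # select the PyTorch policy instead of TensorFlow
```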

Here is something I see in the metrics logs:

```
(pid=102) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
```
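That warning fires when a read-only NumPy array is handed to `torch.from_numpy`, which shares memory with the array. A small sketch of the pattern and the usual workaround (copying the array before conversion); whether this is the actual cause of the throughput drop here is not confirmed:

```python
import numpy as np
import torch

# Simulate a read-only array, as produced e.g. by shared-memory buffers.
arr = np.arange(4, dtype=np.float32)
arr.setflags(write=False)

# torch.from_numpy(arr) would emit the non-writeable-tensor warning.
# Copying first gives PyTorch a writeable buffer and silences it.
t = torch.from_numpy(arr.copy())
t += 1  # safe: mutates the copy, not the original array
print(t.tolist())  # [1.0, 2.0, 3.0, 4.0]
```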


I am also experiencing the same issues. Any idea if this will be fixed for PyTorch?

Hello @bob_wei

Can you try using a PyTorch 1.3.x version?


@jyotish Downgrading to 1.3.x works locally, thanks.

How do you make sure it installs 1.3.x when submitting? I added `torch==1.3.1` to `requirements.txt` (since `RUN pip install -r requirements.txt --no-cache-dir` is called in the Dockerfile), but `print(torch.__version__)` in the IMPALA agent still gave me 1.5.x.

Hello @mtrazzi

You need to set `"docker_build": true` in your `aicrowd.json` file. Without this, we will use the default image and nothing from your `requirements.txt` will be pip-installed.
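For reference, a minimal `aicrowd.json` fragment with the flag set (other submission fields elided; see the starter kit for the full schema):

```json
{
  "docker_build": true
}
```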
