The PyTorch IMPALA baseline fails but then continues to run. Its reported throughput stays below 100 and occasionally even goes negative, yet the dashboard log shows the model finished training in about an hour at a reasonable pace. I've also been seeing low throughput with my own PyTorch models. Has anyone else been experiencing similar issues?
Do you mean low throughput on the evaluation server, or locally as well?
In any case, @jyotish: do you want to share the throughput spreadsheet you prepared for tuning num_workers etc. for the baselines?
It might be helpful for participants to get a rough baseline of what throughput to expect in different scenarios.
Hello @gregory_eales,
The “fails but then continues” behaviour is because we re-evaluated your submission: it failed the first time due to an internal glitch.
Below is the throughput vs. Ray worker configuration on an evaluation node (1 P100 GPU, 8 vCPUs) for the IMPALA baseline (TensorFlow version).
| throughput | workers | envs_per_worker | cpus_per_worker |
|---|---|---|---|
| 757.76 | 2 | 2 | 1 |
| 923.45 | 4 | 2 | 1 |
| 993.13 | 6 | 2 | 1 |
| 1006.23 | 5 | 2 | 1 |
| 1107.86 | 7 | 2 | 1 |
| 1109.08 | 2 | 4 | 1 |
| 1362.10 | 4 | 4 | 1 |
| 1409.11 | 5 | 4 | 1 |
| 1457.70 | 6 | 4 | 1 |
| 1460.45 | 2 | 8 | 1 |
| 1534.67 | 7 | 4 | 1 |
| 1613.77 | 2 | 12 | 1 |
| 1732.68 | 4 | 8 | 1 |
| 1735.01 | 5 | 8 | 1 |
| 1756.76 | 6 | 8 | 1 |
| 1803.12 | 2 | 20 | 1 |
| 1811.49 | 7 | 8 | 1 |
| 1824.60 | 5 | 12 | 1 |
| 1827.74 | 4 | 12 | 1 |
| 1831.15 | 2 | 16 | 1 |
| 2035.54 | 4 | 16 | 1 |
| 2106.67 | 4 | 20 | 1 |
| 2108.47 | 5 | 16 | 1 |
| 2128.37 | 6 | 12 | 1 |
| 2206.31 | 6 | 16 | 1 |
| 2218.84 | 7 | 12 | 1 |
| 2224.17 | 7 | 16 | 1 |
| 2243.45 | 5 | 20 | 1 |
| 2291.23 | 6 | 20 | 1 |
| 2329.46 | 7 | 20 | 1 |
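For anyone who wants to try these settings themselves, here is a minimal sketch of how the knobs above map onto an RLlib IMPALA run. This is not the starter kit's exact config file; the env name, stop condition, and other hyperparameters are placeholders you should replace with your own.

```python
import ray
from ray import tune

ray.init()

tune.run(
    "IMPALA",
    stop={"timesteps_total": 8_000_000},      # placeholder stop condition
    config={
        "env": "procgen:procgen-coinrun-v0",  # placeholder env id, adjust to yours
        "num_gpus": 1,                        # the evaluation node has a single P100
        "num_workers": 6,                     # "workers" column in the table above
        "num_envs_per_worker": 20,            # "envs_per_worker" column
        "num_cpus_per_worker": 1,             # "cpus_per_worker" column
        # "use_pytorch": True,                # uncomment for the PyTorch implementation
    },
)
```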
I just want to confirm that the PyTorch IMPALA baseline does seem bugged: no changes, the same model included in the starter kit, `use_pytorch` set to `True` in the YAML config, and the throughput is extremely low.
Here is something I see in the metrics logs:
(pid=102) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
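For context, that warning just means a read-only NumPy array was passed to `torch.from_numpy` somewhere in the pipeline. If you hit it in your own preprocessing code, copying the array first makes it go away; a minimal standalone illustration (not a patch to the baseline):

```python
import numpy as np
import torch

obs = np.zeros((64, 64, 3), dtype=np.uint8)
obs.setflags(write=False)          # simulate a non-writeable observation array

t1 = torch.from_numpy(obs)         # emits the UserWarning shown above
t2 = torch.from_numpy(obs.copy())  # copying first avoids the warning entirely
```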
I am also experiencing the same issues. Any idea if this will be fixed for PyTorch?
@jyotish Downgrading to 1.3.x works locally, thanks.
How do you make sure it installs 1.3.x when submitting? I added `torch==1.3.1` to `requirements.txt` (since `RUN pip install -r requirements.txt --no-cache-dir` is called in the Dockerfile), but `print(torch.__version__)` in the IMPALA agent still gave me 1.5.x.
Hello @mtrazzi,
You need to set `"docker_build": true` in your `aicrowd.json` file. Without this, we will use the default image and there will not be any pip installs from your `requirements.txt`.
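Once `docker_build` is enabled, a quick sanity check near the top of your agent can catch the case where the pin did not actually take effect; a small sketch, not part of the starter kit:

```python
import torch

# Fail fast if the evaluation image was built without the pinned requirement.
assert torch.__version__.startswith("1.3."), (
    "Expected torch 1.3.x, got %s" % torch.__version__
)
```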