Unusually large tensor in the starter code

I receive an OOM error when running the starter code. Here is the error message I get:

== Status ==
Memory usage on this node: 3.6/62.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 7/10 CPUs, 0.9/1 GPUs, 0.0/13.96 GiB heap, 0.0/6.4 GiB objects
Result logdir: /home/aptx4869/ray_results/procgen-ppo
Number of trials: 1 (1 RUNNING)
+-------------------------------+----------+-------+
| Trial name                    | status   | loc   |
|-------------------------------+----------+-------|
| PPO_procgen_env_wrapper_00000 | RUNNING  |       |
+-------------------------------+----------+-------+

(pid=5272) 2020-06-24 09:26:36,869 INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
(pid=5272) 2020-06-24 09:26:36,870 INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=5272) 2020-06-24 09:26:44,889 INFO trainable.py:217 -- Getting current IP.
(pid=5272) 2020-06-24 09:26:44,889 WARNING util.py:37 -- Install gputil for GPU system monitoring.
Traceback (most recent call last):
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 467, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 431, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/worker.py", line 1515, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ResourceExhaustedError): ray::PPO.train() (pid=21577, ip=192.168.1.102)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,32,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[{{node default_policy_1/tower_1/gradients_1/default_policy_1/tower_1/model_1/max_pooling2d_1/MaxPool_grad/MaxPoolGrad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

ray::PPO.train() (pid=21577, ip=192.168.1.102)
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 462, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 417, in ray._raylet.execute_task.function_executor
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 498, in train
    raise e
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 484, in train
    result = Trainable.train(self)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/tune/trainable.py", line 261, in train
    result = self._train()
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 151, in _train
    fetches = self.optimizer.step()
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 212, in step
    self.per_device_batch_size)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/rllib/optimizers/multi_gpu_impl.py", line 257, in optimize
    return sess.run(fetches, feed_dict=feed_dict)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,32,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
  [[node default_policy_1/tower_1/gradients_1/default_policy_1/tower_1/model_1/max_pooling2d_1/MaxPool_grad/MaxPoolGrad (defined at /.conda/envs/procgen/lib/python3.7/site-packages/ray/rllib/agents/ppo/ppo_tf_policy.py:195) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

== Status ==
Memory usage on this node: 17.9/62.8 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0.0/1 GPUs, 0.0/13.96 GiB heap, 0.0/6.4 GiB objects
Result logdir: /home/aptx4869/ray_results/procgen-ppo
Number of trials: 1 (1 ERROR)
+-------------------------------+----------+-------+
| Trial name                    | status   | loc   |
|-------------------------------+----------+-------|
| PPO_procgen_env_wrapper_00000 | ERROR    |       |
+-------------------------------+----------+-------+
Number of errored trials: 1
+-------------------------------+--------------+--------------------------------------------------------------------------------------------------------+
| Trial name                    |   # failures | error file                                                                                             |
|-------------------------------+--------------+--------------------------------------------------------------------------------------------------------|
| PPO_procgen_env_wrapper_00000 |            1 | /home/aptx4869/ray_results/procgen-ppo/PPO_procgen_env_wrapper_0_2020-06-24_09-18-39m1tpdxon/error.txt |
+-------------------------------+--------------+--------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "train.py", line 235, in <module>
    run(args, parser)
  File "train.py", line 229, in run
    concurrent=True)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/tune/tune.py", line 411, in run_experiments
    return_trials=True)
  File "/home/aptx4869/.conda/envs/procgen/lib/python3.7/site-packages/ray/tune/tune.py", line 347, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO_procgen_env_wrapper_00000])

It seems the error occurs because of the unusually large tensor with shape [2048, 32, 32, 32], but I have no idea where it comes from. My GPU has 12 GB of memory. The only thing I changed is the run.sh file, in which I increased the memory and the number of CPUs used by Ray:

  export RAY_MEMORY_LIMIT=15000000000
  export RAY_CPUS=12
  export RAY_STORE_MEMORY=10000000000

Hello @the_raven_chaser

The IMPALA baseline provided in the starter kit takes close to 15.8 GB of GPU memory. As a starting point, you can try setting num_workers: 1 in the experiment YAML file and see if it works. You can also run nvidia-smi to check how much memory is being used by the trainer and the rollout worker, and based on that, try increasing num_workers to a higher number.
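
For concreteness, here is a minimal sketch of where num_workers sits in a Ray Tune / RLlib experiment YAML. The experiment name and surrounding keys below are illustrative rather than the exact contents of the starter-kit file, so keep whatever the provided config already defines and only change the worker count:

  procgen-ppo:                      # experiment name (illustrative)
    env: procgen_env_wrapper
    run: PPO
    config:
      num_workers: 1                # start with a single rollout worker
      # ... leave the remaining config keys from the starter kit unchanged

While a trial is running, watch -n 1 nvidia-smi shows how much GPU memory the trainer process and each rollout worker are holding, which gives a rough sense of how far num_workers can be raised before hitting the limit again.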

Thank you so much and sorry for the late response.