After *exactly* 2 hours of usage, ObstacleTowerEnv ends in SIGABRT from gRPC

#1

Hello everybody!

Here is the error that I get from a Python process that I spawn to handle one instance of ObstacleTowerEnv, among many others:

E0712 18:36:48.452364988   30043 ev_epoll1_linux.cc:1061]    assertion failed: next_worker->initialized_cv

I am training by harvesting experience from multiple instances of ObstacleTowerEnv across multiple processes, each environment being spawned with a different worker_id (following a previous discussion: Running multiple instances).

Nevertheless, the issue occurs independently of the number of environments/processes spawned.

I have traced it back to gRPC, which is used for the client-server communication of each ObstacleTowerEnv instance.

Since it terminates my harvesting processes with a SIGABRT, I intended to simply let the process die, close the environment instance, and then start a new process and a new environment instance (with another worker_id), but there seems to be something I am still not grasping.
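For illustration only, the bookkeeping behind such a restart scheme could look like the sketch below. It is not the code used in this thread: `died_from_sigabrt`, `next_worker_id` and `plan_restarts` are hypothetical helper names, and the only library behaviour relied on is that `multiprocessing.Process.exitcode` is the negative signal number when a child is killed by a signal.

```python
import signal


def died_from_sigabrt(exitcode):
    """Return True if a Process.exitcode value means the child was killed by SIGABRT."""
    # exitcode is None while the child runs, 0 on a clean exit, -N for signal N.
    return exitcode is not None and exitcode == -int(signal.SIGABRT)


def next_worker_id(used_ids):
    """Return the smallest non-negative worker_id not yet handed out (hypothetical policy)."""
    candidate = 0
    while candidate in used_ids:
        candidate += 1
    return candidate


def plan_restarts(exitcodes_by_worker_id):
    """Map each aborted worker_id to a fresh one; leave healthy workers alone."""
    used = set(exitcodes_by_worker_id)
    replacements = {}
    for worker_id, exitcode in sorted(exitcodes_by_worker_id.items()):
        if died_from_sigabrt(exitcode):
            new_id = next_worker_id(used)
            used.add(new_id)
            replacements[worker_id] = new_id
    return replacements
```

A supervisor would call `plan_restarts` with `{worker_id: process.exitcode}` and then start a new process and environment for each replacement id.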

Since I cannot work around the problem, I would be grateful for any advice pointing me in a better direction!

I am training with PyTorch, using the following packages on Python 3.6.8 and Ubuntu 16.04.6 LTS (Xenial Xerus); I reproduced the error on Ubuntu 18.04.2 LTS (Bionic Beaver):

absl-py==0.7.1
astor==0.8.0
atari-py==0.2.3
atomicwrites==1.3.0
attrs==19.1.0
backcall==0.1.0
cloudpickle==1.2.1
cycler==0.10.0
decorator==4.4.0
dill==0.3.0
docopt==0.6.2
future==0.17.1
gast==0.2.2
google-pasta==0.1.7
grpcio==1.11.1
gym==0.13.1
gym-rock-paper-scissors==0.1
h5py==2.9.0
importlib-metadata==0.18
ipdb==0.12
ipython==7.6.1
ipython-genutils==0.2.0
jedi==0.14.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
matplotlib==3.1.1
mlagents-envs==0.6.2
more-itertools==7.1.0
numpy==1.16.1
-e git+https://github.com/Unity-Technologies/obstacle-tower-env/@474fbf00564ae1357373b1e2d72dcb9af095540b#egg=obstacle_tower_env
opencv-python==4.1.0.25
packaging==19.0
pandas==0.24.2
parso==0.5.0
pexpect==4.7.0
pickleshare==0.7.5
Pillow==5.4.1
pluggy==0.12.0
prompt-toolkit==2.0.9
protobuf==3.6.1
ptyprocess==0.6.0
py==1.8.0
pyglet==1.3.2
Pygments==2.4.2
PyOpenGL==3.1.0
pyparsing==2.4.0
pytest==3.10.1
python-dateutil==2.8.0
pytz==2019.1
PyYAML==5.1.1
-e git+https://github.com/Danielhp95/Generalized-RL-Self-Play-Framework/@bd872b3b547a008fe126a3584b83448157f5ee3d#egg=regym
scipy==1.3.0
seaborn==0.9.0
six==1.12.0
tensorboard==1.12.0
tensorboardX==1.8
tensorflow==1.12.0
tensorflow-estimator==1.14.0
termcolor==1.1.0
torch==1.1.0
torchvision==0.3.0
tqdm==4.32.2
traitlets==4.3.2
wcwidth==0.1.7
Werkzeug==0.15.4
wrapt==1.11.2
zipp==0.5.2

EDIT:

  1. I am realizing that it might be important to mention the following, with regard to spawning the processes and creating the environment instances: I call the ObstacleTowerEnv() constructor in the main process (many times), and then pass each instance as an argument to a new process that communicates with the main process via Queues.
    If I create the environment inside the spawned process, I end up with a UnityTimedOutException…

  2. I am using PyTorch, which implements its own flavour of multiprocessing, and I am using that as well. At some point I assumed that it was colliding with ObstacleTowerEnv’s own multiprocessing needs, but my inquiries were not fruitful…
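To make the layout from point 1 concrete, here is a minimal sketch of a worker loop driven through a command queue and a result queue. Everything in it is an assumption for illustration: `DummyEnv` is a stand-in with a gym-style `reset`/`step`/`close` interface rather than the real ObstacleTowerEnv, the command tuples are a made-up protocol, and `queue.Queue` is used in a single process so the sketch runs without Unity. In the real setup the loop would run in a child process with multiprocessing queues, which offer the same `put`/`get` calls.

```python
import queue


class DummyEnv:
    """Stand-in for an environment with a gym-style reset/step/close interface."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.steps = 0
        self.closed = False

    def reset(self):
        self.steps = 0
        return 0  # placeholder observation

    def step(self, action):
        self.steps += 1
        done = self.steps >= 3  # arbitrary episode length for the sketch
        return self.steps, 1.0, done, {}  # observation, reward, done, info

    def close(self):
        self.closed = True


def worker_loop(env, commands, results):
    """Serve ("reset",), ("step", action) and ("close",) commands until closed."""
    while True:
        command = commands.get()
        if command[0] == "reset":
            results.put(env.reset())
        elif command[0] == "step":
            results.put(env.step(command[1]))
        elif command[0] == "close":
            env.close()
            results.put(None)  # acknowledge shutdown to the main process
            break


if __name__ == "__main__":
    commands, results = queue.Queue(), queue.Queue()
    for command in [("reset",), ("step", 0), ("close",)]:
        commands.put(command)
    worker_loop(DummyEnv(worker_id=0), commands, results)
    print(results.get(), results.get(), results.get())
```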

#2

I have not found the root cause per se, but both hurdles (i.e. (1) creating environment instances within spawned/forked processes without raising UnityTimedOutException, and (2) using the environment instances without gRPC failing after exactly 2 hours) vanished once the following are set:

torch.multiprocessing.set_start_method('forkserver')
torch.multiprocessing.set_sharing_strategy('file_system')
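As a rough sketch of where these two calls would go: they should run once, in the main process, before any worker process or environment is created. The helper name `configure_multiprocessing` is hypothetical, and the fallback to the standard `multiprocessing` module is only there so the sketch runs without PyTorch installed (in that case the sharing strategy, which is PyTorch-specific, is skipped).

```python
import multiprocessing as std_mp

try:
    import torch.multiprocessing as mp  # PyTorch's wrapper around multiprocessing
    HAVE_TORCH = True
except ImportError:
    mp = std_mp  # fallback so the sketch still runs without PyTorch
    HAVE_TORCH = False


def configure_multiprocessing(start_method="forkserver",
                              sharing_strategy="file_system"):
    """Apply both settings once, before any worker process is created."""
    # force=True avoids a RuntimeError if a start method was already chosen.
    mp.set_start_method(start_method, force=True)
    if HAVE_TORCH:
        # Only PyTorch's module knows about tensor sharing strategies.
        mp.set_sharing_strategy(sharing_strategy)
    return mp.get_start_method()


if __name__ == "__main__":
    print(configure_multiprocessing())
```

With 'forkserver' (as with 'spawn'), child processes re-import the main module, so process creation belongs under the `if __name__ == "__main__":` guard.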

Sources:

  1. https://pytorch.org/docs/master/multiprocessing.html#multiprocessing-cuda-sharing-details
  2. https://github.com/pytorch/pytorch/issues/11201

Hopefully this will be helpful to more than just me, so good luck to whoever is reading this :slight_smile:!