Is anyone experiencing the same warnings/errors?

mtrazzi · July 8, 2020, 10:58am

Hi everyone,

Posting some warnings / errors I consistently have in my training logs so people can tell me if they’re experiencing the same crashes. (Errors from this log).

Dashboard crashes with error while attempting to bind on address ('::1', 8265, 0, 0): cannot assign requested address. I solved this by adding webui_host='127.0.0.1' in ray_init in train.py (cf. stackoverflow) on Google Colab, not sure I need to do the same for the AIcrowd submission (which would mean touching train.py).
ls: cannot access '/outputs/ray-results/procgen-ppo/*/': this seems to come from how variables are set in run.sh. Don’t know why they would want to access ray-results early on.
given NumPy array is not writeable (solved by downgrading to torch 1.3.1 locally, but still unclear how to downgrade when submitting, cf. discussion)
[Errno 2] No such file or directory: 'merged-videos/training.mp4': seems to be on aicrowd side, but maybe we need to change how we log videos? see this example or this PR.
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! ... you may need to pass an argument with the flag '--shm-size' to 'docker run'.: this seems to be on aicrowd server side, but maybe we need to change the docker file or do clever things?
The process_trial operation took 1.1798417568206787 seconds to complete, which may be a performance bottleneck: this is from just scaling the number of channels in impala baseline by 4x (so 16x the params). Have people been experiencing the same performance bottlenecks with other models?

jyotish · July 8, 2020, 12:07pm

Hello @mtrazzi

First of all, thanks for your effort in writing down these warnings/errors here. I’m sure that a lot of participants would have similar questions.

Dashboard crashes with error while attempting to bind on address ('::1', 8265, 0, 0): cannot assign requested address. I solved this by adding webui_host='127.0.0.1' in ray_init in train.py (cf. stackoverflow) on Google Colab, not sure I need to do the same for the AIcrowd submission (which would mean touching train.py).

You can ignore the dashboard error. You see that error because the port on which the dashboard is trying to bind is not available for it.
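
For anyone who wants to check locally whether a loopback address can be bound at all, here is a minimal sketch using only the Python standard library. The helper name `can_bind` is hypothetical and this is not part of Ray or the starter kit; it only tests whether the operating system accepts a bind on the given host, which is the operation the error message complains about.

```python
import socket


def can_bind(host: str, port: int = 0) -> bool:
    """Return True if a TCP socket can bind to (host, port) on this machine.

    Port 0 asks the OS for any free port, so only the address is being tested.
    """
    # Pick the address family from the host string (a colon means IPv6).
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
        return True
    except OSError:
        # Raised when the address or address family is unavailable.
        return False


if __name__ == "__main__":
    for host in ("::1", "127.0.0.1"):
        print(host, "bindable" if can_bind(host) else "not bindable")
```

If `::1` is reported as not bindable while `127.0.0.1` is, that would be consistent with the workaround of pointing the dashboard at the IPv4 loopback address.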

ls: cannot access '/outputs/ray-results/procgen-ppo/*/': this seems to come from how variables are set in run.sh. Don’t know why they would want to access ray-results early on.

We run the evaluations on preemptible instances, which means that the node on which an evaluation is running can shut down at any time. Before starting the training, we check for any existing checkpoints and resume from that point. On a fresh run there is nothing to find yet, which is why ls reports the missing path. I understand that this is causing some confusion. We will hide these outputs in the next grader update.
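
The check-then-resume idea can be sketched roughly as follows. This is an illustration only, not the actual run.sh logic: the function name `latest_checkpoint` and the assumed layout (`<results_dir>/<trial>/checkpoint_<n>`) are assumptions made for the example.

```python
import glob
import os
from typing import Optional


def latest_checkpoint(results_dir: str) -> Optional[str]:
    """Return the most recently modified checkpoint directory, or None.

    Assumes (hypothetically) a layout of <results_dir>/<trial>/checkpoint_<n>.
    """
    pattern = os.path.join(results_dir, "*", "checkpoint_*")
    candidates = [p for p in glob.glob(pattern) if os.path.isdir(p)]
    if not candidates:
        # Fresh run: nothing to resume from, so training starts from scratch.
        return None
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    found = latest_checkpoint("/outputs/ray-results/procgen-ppo")
    print("resume from:", found if found else "nothing, starting fresh")
```

Unlike a bare `ls` on a glob, this returns `None` quietly when no trial directory exists yet, so a first run produces no error output.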

given NumPy array is not writeable (solved by downgrading to torch 1.3.1 locally, but still unclear how to downgrade when submitting, cf. discussion)

You can update the requirements.txt or edit your Dockerfile accordingly. Please make sure that you set "docker_build": true in your aicrowd.json file. If this is not set to true, we will not trigger a docker build and will use a default image.
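
As a small illustration of the second point, the sketch below flips the flag in an existing aicrowd.json. The helper name `enable_docker_build` is hypothetical, and the script assumes the file is plain JSON in the current directory; every other key in the file is left as it was. For the torch downgrade itself, a pinned line such as `torch==1.3.1` in requirements.txt would be the usual pip form.

```python
import json


def enable_docker_build(path: str = "aicrowd.json") -> dict:
    """Set "docker_build": true in the given JSON file and return the config."""
    with open(path) as f:
        config = json.load(f)

    # Without this flag a default image is used and no docker build is triggered.
    config["docker_build"] = True

    with open(path, "w") as f:
        json.dump(config, f, indent=2)
        f.write("\n")
    return config
```

Calling `enable_docker_build()` once from the repository root before committing should be enough; editing the file by hand works just as well.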

[Errno 2] No such file or directory: 'merged-videos/training.mp4': seems to be on aicrowd side, but maybe we need to change how we log videos? see this example or this PR.

This is not related to the issue you were mentioning. We generate videos every few iterations and upload them at regular intervals during training. These videos are shown on the GitLab issue page as well as the submission dashboard. This error means that we tried to upload a video but there was none. This typically happens when throughput is very low or when rendering doesn’t work. If you can see a video on the AIcrowd website and the dashboard, rendering is not the problem. It looks like we missed some error handling here (though it will not affect your evaluation in any way). We will fix this in the next grader update.

WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! ... you may need to pass an argument with the flag '--shm-size' to 'docker run'.: this seems to be on aicrowd server side, but maybe we need to change the docker file or do clever things?

Yes, Docker allocates very little shared memory by default. Typically, I do not expect this to have a drastic impact on performance. In terms of throughput, we were getting similar results on a physical machine and on the evaluation pipeline. But if you want us to increase this, please reach out to us and we will be glad to look into it.
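
To put the number from the warning in perspective, 67108864 bytes is exactly 64 MiB. A quick way to see how much shared memory a container has is sketched below with the standard library only; the helper names `shm_free_bytes` and `to_mib` are made up for this example.

```python
import os
import shutil
from typing import Optional


def shm_free_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Return the free space in bytes under `path`, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


if __name__ == "__main__":
    free = shm_free_bytes()
    if free is None:
        print("/dev/shm is not available on this system")
    else:
        print(f"/dev/shm free: {to_mib(free):.1f} MiB")
    # The figure quoted in the warning:
    print(f"warning value: {to_mib(67108864):.1f} MiB")
```

Running this inside the container, locally with and without `--shm-size` passed to `docker run`, shows whether the flag took effect.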