First of all, thanks for your effort in writing down these warnings/errors here. I’m sure that a lot of participants would have similar questions.
> Dashboard crashes with `error while attempting to bind on address ('::1', 8265, 0, 0): cannot assign requested address`. I solved this by adding … to `train.py` (cf. stackoverflow) on Google Colab; not sure whether I need to do the same for the AIcrowd submission (which would mean touching `train.py`).
You can ignore the dashboard error. You see it because the port the dashboard is trying to bind to is not available to it.
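If you want to check locally whether an address is actually bindable (for example, whether the `::1` loopback exists at all in your environment), a minimal stdlib sketch:

```python
import socket

def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can be bound to (host, port)."""
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_STREAM) as s:
            s.bind((host, port))
            return True
    except OSError:
        return False

# Port 0 asks the OS for any free port, so the result reflects
# whether the address itself is usable, not a port clash.
print(can_bind("127.0.0.1", 0))
print(can_bind("::1", 0))  # may be False on hosts without IPv6 loopback
```

In recent Ray versions you can also skip the bind attempt entirely with `ray.init(include_dashboard=False)` (the exact argument name depends on your Ray version), which sidesteps the error on Colab.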
> `ls: cannot access '/outputs/ray-results/procgen-ppo/*/'` — this seems to be in how variables are set in `run.sh`. Don't know why they would want to access ray-results early on.
We run the evaluations on preemptible instances, which means that the node on which the evaluation is running can shut down at any time. Before starting the training, we check for any existing checkpoints and resume from that point. I understand that this is causing some confusion; we will hide these outputs in the next grader update.
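Conceptually, the resume check amounts to something like this (a sketch, not the actual grader code; the results path is taken from the error message above, and on a fresh node it simply finds nothing — which is why that early `ls` fails harmlessly):

```python
from pathlib import Path

def find_latest_checkpoint(results_dir: str = "/outputs/ray-results/procgen-ppo"):
    """Return the most recently modified trial directory under results_dir, or None.

    None means there is nothing to resume from and training starts fresh.
    """
    root = Path(results_dir)
    if not root.is_dir():
        return None  # fresh node: no previous results directory at all
    trials = [p for p in root.iterdir() if p.is_dir()]
    if not trials:
        return None
    return max(trials, key=lambda p: p.stat().st_mtime)

latest = find_latest_checkpoint()
print("Resuming from" if latest else "Starting fresh", latest or "")
```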
> `given NumPy array is not writeable` (solved by downgrading to torch 1.3.1 locally, but still unclear how to downgrade when submitting, cf. discussion)
You can update the `requirements.txt` or edit your `Dockerfile` accordingly. Please make sure that you set `"docker_build": true` in your `aicrowd.json` file. If this is not set to `true`, we will not trigger a Docker build and will use a default image.
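For reference, the relevant part of `aicrowd.json` would look like this (only the `docker_build` flag matters here; any other fields in your file stay as they are):

```json
{
  "docker_build": true
}
```

With that set, a line like `torch==1.3.1` in your `requirements.txt` will be picked up during the Docker build.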
> `[Errno 2] No such file or directory: 'merged-videos/training.mp4'` — seems to be on the AIcrowd side, but maybe we need to change how we log videos? See this example or this PR.
This is not related to the issue you were mentioning. We generate a few videos every few iterations and upload them at regular intervals during training. These videos are shown on the GitLab issue page as well as the submission dashboard. This error means that we tried to upload a video but there was no video. That typically happens when throughput is very low or the rendering doesn't work. If you are able to see a video on the AIcrowd website and the dashboard, rendering is not the problem. It looks like we missed some error handling here (though it will not affect your evaluation in any way). We will fix this in the next grader update.
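The missing error handling amounts to a guard like the following (a sketch only; the upload step is stubbed out, and the path comes from the error message above):

```python
from pathlib import Path

def upload_video_if_present(path: str = "merged-videos/training.mp4") -> bool:
    """Upload the merged training video if it exists; skip gracefully otherwise.

    Returns True if an upload was attempted, False if no video was found.
    """
    video = Path(path)
    if not video.is_file():
        # No video was produced in this interval (low throughput or no rendering).
        print(f"No video at {video}; skipping upload this interval.")
        return False
    # ... the actual upload call would go here ...
    return True
```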
> `WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! ... you may need to pass an argument with the flag '--shm-size' to 'docker run'.` — this seems to be on the AIcrowd server side, but maybe we need to change the Dockerfile or do clever things?
Yes, Docker by default allocates very little shared memory (the 67108864 bytes in the warning is 64 MB). Typically, I do not expect this to have a drastic impact on performance; in terms of throughput, we were getting similar results on a physical machine and on the evaluation pipeline. But if you want us to increase this, please reach out to us and we will be glad to look into it.