Why I was blocked from uploading new models for 3 days

joe_booth · July 8, 2019, 8:12pm

I’ve spent countless hours over the past 3 days trying to figure out why I could not upload/evaluate a new model nor reproduce the problem locally.

Basically, there is a breaking change whereby existing code will no longer run server-side. It would have saved me hours had there been a better error and/or someplace to look for notifications (I don’t think it helps to have 2 Unity repos with issue trackers as well as the AIcrowd message board).

June 13th was my last successful upload of a model.
On July 5th I tried to upload a new model - the only change to my code base was the addition of the model and a reference to that model.
When aicrowd-bot posted the log, it just said this:

2019-07-06T07:13:55.1056876Z root
2019-07-06T07:13:55.129380334Z Traceback (most recent call last):
2019-07-06T07:13:55.12943787Z   File "run_evaluation.py", line 7, in <module>
2019-07-06T07:13:55.129471922Z     import gym
2019-07-06T07:13:55.129478407Z ModuleNotFoundError: No module named 'gym'

I learned that we can now do debug submits (Announcement: Debug your submissions); however, all it gave was the same log.
I tried to reproduce locally; however, the build.sh script was giving me the error: AttributeError: /srv/conda/bin/python: undefined symbol: archive_errno
I thought maybe a conda or pip package update may have broken something, so I manually pinned each one to the valid version from June 13th.
I thought there may be some local issue with my docker, so I cleaned, deleted, and reset it.
I ran pip install --upgrade aicrowd-repo2docker and saw that it updated. This solved my local issue and I was able to reproduce the server-side error.
Given that aicrowd-repo2docker had been updated, I thought to look at the commit logs and found that this commit removed source activate base from run.sh: https://github.com/Unity-Technologies/obstacle-tower-challenge/commit/99c68faf2ed0f01ee8bc3e411bbdd4e85484a733
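
When an import fails inside the image like this, it can help to print which interpreter is actually running and whether the expected modules resolve from it. What follows is only a minimal sketch using the standard library; report_environment is a hypothetical helper name (not part of aicrowd-repo2docker or the starter kit), and the module list is illustrative.

```python
import importlib.util
import sys


def report_environment(modules):
    """Return the interpreter path and, per module, whether it can be found.

    Hypothetical helper for illustration. Pass top-level module names only:
    find_spec may raise for a dotted name whose parent package is missing.
    """
    found = {}
    for name in modules:
        # find_spec returns None when a top-level module is not importable
        found[name] = importlib.util.find_spec(name) is not None
    return {"python": sys.executable, "modules": found}


if __name__ == "__main__":
    # "gym" is the module named in the traceback; add others as needed.
    report = report_environment(["gym"])
    print("interpreter:", report["python"])
    for name, ok in report["modules"].items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If the interpreter path printed is not the one the packages were installed into, the import fails even though the package is present in the image.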

Note: I still cannot test locally - the agent code runs, but the environment docker immediately drops out. I also have to manually delete the docker image to force it to rebuild (this was not the case prior to the aicrowd-repo2docker upgrade).

unixpickle · July 8, 2019, 8:28pm

I had the same ModuleNotFoundError as you. I followed the solutions in

arthurj · July 8, 2019, 10:19pm

Hi @joe_booth, I am sorry you’ve run into these issues. The problem is due to updates AIcrowd made to their evaluation system mid-contest. @mohanty or @shivam should be able to provide more context along with some support on getting your submission working under the new system.

mohanty · July 9, 2019, 9:35am

Hi @joe_booth,

Apologies for the trouble you had to go through and for the lost time.
We did announce the updates to aicrowd-repo2docker and the update to run.sh on the AIcrowd forums, but you are right, we could have posted it uniformly across all the communication channels: the GitHub issue trackers and the forums.

Regarding better ways to provide logs back to the participants, we do understand the need for that, and are working on figuring out a better solution. From a technical point of view, it is actually straightforward for us to give you access to the whole build and evaluation logs, which would make the debugging process much easier for you, but that also opens the possibility of participants trying to game the system by leaking information out of the evaluation setup. It might be argued that in a simple reinforcement learning setup like this, it might not be a huge risk, but in some other competitions, it opens up the risk of participants intentionally or unintentionally leaking out the ground truth. With these constraints in mind, we are still working on coming up with a unified solution which finds a good balance between limiting the risk of information leaks and making debugging easier for users, and we are open to your feedback and suggestions regarding the same.

Regarding the question about the mid-competition changes to the evaluation system: those are scheduled incremental changes that are site-wide and are not tied to the timeline of an individual challenge.
Given that reproducibility of the solutions is a key goal for us, and we depend on docker base images like nvidia/cuda, it is important to keep incrementally accepting the upstream changes to the stable base image(s); otherwise, sooner or later the individual submissions would break anyway and not be reproducible.

If it helps, we might let participants pin their aicrowd-repo2docker version to their submissions, which would allow retrospective builds even when you are not using the most updated version of aicrowd-repo2docker.
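
On the participant side, recording the version of the tool a submission was built with could look something like the sketch below. It is only an illustration under stated assumptions: pin_line is a hypothetical helper, and whether or how the evaluator would read such a pin is not something the current system defines.

```python
from importlib import metadata


def pin_line(package):
    """Return 'name==version' for an installed package, or None if absent.

    Hypothetical helper for illustration; it only reads local package metadata.
    """
    try:
        return f"{package}=={metadata.version(package)}"
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints a requirements-style line for the locally installed tool,
    # or None if aicrowd-repo2docker is not installed in this environment.
    print(pin_line("aicrowd-repo2docker"))
```

Keeping such a line alongside the submission at least documents which tool version produced a working build, independent of any server-side support.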

Thank you for your patience and your active interest and participation in the competition, and apologies again for the troubles you faced.

Cheers,
Mohanty

joe_booth · July 10, 2019, 4:15am

Thanks @mohanty! One problem I’m still having is running the test seeds locally (100-105): when I invoke the environment docker, it exits. I’m trying on macOS.

mohanty · July 10, 2019, 9:14am

@joe_booth: The evaluator internally uses a custom binary which is not public, but @arthurj might share more information about running the test seeds locally in the publicly available environment.

joe_booth · July 10, 2019, 4:24pm

It’s when I try to run it locally per these steps: https://github.com/Unity-Technologies/obstacle-tower-challenge#run-docker-image

It used to work, but not since the update of aicrowd-repo2docker. @arthurj, does it work for you guys on Mac?

anssi · July 10, 2019, 5:18pm

Try running the environment without docker with same environment variables and port argument as in the examples. This works for me when I test my submission image.

joe_booth · July 12, 2019, 12:09am

@Miffyli - which platform are you running on? I cannot get it working on macOS, and from what I read, docker does not support --network=host on Mac.

anssi · July 13, 2019, 6:51pm

I use Linux (Ubuntu 16.04), and I dunno if it works on other platforms.