I see this stack trace when I try to submit an agent that worked previously:
2019-05-24T17:42:52.825782788Z root
2019-05-24T17:43:05.17870298Z INFO:mlagents_envs:Start training by pressing the Play button in the Unity Editor.
2019-05-24T17:43:35.184856349Z Traceback (most recent call last):
2019-05-24T17:43:35.184901058Z File "run.py", line 56, in <module>
2019-05-24T17:43:35.184908019Z env = create_single_env(args.environment_filename, docker_training=args.docker_training)
2019-05-24T17:43:35.184929282Z File "/home/aicrowd/util.py", line 16, in create_single_env
2019-05-24T17:43:35.184932961Z env = ObstacleTowerEnv(path, **kwargs)
2019-05-24T17:43:35.184935919Z File "/srv/conda/lib/python3.6/site-packages/obstacle_tower_env.py", line 45, in __init__
2019-05-24T17:43:35.184939382Z timeout_wait=timeout_wait)
2019-05-24T17:43:35.184942214Z File "/srv/conda/lib/python3.6/site-packages/mlagents_envs/environment.py", line 69, in __init__
2019-05-24T17:43:35.184945802Z aca_params = self.send_academy_parameters(rl_init_parameters_in)
2019-05-24T17:43:35.184948806Z File "/srv/conda/lib/python3.6/site-packages/mlagents_envs/environment.py", line 491, in send_academy_parameters
2019-05-24T17:43:35.184952019Z return self.communicator.initialize(inputs).rl_initialization_output
2019-05-24T17:43:35.184954878Z File "/srv/conda/lib/python3.6/site-packages/mlagents_envs/rpc_communicator.py", line 80, in initialize
2019-05-24T17:43:35.184958142Z "The Unity environment took too long to respond. Make sure that :\n"
2019-05-24T17:43:35.184963164Z mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
2019-05-24T17:43:35.184968225Z The environment does not need user interaction to launch
2019-05-24T17:43:35.184972965Z The Academy and the External Brain(s) are attached to objects in the Scene
2019-05-24T17:43:35.184977708Z The environment and the Python interface have compatible versions.
I am definitely using v2.1 of the environment. Not sure what’s going on, but this happened several submissions in a row, and I see that nobody else has successfully submitted in a few days.
@unixpickle: We are looking into it. The only recent change was updating the env binary to v1.2, and it was tested before being deployed. We acknowledge the problem you mention and are looking into it as we speak.
Hoping to have an update soon on this thread.
Looping in @arthurj @harperj : We should figure out a way to nicely override the timeout_wait param from the agent side. Maybe we can expose an environment variable that overrides the timeout_wait parameter, so that participants wouldn't have to set it manually and we can adjust it dynamically based on the current setup of the evaluator. A rough sketch of what I mean is below.
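Something like this on the agent side (the variable name OTC_TIMEOUT_WAIT is only an assumption here, not something the evaluator exposes today):

```python
import os

from obstacle_tower_env import ObstacleTowerEnv

# Hypothetical override: the evaluator could export OTC_TIMEOUT_WAIT so
# participants never have to hard-code a timeout themselves.
timeout_wait = int(os.environ.get("OTC_TIMEOUT_WAIT", "30"))

env = ObstacleTowerEnv(environment_filename=None,  # None: attach to an already-running binary
                       docker_training=True,
                       timeout_wait=timeout_wait)
```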
Looks like I am running into the same issue as described above: previously submitted code suddenly is not working, and debugging gives me the timeout error. timeout_wait=300 did not solve the issue, but I will try longer values. Here is one of the failed submissions: https://gitlab.aicrowd.com/Miffyli/obstacletower-2019/issues/25 .
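For reference, this is roughly how I pass the timeout through the starter kit's create_single_env helper (the util.py in the traceback above); the 300-second default here is just what I tried, not a recommendation:

```python
from obstacle_tower_env import ObstacleTowerEnv

def create_single_env(path, docker_training=False, timeout_wait=300):
    # Sketch of util.py's helper with timeout_wait exposed as a kwarg.
    # 300 s gives the binary five minutes to come up instead of the
    # default 30 s before UnityTimeOutException is raised.
    return ObstacleTowerEnv(path,
                            docker_training=docker_training,
                            timeout_wait=timeout_wait)
```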
There also seems to be some instability in the evaluation binary that is being used. I will follow up with @harperj and @arthurj to see if we can pinpoint the exact cause.
Looks like that other submission is on its way towards a timeout, as the agent and the evaluation binary do not seem to be able to communicate; both are waiting for the first message. Did you change the default communication ports, by any chance?
We only updated one of the model files after a successful submission. Worker ID is set to zero and environment filename to None.
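Concretely, the environment is created roughly like this on our side (a sketch; as far as I understand, with environment_filename=None the wrapper does not launch a binary itself, it just listens on the default base port plus worker_id and waits for the already-running game to connect):

```python
from obstacle_tower_env import ObstacleTowerEnv

# environment_filename=None: do not launch a binary; wait for the
# evaluator's already-running Obstacle Tower instance to connect.
# worker_id=0 keeps the default communication port (base port + 0).
env = ObstacleTowerEnv(environment_filename=None,
                       worker_id=0,
                       docker_training=True,
                       timeout_wait=300)  # the longer timeout we experimented with
```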
Looks like that submission (29) did eventually manage to run through. Did you happen to re-queue the submission or did it work out on its own? Edit: Looks like you re-queued it.
Edit2: Looks like that submission (29) is now stuck and does not finish correctly ^^’
Edit3: Looks like we figured it out: when testing locally outside the docker images, the OT game ended up stuck on an empty screen (skybox) while our code was also waiting for the OT environment. We first had to launch our agent code and wait until it tried to connect to the OT environment (“Start training by pressing the Play…”), after which we launched the OT env and the evaluation started successfully. The fix for the submission was to modify the agent code to create the OT environment before everything else, including the larger imports (TF, Torch, etc.), and only then proceed with creating/loading the agents.
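In other words, the top of our run script now looks roughly like this (load_agent, my_agent, and the checkpoint path are placeholders for our own code, not part of the starter kit):

```python
# Create the Obstacle Tower environment first, so the Python side is
# already listening when the evaluator launches the game binary.
from obstacle_tower_env import ObstacleTowerEnv

env = ObstacleTowerEnv(environment_filename=None,
                       worker_id=0,
                       docker_training=True)

# Heavy imports and model loading happen only after the connection is up.
import torch  # noqa: E402  (deliberately imported late)

from my_agent import load_agent  # placeholder for our own agent code

agent = load_agent("checkpoints/model.pt", env.action_space)
```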
Maybe we should add an FAQ section to the starter kit. It could begin as a separate markdown file (linked from the README), submitted as a pull request from you!