UnityTimeOutException in evaluation

I see this stack trace when I try to submit an agent that worked previously:

```
2019-05-24T17:42:52.825782788Z root
2019-05-24T17:43:05.17870298Z INFO:mlagents_envs:Start training by pressing the Play button in the Unity Editor.
2019-05-24T17:43:35.184856349Z Traceback (most recent call last):
2019-05-24T17:43:35.184901058Z   File "run.py", line 56, in <module>
2019-05-24T17:43:35.184908019Z     env = create_single_env(args.environment_filename, docker_training=args.docker_training)
2019-05-24T17:43:35.184929282Z   File "/home/aicrowd/util.py", line 16, in create_single_env
2019-05-24T17:43:35.184932961Z     env = ObstacleTowerEnv(path, **kwargs)
2019-05-24T17:43:35.184935919Z   File "/srv/conda/lib/python3.6/site-packages/obstacle_tower_env.py", line 45, in __init__
2019-05-24T17:43:35.184939382Z     timeout_wait=timeout_wait)
2019-05-24T17:43:35.184942214Z   File "/srv/conda/lib/python3.6/site-packages/mlagents_envs/environment.py", line 69, in __init__
2019-05-24T17:43:35.184945802Z     aca_params = self.send_academy_parameters(rl_init_parameters_in)
2019-05-24T17:43:35.184948806Z   File "/srv/conda/lib/python3.6/site-packages/mlagents_envs/environment.py", line 491, in send_academy_parameters
2019-05-24T17:43:35.184952019Z     return self.communicator.initialize(inputs).rl_initialization_output
2019-05-24T17:43:35.184954878Z   File "/srv/conda/lib/python3.6/site-packages/mlagents_envs/rpc_communicator.py", line 80, in initialize
2019-05-24T17:43:35.184958142Z     "The Unity environment took too long to respond. Make sure that :\n"
2019-05-24T17:43:35.184963164Z mlagents_envs.exception.UnityTimeOutException: The Unity environment took too long to respond. Make sure that :
2019-05-24T17:43:35.184968225Z 	 The environment does not need user interaction to launch
2019-05-24T17:43:35.184972965Z 	 The Academy and the External Brain(s) are attached to objects in the Scene
2019-05-24T17:43:35.184977708Z 	 The environment and the Python interface have compatible versions.
```

I am definitely using v2.1 of the environment. Not sure what’s going on, but this happened several submissions in a row, and I see that nobody else has successfully submitted in a few days.

@unixpickle: We are looking into it. The only change made recently was updating the env binary to v1.2, and it was tested before being deployed. We acknowledge the problem you mention and are looking into it as we speak.
Hoping to have an update soon on this thread.

@unixpickle: Can you also try a submission where you pass a timeout_wait parameter (https://github.com/Unity-Technologies/obstacle-tower-env/blob/master/obstacle_tower_env.py#L26) during env initialization and set it to something like 900, just to be safe?
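
For reference, a minimal sketch of what that might look like. Only timeout_wait comes from the linked obstacle_tower_env.py; the other parameter names and values are assumptions about the constructor, not official evaluator code:

```python
# Sketch: pass a larger timeout_wait when constructing the environment.
# Parameter names other than timeout_wait are assumptions based on the
# linked obstacle_tower_env.py.
from obstacle_tower_env import ObstacleTowerEnv

env = ObstacleTowerEnv(
    environment_filename=None,  # let the evaluator supply the binary path
    docker_training=True,       # assumed to match the evaluation Docker setup
    worker_id=0,
    timeout_wait=900,           # wait up to 900 s for the Unity process to respond
)
```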

@mohanty Looks like your suggestion worked! The submission runs now.

Looping in @arthurj @harperj : We should figure out a way to nicely override the timeout_wait param from the agent side. Maybe we can expose an environment variable that overrides the timeout_wait parameter, so that participants wouldn't have to set those parameters manually and we can adjust them dynamically based on the current setup of the evaluator.
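
Something along these lines, as a sketch only. The variable name OTC_TIMEOUT_WAIT is invented for illustration; nothing reads it today:

```python
# Hypothetical sketch of the proposal: let the evaluator set an environment variable
# that the agent-side code reads, so participants never hard-code timeout_wait.
# OTC_TIMEOUT_WAIT is an invented name; no such variable exists yet.
import os
from obstacle_tower_env import ObstacleTowerEnv

timeout_wait = int(os.environ.get("OTC_TIMEOUT_WAIT", "30"))  # fall back to the library default
env = ObstacleTowerEnv(environment_filename=None, timeout_wait=timeout_wait)
```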

Spoke too soon. Just did another submission and got another timeout after 998 seconds (my timeout was set to 900). Must be non-deterministic.

@mohanty - would you post the logs of my submission here - https://gitlab.aicrowd.com/joe_booth/obstacle-tower-challenge/issues/121 - I’m not sure if it is the same problem. Thanks!

Looks like I am running into the same issue as described above: previously submitted code suddenly stopped working, and debugging gives me the timeout error. timeout_wait=300 did not solve the issue, but I will try longer values. Here is one of the failed submissions: https://gitlab.aicrowd.com/Miffyli/obstacletower-2019/issues/25 .

Edit: timeout_wait=900 did not help either; still getting the same error :/. https://gitlab.aicrowd.com/Miffyli/obstacletower-2019/issues/26

@Miffyli: I just requeued the submission and it seems to work: https://gitlab.aicrowd.com/Miffyli/obstacletower-2019/issues/26

There also seems to be some instability in the evaluation binary that's being used. I will follow up with @harperj and @arthurj to see if we can pinpoint the exact cause.

Thanks,
Mohanty

@mohanty: That’s odd o:. I have another debug run going on right now, and it seems to be stuck. So far it has been waiting for 500 seconds with no luck: https://gitlab.aicrowd.com/Miffyli/obstacletower-2019/issues/29.

But I will keep trying every now and then to see if it eventually works out. Thanks for the help!

Looks like that other submission is on its way towards a timeout, as the agent and the evaluation binary do not seem to be able to communicate; both are waiting for the first message. Did you change the default communication ports etc. by any chance?

We only updated one of the model files after a successful submission. Worker ID is set to zero and environment filename to None.
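
For context, a rough illustration of why worker_id 0 should mean the default port. The base port value here is an assumption about the ML-Agents defaults, not taken from this thread:

```python
# Rough illustration (assumed ML-Agents convention): the Unity binary listens on
# base_port + worker_id, so worker_id=0 keeps the default communication port.
BASE_PORT = 5005      # assumed mlagents_envs default
worker_id = 0
print("expected communication port:", BASE_PORT + worker_id)
```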

Looks like that submission (29) did eventually manage to run through. Did you happen to re-queue the submission or did it work out on its own? Edit: Looks like you re-queued it.

Edit2: Looks like that submission (29) is now stuck and does not finish correctly ^^’

Edit3: Looks like we figured it out: when testing locally outside the Docker images, the OT game ended up stuck on an empty screen (just the skybox) while our code also waited for the OT environment. We first had to launch our agent code and wait until it tried to connect to the OT environment (“Start training by pressing the Play…”), and only then launch the OT env, after which the evaluation started successfully. The fix for the submission was to modify the agent code to create the OT environment before everything else, including the larger imports (TF, Torch, etc.), and only then proceed with creating/loading agents.
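
Roughly what the fixed submission does at startup. This is a sketch under the assumptions above, not the actual agent code:

```python
# Sketch of the fix: connect to the Obstacle Tower environment before any heavy
# imports or model loading, so the handshake happens well within timeout_wait.
from obstacle_tower_env import ObstacleTowerEnv

# 1. Create the env first; environment_filename=None lets the evaluator supply the binary.
env = ObstacleTowerEnv(environment_filename=None, worker_id=0,
                       docker_training=True, timeout_wait=900)

# 2. Only after the connection is up, pay for the large imports and model loading.
import torch  # noqa: E402 -- deliberately deferred import (or tensorflow, depending on the agent)

# ... create/load the agent and run the evaluation loop here ...
obs = env.reset()
```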

Maybe we should add an FAQ section to the starter kit. It could begin as a separate markdown file (linked from the README), submitted as a pull request from you :wink: !

Cheers,
Mohanty