Updates for starter kit

dipam · October 1, 2021, 6:22pm

Hi Everyone,

We’ve made some changes to the Docker image and starter kit so that users no longer hit errors related to the ZMQ connection with Docker. These should reduce the hassle of working with deepracer-gym and the Docker image. We encourage everyone to adopt these changes.

Pull the latest release of the Docker image

docker pull aicrowd/base-images:deepracer_round1_release

Pull the latest version of the starter kit and replace the deepracer-gym folder wherever you’re currently working with it.

Reinstall deepracer-gym

cd neurips-2021-aws-deepracer-starter-kit/deepracer-gym
pip install -e .

Feel free to reach out with any further issues.

notnanton · October 4, 2021, 12:04am

Hi, thanks for the update! It was running quite smoothly until some hours ago. Now I cannot connect to the docker env at all.

This is the error message of the docker container. I would be happy if you can help me out:

AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...                                                                                                                                                   
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...                                                                                                                                                   
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...                                                                                                                                                   
=================== Gym Client Ready! ===================                                                                                                                                                 
## Created agent: agent                                                                                                                                                                                   
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...                                                                                                                                                   
## Stop physics after creating graph                                                                                                                                                                      
## Creating session                                                                                                                                                                                       
2021-10-03 22:45:47.197613: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA                             
Reset agent                                                                                                                                                                                               
Reset agent finished                                                                                                                                                                                      
{"simapp_exception": {"date": "2021-10-03 22:45:49.716887", "function": "deepracer_racetrack_env.py::_update_state::186", "message": "Unclassified exception: list indices must be integers or slices, not list", "exceptionType": "simulation_worker.exceptions", "eventType": "system_error", "errorCode": "500"}}
ERROR: FAULT_CODE: 0                                                                                                                                                                                      
simapp_exit_gracefully: simapp_exit--1                                                                                                                                                                    
Terminating simapp simulation...                                                                                                                                                                          
simapp_exit_gracefully - callstack trace=Traceback (callstack)                                                                                                                                            
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main                                                                                                                                    
    "__main__", mod_spec)                                                                                                                                                                                 
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code                                                                                                                                               
    exec(code, run_globals)                                                                                                                                                                               
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/rollout_worker.py", line 558, in <module>                                                                               
    rollout_entry()                                                                                                                                                                                       
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/rollout_worker.py", line 538, in rollout_entry                                                                          
    main()                                                                                                                                                                                                
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/rollout_worker.py", line 532, in main                                                                                   
    unpause_physics=unpause_physics                                                                  
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/rollout_worker.py", line 200, in rollout_worker                                                                         
    graph_manager.act(act_steps, wait_for_full_episodes=graph_manager.agent_params.algorithm.act_for_full_episodes)                                                                                       
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/multi_agent_coach/multi_agent_graph_manager.py", line 438, in act                                                       
    done = self.top_level_manager.step(None)                                                         
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/multi_agent_coach/multi_agent_level_manager.py", line 232, in step                                                      
    for action_info in action_infos])                                                                
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/multi_agent_coach/multi_agent_environment.py", line 185, in step                                                        
    self._update_state()                                                                             
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/environments/deepracer_racetrack_env.py", line 186, in _update_state                                                    
    SIMAPP_EVENT_ERROR_CODE_500)                                                                     
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/log_handler/exception_handler.py", line 74, in log_and_exit                                                             
    s3_crash_status_file_name=s3_crash_status_file_name)                                             
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/log_handler/exception_handler.py", line 179, in simapp_exit_gracefully                                                  
    callstack_trace = ''.join(traceback.format_stack())

simapp_exit_gracefully - exception trace=Traceback (most recent call last):
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/environments/deepracer_racetrack_env.py", line 140, in _update_state
    self.action_list)]
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/environments/deepracer_racetrack_env.py", line 139, in <listcomp>
    [self._agents_info_map.update(agent.update_agent(action)) for agent, action in zip(self.agent_list,
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/agents/agent.py", line 79, in update_agent
    return self._ctrl_.update_agent(action)
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/agent_ctrl/rollout_agent_ctrl.py", line 564, in update_agent
    self._data_dict_, action, self._model_metadata_.get_action_dict(action),
  File "/opt/amazon/install/sagemaker_rl_agent/lib/python3.6/site-packages/markov/boto/s3/files/model_metadata.py", line 100, in get_action_dict
    return self._model_metadata[ModelMetadataKeys.ACTION_SPACE.value][action]
TypeError: list indices must be integers or slices, not list

simapp_exit_gracefully - skipping s3 upload.
simapp_exit_gracefully - Job type is SageOnly. Killing SimApp and Training jobs by PID
simapp_exit_gracefully - Waiting for simapp and training job to come up.
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
simapp_exit_gracefully - Waiting for simapp and training job to come up.
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
simapp_exit_gracefully - Waiting for simapp and training job to come up.
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
simapp_exit_gracefully - Waiting for simapp and training job to come up.
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
AgentsVideoEditor._mp4_queue['0'] is empty. Retrying...
simapp_exit_gracefully - Stopped waiting. SimApp Pid Exists=True, Training Pid Exists=False.
+ exit
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 12268 requests (789 known processed) with 0 events remaining.

This is the error message I get in the terminal:

  File "/home/anton/deepracers/agents/ppo_agent.py", line 96, in reset
    observations = self.env.reset()
  File "/home/anton/deepracers/deepracer-gym/deepracer_gym/envs/deepracer_gym_env.py", line 11, in reset
    observation = self.deepracer_helper.env_reset()
  File "/home/anton/deepracers/deepracer-gym/deepracer_gym/zmq_client.py", line 49, in env_reset
    self.obs = self.zmq_client.recieve_response()
  File "/home/anton/deepracers/deepracer-gym/deepracer_gym/zmq_client.py", line 25, in recieve_response
    packed_response = self.socket.recv()
  File "zmq/backend/cython/socket.pyx", line 781, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 817, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 191, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 186, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

nimishsantosh · October 4, 2021, 8:11am

Hey @notnanton, judging from this line:

return self._model_metadata[ModelMetadataKeys.ACTION_SPACE.value][action]
TypeError: list indices must be integers or slices, not list

it looks like action is a list instead of an int. I’m not sure how to reproduce this, short of the slim chance that the action passed to the docker via the zmq client gets modified along the way. Just to be clear, can you let me know what your agent passed as the action to env.step()?
And if there’s no error on your end, do let me know any steps to reproduce this.

Don’t worry about the error in the client-side terminal; it just timed out because the sim running in the docker failed.

notnanton · October 4, 2021, 9:34am

Thanks, you are right! I was passing my action wrapped in a list in the wrong spot because of my use of vectorized environments.

I guess I got confused because the error wasn’t raised in my code, so I thought the docker container must be at fault.
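For anyone hitting the same TypeError, here is a minimal sketch of a client-side guard that unwraps a single discrete action before it goes to env.step(). The helper name to_discrete_action is hypothetical and not part of deepracer-gym; it only assumes the env expects one integer action index, as the traceback above suggests.

```python
import numbers


def to_discrete_action(action):
    """Return a plain int from an action that may be wrapped in containers.

    Hypothetical helper, not part of deepracer-gym.
    """
    # Vectorized code often yields [3] or [[3]] for a single environment.
    while isinstance(action, (list, tuple)):
        if len(action) != 1:
            raise ValueError(f"expected exactly one action, got {len(action)}")
        action = action[0]
    # NumPy scalars and one-element arrays expose .item(); plain ints do not.
    if hasattr(action, "item"):
        action = action.item()
    # Reject floats, bools, strings and anything else that is not an integer.
    if isinstance(action, bool) or not isinstance(action, numbers.Integral):
        raise TypeError(f"action must be an integer index, got {type(action).__name__}")
    return int(action)
```

Calling env.step(to_discrete_action(action)) would then raise the error in your own code instead of inside the docker container.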

azam_kamranian · October 5, 2021, 1:17pm

Have obstacles been added to the new docker image? I can’t see any obstacles in my env!!
How can I add some to my env?

asche_thor · October 13, 2021, 2:19pm

Hi, it seems there is a weird bug: sometimes, if my agent leaves the track, it simply gets reset onto the track, but the done flag is not set to True and there is no negative reward either… has anyone else had the same experience with this new version?

azam_kamranian · October 13, 2021, 2:32pm

Yeah, the done flag won’t be set if your agent goes off the road. I believe it is only True if you successfully complete a loop. You should modify the reward for your agent.

asche_thor · October 15, 2021, 3:09pm

Hmm, that’s weird. I usually get a done flag if my agent leaves the road. Somehow, though, this bug only appears when I’m connecting multiple agents to the same server. So maybe they cause some interference there?

asche_thor · October 17, 2021, 9:03am

I found another weird behavior of the env.reset() function: according to the OpenAI Gym specification, env.reset() should return the initial observation of a new episode. However, if we call env.reset() after an episode has ended, it returns the last observation of the previous episode instead of the initial observation of the next one. Furthermore, the data format of the observations returned by env.reset() differs from that of the observations returned by env.step(action). Is anyone else seeing the same problems, or is there a misunderstanding on my side? Thx
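A minimal sketch of how the format mismatch could be checked, assuming the older Gym API where env.reset() returns an observation and env.step(action) returns (observation, reward, done, info). The helper names describe and reset_matches_step are hypothetical, and nothing here assumes a particular observation layout for the DeepRacer env.

```python
def describe(obs):
    """Summarize an observation's structure: keys, container sizes, array shapes."""
    if isinstance(obs, dict):
        return {key: describe(value) for key, value in sorted(obs.items())}
    if isinstance(obs, (list, tuple)):
        return (type(obs).__name__, len(obs))
    # Array-like objects (e.g. NumPy arrays) expose .shape; scalars do not.
    shape = getattr(obs, "shape", None)
    return (type(obs).__name__, tuple(shape) if shape is not None else None)


def reset_matches_step(env, action):
    """Return True if reset() and step() observations share the same structure."""
    first_obs = env.reset()
    next_obs, _reward, _done, _info = env.step(action)
    return describe(first_obs) == describe(next_obs)
```

Running reset_matches_step(env, some_valid_action) before and after pulling a new docker image gives a quick way to see whether the two formats agree.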

dipam_chakraborty · October 17, 2021, 2:53pm

Hi @asche_thor

This was an issue in an older version but has since been fixed. Are you on the latest version of the env docker release?

Can you please pull the latest docker image and check again?

darren_broderick · October 24, 2021, 1:10pm

Is there a way to put in our reward function, hyper params and action space?

dipam · October 25, 2021, 5:02am

Hi @darren_broderick

You can change the reward by using a wrapper. Hyperparams aren’t tied to the env, so you can change them as you wish. No changes to the action space are allowed for this round, and that will probably stay the same for the next round as well.

darren_broderick · October 25, 2021, 6:46am

Thank you.

What do you mean by wrapper?

dipam · October 27, 2021, 5:39am

Hi @darren_broderick

This should give a good introduction to gym wrappers.
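As a rough illustration of the idea, here is a dependency-free sketch of a reward wrapper: an object that holds the original env, forwards reset() and step() to it, and swaps in a custom reward on the way back. The class name RewardShapingWrapper and the shape_fn argument are hypothetical, and the sketch assumes step() returns (observation, reward, done, info). Gym’s own gym.RewardWrapper offers the same pattern by overriding its reward() method.

```python
class RewardShapingWrapper:
    """Wrap an env and replace its reward with shape_fn(obs, reward, done, info).

    Hypothetical sketch; not part of the starter kit.
    """

    def __init__(self, env, shape_fn):
        self.env = env
        self.shape_fn = shape_fn

    def reset(self, **kwargs):
        # Observations pass through unchanged.
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Only the reward is modified; everything else is returned as-is.
        return obs, self.shape_fn(obs, reward, done, info), done, info

    def __getattr__(self, name):
        # Forward other attributes (action_space, close, ...) to the wrapped env.
        if name == "env":
            raise AttributeError(name)
        return getattr(self.env, name)
```

Usage would look like env = RewardShapingWrapper(env, my_reward), where my_reward is your own function taking (obs, reward, done, info) and returning a float. The action space stays untouched, which fits the rule above.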