Questions about the evaluation setup

Hi there. Can you elaborate a bit on the evaluation setup? In particular, I’m curious about which conditions will differ in Round 1 with respect to the environment we are given for development. Will it be exactly the same environment (including any parameters in the musculoskeletal model, noise level, etc.), with the only differences being the starting state and target field? Or will something else change?

Also, is there a time limit for computation?


Hi, @luisenp. The evaluation in Round 1 will be done with the latest version of the development environment. When you submit your solution, our server runs multiple simulations with different target velocity fields, and the evaluation score is the mean of the cumulative rewards you receive in those simulations.
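For anyone unsure what "mean of the cumulative rewards" means in practice, here is a minimal sketch of that scoring scheme. Everything here is illustrative: `DummyEnv` is a stand-in for the real environment, and `evaluate` is a hypothetical helper, not part of the competition API.

```python
class DummyEnv:
    """Stand-in for a Gym-style environment with a fixed episode length."""
    def __init__(self, episode_len=3, step_reward=1.0):
        self.episode_len = episode_len
        self.step_reward = step_reward
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # placeholder observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return 0.0, self.step_reward, done, {}


def evaluate(env, policy, n_episodes=4):
    """Run n_episodes and return the mean of the cumulative episode rewards."""
    totals = []
    for _ in range(n_episodes):
        obs = env.reset()
        total, done = 0.0, False
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)
```

In the actual evaluation each simulation uses a different target velocity field, so the per-episode totals would vary; the submitted score is their mean.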

@smsong, BTW, I’ve just gotten a mean reward of more than 100 in my local environment with all default Round 1 properties, but when I submitted to the server environment I got exactly 52 steps on every run and a mean reward of 5 in total.
How is that possible?

Let me know if you need any specific info to investigate.

@andrey_zubkov: Did you figure out the issue? 52 steps means that the human model fell down at 0.52 s, and a total reward of 5 seems about right in that case. The difference between your local environment and the server environment can be in the initial state: the muscle states can be slightly different when a simulation is initiated by env.reset(...). So your controller should be robust enough to overcome the difference in initial muscle state.
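One way to catch this kind of fragility before submitting is to roll out the controller from several slightly perturbed initial states and look at the worst-case cumulative reward, not just the average. The sketch below is purely illustrative: `rollout` is a toy scalar dynamics stand-in, and none of these names belong to the competition API.

```python
import random


def rollout(policy, init_state, horizon=50):
    """Toy rollout: per-step reward is higher the closer the state stays to 0."""
    state, total = init_state, 0.0
    for _ in range(horizon):
        state = 0.9 * state + policy(state)
        total += 1.0 - min(abs(state), 1.0)  # reward in [0, 1] per step
    return total


def robustness_report(policy, n_trials=10, noise=0.1, seed=0):
    """Evaluate the policy from randomly perturbed initial states.

    Returns (worst, mean) cumulative reward over the trials; a large gap
    between the two suggests sensitivity to the initial state.
    """
    rng = random.Random(seed)
    scores = [rollout(policy, rng.uniform(-noise, noise)) for _ in range(n_trials)]
    return min(scores), sum(scores) / len(scores)
```

With the real environment you would instead call env.reset(...) repeatedly and perturb whatever initialization parameters it exposes.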

In addition, based on reports from other participants, it seems there could have been more differences in the server evaluation. For Round 2, we only accept Docker submissions. So please try a Docker submission and let us know if you still get drastically different results.

I’ve investigated: the difference I reported was caused by a mistake in my own observation preprocessing, where I didn’t account for a difference in the ordering of the observation values.
BTW, even after that fix, my local mean score of 140–150 dropped to approximately 20 on the server because of the difference in the initial state of the evaluation environment. I hope there will be no such issue now…

@andrey_zubkov Try a Docker submission for Round 2 and let us know if you still have the issue.