Have you ever successfully run 10M steps without resetting env?

tky · November 12, 2019, 11:38am

I’ve been running the intrinsic phase locally without resetting the environment, since env reset is not allowed on the evaluation server. However, I found that my script dies without any error.

I observed memory usage increasing even when running RandomPolicy, so I assume there is a memory issue in the environment as the number of steps in a single episode grows.
I also suspect this is what makes the evaluation process stall or even hit a timeout error. (My RandomPolicy submission has been stuck around 2M steps for a few days now.)
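
One way to check whether memory really grows with the step count is to sample heap usage at a fixed interval. The sketch below is only an illustration using the standard-library `tracemalloc` module; `run_with_memory_log` and `leaky_step` are hypothetical names, and `leaky_step` is a stand-in for one environment step, not the actual environment. Note that `tracemalloc` only sees memory allocated through Python's allocator, so growth inside a native simulator would need an OS-level tool instead.

```python
import tracemalloc


def run_with_memory_log(step_fn, n_steps, log_every):
    """Call step_fn(i) n_steps times and sample Python heap usage.

    Returns a list of (step, current_bytes) tuples, one per log_every steps.
    """
    tracemalloc.start()
    samples = []
    try:
        for i in range(1, n_steps + 1):
            step_fn(i)
            if i % log_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((i, current))
    finally:
        tracemalloc.stop()
    return samples


# Hypothetical stand-in for a step that keeps every observation around.
stored = []


def leaky_step(i):
    stored.append(bytearray(1024))  # 1 KiB retained per step


samples = run_with_memory_log(leaky_step, n_steps=1000, log_every=250)
for step, used in samples:
    print(f"step {step}: {used} bytes traced")
```

If the traced size keeps climbing at a steady rate per step, something in the Python code is holding on to data from every step.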

Is anyone else facing a similar situation?
Or is this just my problem?

ec_ai · November 12, 2019, 12:10pm

@tky, I did run the simulation up to 10M steps on a 64GB RAM machine many weeks ago.
I didn’t notice whether the simulation itself had increasing memory usage. However, I did notice that if you store all the observations (images in particular), you are likely to have memory problems.
As an example, retina images for just 100k steps result in a 23GB file when saved as .npy.
Contacts and joints are easier to handle, but for 10M steps they still result in 720MB and 320MB files.
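
To put those figures on a per-step basis, here is a rough back-of-the-envelope calculation. It assumes decimal units (1GB = 10^9 bytes, 1MB = 10^6 bytes) and that file size scales linearly with the number of steps; the helper name `bytes_per_step` is just for illustration.

```python
GB = 10**9  # assumption: decimal gigabytes
MB = 10**6  # assumption: decimal megabytes


def bytes_per_step(total_bytes, n_steps):
    """Average storage cost of one step, given a total file size."""
    return total_bytes / n_steps


retina = bytes_per_step(23 * GB, 100_000)         # retina images, 100k steps
contacts = bytes_per_step(720 * MB, 10_000_000)   # contacts, 10M steps
joints = bytes_per_step(320 * MB, 10_000_000)     # joints, 10M steps

# Extrapolate the retina images to a full 10M-step run, in terabytes.
retina_10m_tb = retina * 10_000_000 / 10**12

print(f"retina:   {retina:.0f} bytes/step")
print(f"contacts: {contacts:.0f} bytes/step")
print(f"joints:   {joints:.0f} bytes/step")
print(f"retina for 10M steps: about {retina_10m_tb:.1f} TB")
```

Under these assumptions the images cost about 230KB per step, so keeping all of them for 10M steps would take roughly 2.3TB, far beyond the 64GB of RAM mentioned above.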

Did you run just RandomPolicy as is, without saving anything?
(i.e. just downloaded the repository and ran the local evaluation for 10M steps)

EDIT: I have now launched a 10M-step local evaluation with a fresh copy of the repository… I will let you know how it goes.

tky · November 12, 2019, 1:25pm

Ah, you’re right. It was definitely because of storing observations in my local code. (Should have noticed before asking this question.)
But I’m still wondering why my evaluation is stuck. The code in my GitLab repository at this point doesn’t store anything (RandomPolicy as is). It seems like no submission has finished successfully yet (I see no entry for round 2).
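
For the local side of the problem (memory growing because every observation is stored), one possible workaround is to keep only the most recent observations in a fixed-size buffer. This is a minimal sketch using `collections.deque`; the class name `RecentObservations` and the dictionary used as an observation are hypothetical and not part of the competition code.

```python
from collections import deque


class RecentObservations:
    """Keeps only the last max_items observations; older ones are dropped."""

    def __init__(self, max_items):
        self._buffer = deque(maxlen=max_items)

    def add(self, obs):
        # Once the buffer is full, appending discards the oldest entry.
        self._buffer.append(obs)

    def latest(self):
        return self._buffer[-1]

    def __len__(self):
        return len(self._buffer)


recent = RecentObservations(max_items=100)
for step in range(10_000):
    recent.add({"step": step})  # placeholder for a real observation

print(len(recent), recent.latest()["step"])
```

Memory use then stays bounded by `max_items` regardless of how many steps the episode runs.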

ec_ai · November 12, 2019, 1:27pm

@tky
I am making a new submission myself and I will contact @shivam to investigate further.

shivam · November 18, 2019, 2:05am

Hi, we pushed multiple fixes about 12 hours ago to remove the network dependency in the real robots evaluations, so that it doesn’t affect any running submission.

We have requeued the last submission from every username, and scores should be available soon.

Addition: Memory isn’t the problem I see in the evaluations. For context, after 2-3M steps, memory usage is 4.64G for @tky’s solution and 1G for @ec_ai’s (sample submission).