[Admin] Good News for Dopamine Lovers!

#1

So after debugging the issues with the Dopamine-based submissions, it turns out there were two key issues:

  • Not enough resources for the agent containers: agent containers now have access to 10 GB of memory, and I managed to get one of the submissions running! So hopefully this will solve the issue for most users.
  • Windows users trying to check large (> 4 GB) files into the git repository are hitting this sad bug: https://github.com/git-lfs/git-lfs/issues/2434
    Until this bug is fixed upstream, we recommend using Linux or macOS to check your checkpoints into the repository.

ProTip: We noticed many users are checking all their checkpoints into the repository. Now that we are using git-lfs, having the checkpoints in the git history is fine (it only stores references in the history, not the actual binary blobs). Still, we would urge you to keep only the required checkpoint at the head of your pushed tag (you can always go back in history and recover the older checkpoints if you need them! :smile:). Having just the necessary checkpoint at the root of your repository ensures that the docker images built from your repository are not excessively huge!
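To make the ProTip concrete, here is a small sketch driven from Python (assuming git is installed; the repository path and checkpoint filenames are invented for illustration). It shows that a checkpoint removed from the head is still recoverable from history:

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    # Helper: run a git command in the given repo and return its stdout.
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout

# Toy repository standing in for a submission repo; filenames are invented.
repo = tempfile.mkdtemp()
git("init", cwd=repo)
git("config", "user.email", "you@example.com", cwd=repo)
git("config", "user.name", "you", cwd=repo)

# Commit an old checkpoint, then a newer one.
open(os.path.join(repo, "ckpt_old.bin"), "w").write("old weights")
git("add", ".", cwd=repo)
git("commit", "-m", "old checkpoint", cwd=repo)
open(os.path.join(repo, "ckpt_new.bin"), "w").write("new weights")
git("add", ".", cwd=repo)
git("commit", "-m", "new checkpoint", cwd=repo)

# Keep only the checkpoint you need at the head of the tag you push.
git("rm", "ckpt_old.bin", cwd=repo)
git("commit", "-m", "drop stale checkpoint from head", cwd=repo)
git("tag", "submission-v1", cwd=repo)

# The old checkpoint is gone from the head of the tag...
assert not os.path.exists(os.path.join(repo, "ckpt_old.bin"))
# ...but still recoverable from history, as the ProTip notes.
assert git("show", "HEAD~1:ckpt_old.bin", cwd=repo) == "old weights"
```

In a real submission repo using git-lfs, the older commits keep only lightweight pointers, so both the repository head and the docker image built from it stay small.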

And finally, another great find by @felixlaumon:

=================================================
Another trick I found is to override agent._replay.load as below, to prevent Dopamine from loading the replay buffer. The buffer isn’t necessary for evaluation anyway, and you don’t have to commit huge files such as $store$_observation_ckpt.*.gz

agent._replay.load = lambda _1, _2: True

=================================================


#2

Another trick I found is to override agent._replay.load as below, to prevent Dopamine from loading the replay buffer. The buffer isn’t necessary for evaluation anyway, and you don’t have to commit huge files such as $store$_observation_ckpt.*.gz

agent._replay.load = lambda _1, _2: True
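To see why that one-liner works, here is a self-contained sketch with stand-in classes (FakeReplayBuffer and FakeAgent are invented for illustration and only mimic the shape of Dopamine's objects): assigning a function to the instance attribute shadows the bound load method, and the lambda's two parameters absorb the two arguments the caller passes.

```python
class FakeReplayBuffer:
    # Stand-in for Dopamine's replay buffer; invented for this demo.
    def load(self, checkpoint_dir, iteration_number):
        # The real method reads the huge $store$_*_ckpt.*.gz files from disk.
        raise IOError("replay buffer checkpoint files not found")

class FakeAgent:
    def __init__(self):
        self._replay = FakeReplayBuffer()

agent = FakeAgent()

# The trick from the thread: the instance attribute shadows the bound
# method, and the lambda's parameters (_1, _2) absorb the
# (checkpoint_dir, iteration_number) arguments the caller passes.
agent._replay.load = lambda _1, _2: True

# What used to raise IOError now succeeds without touching disk.
assert agent._replay.load("/tmp/checkpoints", 199) is True
```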

#3

In which script do you override that? I am having difficulty uploading my 4.6 GB observation file, but as of now my agent doesn’t seem to perform as if it were trained unless it loads the observation data.

Edit:
Searching across the project wasn’t working for me; I found it in the last method, “unbundle”, of dqn_agent.py.
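Since the "unbundle" method is what ends up calling _replay.load, the override has to be installed after the agent object is constructed but before checkpoint restoration runs. A rough sketch with stand-in classes (names invented; they only mimic the shape of Dopamine's agent, where the runner triggers unbundle while resuming from a checkpoint):

```python
class FakeReplayBuffer:
    # Invented stand-in: the real buffer reads multi-GB observation files.
    def load(self, checkpoint_dir, iteration_number):
        raise IOError("missing $store$_observation_ckpt files")

class FakeAgent:
    # Invented stand-in mirroring the shape of a Dopamine agent whose
    # unbundle step loads the replay buffer last.
    def __init__(self):
        self._replay = FakeReplayBuffer()

    def unbundle(self, checkpoint_dir, iteration_number, bundle_dictionary):
        # ... restore network weights from bundle_dictionary ...
        self._replay.load(checkpoint_dir, iteration_number)
        return True

agent = FakeAgent()

# Install the override here: after the agent is constructed, but before
# anything calls unbundle.
agent._replay.load = lambda _1, _2: True

# unbundle now completes without the huge observation files on disk.
assert agent.unbundle("/tmp/checkpoints", 199, {}) is True
```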
