So after debugging all the issues with the Dopamine-based submissions, it turns out there were two key issues:
- Not enough resources for the agent containers: agent containers now have access to 10GB of memory, and I managed to get one of the submissions to run! Hopefully this will solve the issue for most users.
- Windows users trying to check large (> 4GB) files into the git repository are hitting this sad bug: https://github.com/git-lfs/git-lfs/issues/2434
Until this bug is fixed upstream, we recommend using Linux or macOS to check your checkpoints into the repository.
ProTip: We noticed many users are checking all their checkpoints into the repository. Now that we are using git-lfs, having the checkpoints in the git history is fine (the history only stores references, not the actual binary blobs). Still, we would urge you to keep only the required checkpoint at the head of your pushed tag (you can still go back in history and recover the older checkpoints if you need them!). Having just the necessary checkpoint at the root of your repository ensures that the docker images built from your repositories are not excessively huge!
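If it helps, here is a rough Python sketch for spotting checkpoint files you could drop before tagging. It assumes the "required" checkpoint is the one with the highest iteration number, and it assumes the iteration number appears in the file name right after `ckpt.`, `ckpt-` or `sentinel_checkpoint_complete.` (for example `ckpt.12` or `tf_ckpt-12.index`). Both are assumptions about your setup, and `stale_checkpoint_files` is just an illustrative helper name, so check the output against your own checkpoint directory. The function only lists files and deletes nothing.

```python
import re
from pathlib import Path

# Assumed naming: the iteration number follows "ckpt." / "ckpt-" or
# "sentinel_checkpoint_complete." in each checkpoint file name.
_ITERATION = re.compile(
    r"(?:ckpt|sentinel_checkpoint_complete)[.-](\d+)(?:\.|$)"
)


def stale_checkpoint_files(checkpoint_dir):
    """List files that belong to any iteration other than the latest one."""
    by_iteration = {}
    for path in Path(checkpoint_dir).iterdir():
        match = _ITERATION.search(path.name)
        if path.is_file() and match:
            by_iteration.setdefault(int(match.group(1)), []).append(path)
    if not by_iteration:
        return []
    latest = max(by_iteration)
    # Files without a recognizable iteration number are never reported.
    return sorted(
        path
        for iteration, paths in by_iteration.items()
        if iteration != latest
        for path in paths
    )
```

You could print the result of `stale_checkpoint_files("checkpoints")` and then `git rm` whatever you agree with. The older files stay recoverable from the git history.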
And finally, another great find by @felixlaumon:
=================================================
Another trick I found is to override agent._replay.load like below to prevent Dopamine from loading the replay buffer. The buffer isn’t necessary for evaluation anyway, so you don’t have to commit huge files such as $store$_observation_ckpt.*.gz
agent._replay.load = lambda _1, _2: True
=================================================
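For anyone wondering why the lambda takes exactly two arguments, here is a small self-contained sketch of the same pattern. `FakeAgent` and `FakeReplay` are hypothetical stand-ins, not Dopamine classes. The only thing assumed about the real agent is what the snippet above implies: it owns a `_replay` object whose `load` is called with two arguments.

```python
class FakeReplay:
    """Hypothetical stand-in for the agent's replay buffer object."""

    def load(self, checkpoint_dir, iteration_number):
        # Simulates what happens when the buffer files were not committed.
        raise FileNotFoundError("replay buffer files are missing")


class FakeAgent:
    """Hypothetical stand-in for an agent that owns a `_replay` object."""

    def __init__(self):
        self._replay = FakeReplay()


agent = FakeAgent()

# Assigning to the instance attribute shadows the class method. The lambda
# is not bound to the instance, so it receives only the two arguments that
# the caller passes, and no `self`.
agent._replay.load = lambda _1, _2: True

result = agent._replay.load("checkpoints", 0)
print(result)
```

Since the override only touches this one instance, it should be applied after the agent is created and before the checkpoint is restored.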