Spot instance issues

In one of my submissions, @jyotish mentioned that submissions are trained on AWS spot instances.
This means that an instance can shut down at any time.
Model checkpoints are saved periodically, and after an interruption training is restored from the latest one.

The problem is that many important training variables and data (e.g. the contents of a replay buffer) are not stored in the checkpoints. This leads to unstable training, with results that vary depending on how lucky you are in not being interrupted.

Here are a couple of plots of the mean training return for one of my submissions. The long stretched lines connect data points that are far apart (the agent being interrupted and then restored).


What does the AIcrowd team recommend here? Storing the replay buffer and any other important variables whenever checkpointing?
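To make the question concrete, here is a minimal, framework-agnostic sketch of what "storing the replay buffer and other important variables" could look like. It is not the organizers' recommendation and does not use Ray's checkpoint API. The function names, the choice of fields (buffer, timestep counter, RNG state) and the use of a `deque` as the buffer are all assumptions made for illustration.

```python
import os
import pickle
import tempfile
from collections import deque


def save_training_state(path, replay_buffer, timesteps, rng_state):
    """Save the state a model-only checkpoint would lose (hypothetical helper)."""
    state = {
        "replay_buffer": list(replay_buffer),
        "buffer_maxlen": replay_buffer.maxlen,
        "timesteps": timesteps,
        "rng_state": rng_state,
    }
    # Write to a temporary file and rename it afterwards, so that an
    # interruption in the middle of the write cannot leave a truncated file
    # under the final name.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def load_training_state(path):
    """Rebuild the buffer and counters saved by save_training_state."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    buffer = deque(state["replay_buffer"], maxlen=state["buffer_maxlen"])
    return buffer, state["timesteps"], state["rng_state"]
```

The temporary-file-then-rename step matters on spot instances in particular, since the shutdown can arrive while the checkpoint itself is being written.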


Hi @jyotish

Any suggestions on how to save the replay buffer properly with Ray?

Any suggestions on this? Is there any persistent storage we can use to avoid losing data when a spot instance is interrupted?

Hi @xiaocheng_tang, I worked around it by saving the replay buffer (compressed) as part of the checkpoint, but this seems to cause a new error: MaxRuntimeExceeded.

Waiting for the organizers to provide a proper fix.
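For anyone who wants to try the same workaround, below is a rough sketch of writing a replay buffer to disk with compression, using only the Python standard library. This is not my exact code, and the function names and the `compresslevel` default are assumptions. A low compression level is used here because writing a large buffer takes time. Whether that is enough to stay within the runtime limit is untested.

```python
import gzip
import os
import pickle


def save_buffer_compressed(path, transitions, compresslevel=1):
    """Pickle the transitions into a gzip file (hypothetical helper).

    compresslevel=1 favours speed over size; 9 is the smallest but slowest.
    """
    tmp_path = path + ".tmp"
    with gzip.open(tmp_path, "wb", compresslevel=compresslevel) as f:
        pickle.dump(transitions, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Rename only after the write has finished, so a partial file is never
    # picked up on restore.
    os.replace(tmp_path, path)


def load_buffer_compressed(path):
    """Read back a buffer written by save_buffer_compressed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)
```

It would be called from wherever the checkpoint is written, e.g. `save_buffer_compressed(os.path.join(checkpoint_dir, "buffer.pkl.gz"), buffer)`, where `checkpoint_dir` and `buffer` stand in for whatever your trainer provides.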

I'm having the same issue.

@jyotish, could you clarify whether the final evaluation (3 runs, all 16 environments) will also use AWS spot instances? Or can we assume that it will run on more reliable instances, without restarts?