I often see my training jobs stop before reaching 8M steps or the 2h limit. They don’t fail (they show up as succeeded), but sometimes they run for only 2M or 4M training steps.
In the Docker logs I see the following message:
MaxRuntimeExceeded - Training job runtime exceeded MaxRuntimeInSeconds provided
Anyone else having this problem?
Hello @jurgisp
Can you share the submission IDs where this happened? At the moment, we are using:
MAX_WAIT_TIME: "12000"
MAX_RUN_TIME: "7460"
MAX_RUN_TIME is the time for which the instance has been running. MAX_WAIT_TIME is the time for which the job has been active (wait time for an instance to become available + MAX_RUN_TIME). If the problem is with MAX_WAIT_TIME, we can increase its value.
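For reference, the MaxRuntimeExceeded message above is SageMaker’s stopping condition kicking in. Here is a minimal sketch of how these two values map onto a SageMaker managed spot training job; this assumes submissions run as managed spot jobs, and the job name, image, role, and bucket below are placeholders rather than the platform’s actual values:

```python
import boto3

# Minimal sketch: MAX_RUN_TIME / MAX_WAIT_TIME mapped onto SageMaker's
# StoppingCondition for a managed spot training job. All names and ARNs
# below are placeholders, not the evaluator's actual configuration.
sagemaker = boto3.client("sagemaker")

sagemaker.create_training_job(
    TrainingJobName="example-training-job",  # placeholder
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/<image>",  # placeholder
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/<role>",  # placeholder
    OutputDataConfig={"S3OutputPath": "s3://<bucket>/output"},  # placeholder
    ResourceConfig={
        "InstanceType": "ml.p3.2xlarge",  # placeholder instance type
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    EnableManagedSpotTraining=True,  # run on spot capacity
    StoppingCondition={
        # MAX_RUN_TIME: time the instance may spend actually training.
        "MaxRuntimeInSeconds": 7460,
        # MAX_WAIT_TIME: training time plus time spent waiting for a spot
        # instance to become available; must be >= MaxRuntimeInSeconds.
        "MaxWaitTimeInSeconds": 12000,
    },
)
```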
Hi @jyotish. I too received this error, in #86264. I’d like to clarify that I’ve set up some code that saves the experience replay buffer along with the checkpoints, because spot-instance resets were messing up the replay buffer and really hurting performance. But the time spent saving and loading checkpoints doesn’t seem to be counted in Ray’s “Time elapsed”. If this is the issue, please suggest a proper solution for getting replay buffers to work correctly with spot instances.
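For context, here is a minimal sketch of the save/load approach described above: persist the replay buffer next to the model checkpoint so a spot-instance restart can restore both. The `agent.save_checkpoint` / `agent.restore_checkpoint` methods and the picklable `buffer` are hypothetical stand-ins, not the actual submission code or a specific Ray API:

```python
import os
import pickle

def save_with_buffer(agent, buffer, checkpoint_dir="checkpoints"):
    """Save the model checkpoint and the replay buffer side by side."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    agent.save_checkpoint(checkpoint_dir)  # hypothetical agent API
    with open(os.path.join(checkpoint_dir, "replay_buffer.pkl"), "wb") as f:
        pickle.dump(buffer, f)  # assumes the buffer is picklable

def load_with_buffer(agent, checkpoint_dir="checkpoints"):
    """Restore the checkpoint and return the saved buffer, if any."""
    agent.restore_checkpoint(checkpoint_dir)  # hypothetical agent API
    path = os.path.join(checkpoint_dir, "replay_buffer.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None  # start with a fresh buffer if nothing was saved yet
```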
Thanks @jyotish. I think it’s a problem with MAX_WAIT_TIME: when the AWS spot market is busy, a restarted job can wait a long time before it gets an EC2 instance. So it would be good to increase it.
Although lately it’s been OK.
It’s happening again now, for example on hash ba3be1d72057541a70112017f6cb7601685551b0.
I too received it again on #87296, in gemjourney and hovercraft.