Anyone else getting MaxRuntimeExceeded in training?

I often see my training jobs stop before finishing 8M steps or 2 hours. They don't fail and show up as succeeded, but sometimes they have only run for 2M or 4M training steps.

In the Docker logs I see the following message:

MaxRuntimeExceeded - Training job runtime exceeded MaxRuntimeInSeconds provided

Anyone else having this problem?

Hello @jurgisp

Can you share the submission IDs where this happened? At the moment, we are using:

MAX_WAIT_TIME: "12000"
MAX_RUN_TIME: "7460" 

MAX_RUN_TIME is the time for which the instance has been running. MAX_WAIT_TIME is the time for which the job has been active (the wait time for an instance to become available + MAX_RUN_TIME). If the problem is with MAX_WAIT_TIME, we can increase its value.
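
In case it helps with debugging, here is a minimal sketch of how these values presumably map onto SageMaker's StoppingCondition (the MaxRuntimeExceeded message in your logs is SageMaker's). The job name below is a placeholder and the exact wiring in our pipeline may differ:

import boto3

# Assumed mapping (illustrative): the config values above are what gets passed
# to SageMaker as the StoppingCondition for managed spot training jobs.
MAX_RUN_TIME = 7460     # seconds the instance is allowed to actually run
MAX_WAIT_TIME = 12000   # total seconds, including waiting for a spot instance

stopping_condition = {
    "MaxRuntimeInSeconds": MAX_RUN_TIME,
    "MaxWaitTimeInSeconds": MAX_WAIT_TIME,  # only applies to managed spot training
}

# Inspecting an already-submitted job (job name is a placeholder):
sagemaker = boto3.client("sagemaker")
job = sagemaker.describe_training_job(TrainingJobName="your-training-job-name")
print(job["StoppingCondition"])
print(job.get("TrainingTimeInSeconds"), job.get("BillableTimeInSeconds"))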

Hi @jyotish, I too received this error in #86264. I'd like to clarify that I've set up some code that saves the experience replay buffer along with the checkpoints, because spot-instance resets were messing up the replay buffer and really hurting performance. But the checkpoint saving and loading time doesn't seem to be counted in Ray's "Time elapsed", so the wall-clock runtime may be longer than what Ray reports. If that extra time is the issue, please suggest a proper solution for getting replay buffers to work correctly with spot instances.
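
Roughly what my saving/restoring looks like (simplified, with illustrative names and paths, not my exact code): the buffer is pickled next to the model checkpoint so that a restarted spot instance can restore both instead of starting with an empty buffer. Serializing a large buffer takes a while, which is where the extra wall-clock time goes.

import os
import pickle

def save_checkpoint(checkpoint_dir, model_state, replay_buffer):
    # Persist the replay buffer alongside the model state so a restart
    # restores both.
    os.makedirs(checkpoint_dir, exist_ok=True)
    with open(os.path.join(checkpoint_dir, "model_state.pkl"), "wb") as f:
        pickle.dump(model_state, f)
    with open(os.path.join(checkpoint_dir, "replay_buffer.pkl"), "wb") as f:
        pickle.dump(replay_buffer, f)

def load_checkpoint(checkpoint_dir):
    model_path = os.path.join(checkpoint_dir, "model_state.pkl")
    buffer_path = os.path.join(checkpoint_dir, "replay_buffer.pkl")
    if not os.path.exists(model_path):
        return None, None  # fresh start: no checkpoint yet
    with open(model_path, "rb") as f:
        model_state = pickle.load(f)
    replay_buffer = None
    if os.path.exists(buffer_path):
        with open(buffer_path, "rb") as f:
            replay_buffer = pickle.load(f)
    return model_state, replay_buffer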

Thanks @jyotish. I think it's a problem with MAX_WAIT_TIME: when the AWS spot market is busy, the job doesn't get an EC2 instance for a long time after being restarted. So perhaps it would be good to increase its value.

Although lately it’s been ok.

It's happening again now, for example with submission hash ba3be1d72057541a70112017f6cb7601685551b0.

I too received it again on #87296, in gemjourney and hovercraft.