Hi, if the training does not finish within 2 hours, is the submission considered failed? Or will the evaluation use the last checkpoint saved within the 2 hours?
Hi @xiaocheng_tang,
Right now, the submission is considered failed after the 2-hour timeout.
However, I think being able to use the last checkpoint is a fair request in many scenarios. Let us check with the team and get back to you with a decision on it.
Thanks for the quick reply! I added the time limit to the stop conditions and it works now, e.g., `time_total_s: 7200`.
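For anyone else hitting this, here is a minimal sketch of how such a stop condition can be set in an RLlib-style experiment YAML. The experiment name, algorithm, and environment below are placeholders; only the `time_total_s: 7200` entry under `stop` reflects what's discussed in this thread:

```yaml
# Hypothetical experiment config -- names are illustrative only.
my_experiment:
  run: PPO          # placeholder algorithm
  env: my_env       # placeholder environment
  stop:
    time_total_s: 7200   # stop training after 2 hours of wall-clock time
  checkpoint_freq: 10    # save checkpoints periodically so a recent one exists at timeout
  checkpoint_at_end: true
```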
@jyotish @shivam Are the contest organizers willing to enforce the 2-hour limit by setting time_total_s: 7200 as @xiaocheng_tang suggested, instead of the current time limit the clusters are using? I’m finding that some of my training sessions time out before they’ve had a full two hours to train. For example, my last submission failed at 5883.03 seconds. The clusters have been much busier this past week with more people submitting, and I’m guessing that a lot of time is being burnt on scheduling and overhead. Enforcing the 7200-second limit in the yaml file seems much more consistent and gives everyone the same amount of training time, regardless of how busy the clusters are.
Hello @tim_whitaker
The `time_total_s: 7200` stop condition is already in place during evaluations. We also limit the number of parallel evaluations, based on the resources available, to avoid long waits between evaluations. In addition, there is a hard timeout of 2 hours plus a buffer period to close stuck evaluations. We will check if there are any issues with the scheduling time and adjust the buffer time accordingly.
I believe you are referring to submission #70205. The GitLab updates are failing for that submission, but the evaluation itself is still in progress. We will check why the updates on the GitLab issue page stopped.