Lately it’s been impossible to get a submission to train and evaluate successfully because of spot instance timeouts. Every time, either some training runs stop before reaching 8M steps or some rollouts fail. On top of that, the submission queue has grown to more than 12 hours.
If this continues, with 7 days remaining, I don’t see how it will be possible for all participants to get at least one good submission with their latest version.
I’d like to suggest the following to the organizers:
- Switch to running on dedicated on-demand instances, not spot instances.
- Limit each participant to 1 submission a day.
As I see it, everyone wins this way. Participants can get at least one good submission a day, rather than spamming 5 submissions and hoping at least one will finish. And the total compute cost shouldn’t increase, because 1 on-demand submission costs about as much as 3 spot submissions.
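
As a rough sanity check on the cost claim, here is a small Python sketch. It uses only the numbers from this post (5 spot submissions a day versus 1 on-demand submission costing about 3 spot submissions) and treats one full spot submission as one cost unit. The function name and the cost unit are just my illustration, not actual pricing.

```python
def daily_cost_in_spot_units(submissions_per_day, cost_per_submission):
    """Daily compute cost for one participant, in units of one spot submission."""
    return submissions_per_day * cost_per_submission

# Assumptions taken from the post: a spot submission is 1 unit,
# and an on-demand submission costs about 3 spot submissions.
SPOT_COST = 1.0
ON_DEMAND_COST = 3.0

current = daily_cost_in_spot_units(5, SPOT_COST)        # 5 spot submissions a day
proposed = daily_cost_in_spot_units(1, ON_DEMAND_COST)  # 1 on-demand submission a day

print(f"current (5 spot/day): {current} units")
print(f"proposed (1 on-demand/day): {proposed} units")
```

Under these assumptions, a participant who currently sends 5 spot submissions a day would use about 3 units instead of 5. This ignores spot runs that are interrupted early and so cost less than a full run, so treat it as an upper-level estimate only.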