Same code fails in Run #3 due to timeout but succeeds in Run #4 within 900s

The same code fails in Run #3 due to a timeout but succeeds in Run #4 within 900s.
What could be the reason?
Is there some additional limitation when the time limit of a run is 1020s? If there is no extra limitation, then I have lost many submissions to this weird behavior. Maybe the compute machines underlying the containers are not equivalent, and some are slower than others. Can this be looked into?
Can I also get some help with tracking time in the purchase phase?

The following is mentioned for compute_budget:
The compute_budget argument holds a floating point number representing
the time available (in seconds) for BOTH the pre_training_phase and
the purchase_phase.
Exceeding the time will lead to a TimeOut error.
I have the following questions about the above description:
a) Is it possible to catch the TimeOut error and save the model weights? I think probably not, but then I wonder what the purpose of mentioning the TimeOut error is.
b) How do I track time during training, given that compute_budget also includes the pre-training time, which is not known in advance? Or should I assume some time based on a previous run? That limits the number of experiments I can run in the pre-training phase, since I first have to determine the pre-training time.
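
To make (b) concrete, here is a minimal sketch of one way to track the remaining time. It assumes the timer can be started when the pre_training_phase begins, so that pre-training is counted against the same compute_budget without knowing its duration up front. BudgetTimer, train_with_budget, safety_margin and train_one_epoch are hypothetical names for illustration, not part of the challenge API. Whether the TimeOut error can be caught is not something this sketch answers; it only tries to stop early enough to save the weights before the limit is reached.

```python
import time


class BudgetTimer:
    """Hypothetical helper: tracks elapsed wall-clock time against a budget."""

    def __init__(self, compute_budget, safety_margin=60.0, clock=time.monotonic):
        # Create this at the very start of pre-training so both phases
        # are counted against the same budget.
        self.clock = clock
        self.deadline = clock() + compute_budget - safety_margin

    def remaining(self):
        """Seconds left before the (margin-adjusted) deadline."""
        return self.deadline - self.clock()

    def can_afford(self, estimated_cost):
        """True if a step expected to take estimated_cost seconds still fits."""
        return self.remaining() > estimated_cost


def train_with_budget(timer, train_one_epoch, max_epochs):
    """Run epochs while the slowest epoch seen so far still fits in the budget.

    Returns the number of epochs completed. Save the weights after it returns.
    """
    longest = 0.0
    done = 0
    for _ in range(max_epochs):
        if not timer.can_afford(longest):
            break
        started = timer.clock()
        train_one_epoch()
        longest = max(longest, timer.clock() - started)
        done += 1
    return done
```

Using the slowest epoch so far as the estimate, plus a safety margin, is a guess at how to absorb machine-to-machine speed differences; the margin would need tuning against actual runs.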

Regards
Aditya
