Reasons for submission failures

Today my new submissions were evaluated successfully, but unexpected failures waste attempts and could hinder participants on the last day of the challenge.

@StefanUhlich @dipam
I think a separate counter for unsuccessful submissions should be enabled.

@alina_porechina @subatomicseer

I’ll investigate. Can you please share the respective submission IDs?

Take a look at the failed #210059 and #210078 (there is no need to restart them). The failed #210078 and the successful #210117 are identical.

Mine was #209954. No need to restart it either, but it would be nice to know what happened.

I seem to be having a similar issue. Submission: #210149 (music demixing leaderboard C)

@dipam

Mine has been stuck at demixing 3/27 for 2 hours now. I also used up a submission, since debug was false in aicrowd.json.

My submission also failed after 2.5 hours despite an inference speed of 2.39x. It had previously succeeded twice in a row; the only difference is that debug was set to false instead of true in aicrowd.json.

submission_hash: 81a881bc88d002c09e600b5496fb1c98f59d3ad6
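
For anyone checking this before submitting: the flag lives in aicrowd.json at the repository root. Here is a minimal pre-submission guard as a sketch; the boolean debug field is as described above, and the quota behaviour is inferred from this thread rather than documented here.

```python
import json

# Refuse to submit a full run by accident. Assumes aicrowd.json sits at
# the repository root and carries a boolean "debug" field, as discussed
# in this thread.
with open("aicrowd.json") as f:
    config = json.load(f)

if not config.get("debug", False):
    raise SystemExit(
        "debug is false: this submission will run on the full test set "
        "and should count against the submission quota."
    )
print("debug is true: this should be a short debug run.")
```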

@dipam

#210197 failed at 3/27, speed 0.769x

@alina_porechina, unfortunately I haven’t found the issue on our side yet. I’m looking into it, and it should be resolved soon.

Can you let me know whether you expect your model’s speed to be consistent for every song, or whether it can vary? The evaluator checks that every prediction is above the speed constraint.
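
Put differently, the check is applied per prediction, roughly as in the sketch below. Only the per-song pass/fail behaviour comes from this thread; the function name, the reading of speed as a real-time factor, and the threshold value are assumptions.

```python
# Rough sketch of the per-prediction speed check described above.
SPEED_CONSTRAINT = 0.5  # placeholder; the actual value is not stated here

def prediction_fast_enough(audio_duration_s: float, wall_time_s: float) -> bool:
    # Real-time factor: seconds of audio demixed per second of wall-clock
    # time (the "2.39x" / "0.769x" figures quoted in this thread).
    return audio_duration_s / wall_time_s >= SPEED_CONSTRAINT
```

Under such a check, a model whose speed hovers around the constraint can pass some songs and fail others, which is why a consistent speed matters.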

@dipam

My understanding is that the speed of my model is constant.

#210197 (2/2 - 0.771x, 3/27 - 0.769x) may have failed due to low speed.

It is less likely that #210059 failed due to low speed (2/2 - 0.926x, 9/27 - 0.913x).

#210078 and #210117 are identical. #210078 returned “Evaluation timed out” during the validation phase. #210117 was evaluated quickly and successfully (2/2 - 3.369x, 27/27 - 3.479x).

I think it would be better to enable the second counter than to chase the causes of rare crashes.

@dipam

Today I successfully resubmitted the code from the failed #210197 (2/2 - 0.771x, 3/27 - 0.769x) as #210246 (2/2 - 0.757x, 27/27 - 0.749x).

I’ve made some changes to how the cloud instances are provisioned for inference; the failures shouldn’t occur now. However, if you do observe one again, please let me know.

#210438
After demixing I got the message: “Scoring failed. Please contact the admins.”

#210461
After successful demixing (2/2 - 1.73x, 27/27 - 1.749x) I got the message “Evaluation timed out”.

#210459
Demixing got stuck (2/2 - 0.689x, 6/27 - 0.692x), then I got “Evaluation timed out”.

@dipam

Today 3 of my 5 submissions failed unexpectedly.

@alina_porechina

I’ve rerun the submissions after adding an extra buffer to the timeout. They’re scored now.
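
For the curious, the fix amounts to the pattern sketched below; the mechanism and all values are illustrative, not the evaluator’s actual configuration.

```python
import subprocess

# Illustrative "timeout plus buffer" pattern; real values are not public.
BASE_TIMEOUT_S = 4 * 60 * 60  # assumed base evaluation budget
BUFFER_S = 30 * 60            # extra headroom for slow provisioning

def run_scoring(cmd: list[str]) -> subprocess.CompletedProcess:
    # Abort only once the base budget plus the buffer has elapsed.
    return subprocess.run(cmd, timeout=BASE_TIMEOUT_S + BUFFER_S, check=True)
```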

@dipam

#214686
#214701
Can you please rerun these submissions?
(“Scoring failed. Please contact the admins”)

Today, due to my mistake, 4 submissions failed with “Prediction too slow”. They ran simultaneously. Could this have somehow overloaded the system and caused scoring to fail?

@dipam
Also please rerun #214696 and #214694 (“Evaluation timed out”)

@alina_porechina All runs are staged on independent AWS instances, so simultaneous submissions won’t slow each other down.

I’ll check the “Scoring failed” runs ASAP.

Evaluation has been failing for me for a couple of days: #217689, #217455, and some other submissions. Inference completes, but scoring fails. What might be the reason? The same submission, just with different model weights, succeeded yesterday.
