My submissions for today don’t seem to be finishing (they are stuck at the “submitted” state - “Evaluation Started” at GitLab). I uploaded one 15 hours ago, and another one 2 hours ago, and they’re both stuck. Looking at the submissions page, submissions from two other contestants also appear to be stuck. Is this a server-side issue? Just so I know whether there’s something wrong with my submissions.
Update: all my submissions failed after a long time, two with this error:
“Unable to orchestrate submission, please contact Administrators.”
And two with this:
“Whoops ! Something went wrong with the evaluation. Please tag
aicrowd-bot on this issue to provide you the relevant logs.”
Tagging the bot didn’t do anything.
I’m not sure what the issue is. My submissions from before 1st April worked without a problem and finished evaluating after around an hour and a half.
The same happened with me - I submitted 4 but only 1 completed…
Interesting, thanks for confirming! So most likely an issue with the evaluator server. Will wait for official confirmation, but it does seem like that’s the problem.
While debugging your submissions, we realised that you changed nothing except adding a new model (epoch_1.pth -> model.pth), and it started taking 20 hours. Before the change it was taking just 3.5 hours.
We would like to know the difference between the baseline models (epoch_20.pth, epoch_1.pth) and your model (model.pth), to find out why it's taking more than 5x as long.
@hannan4252 We are debugging your submissions and will let you know accordingly once we figure out the issue.
Thanks for the reply - for some reason it didn’t appear for me until now, which is why I posted my previous reply (which I’ve since deleted).
The model file I uploaded (model.pth) is based on the same architecture (HTC R-50-FPN) as the baseline submission. I used the same config file to train a new model - the only changes to the config were adding the correct paths to the data and adjusting the training hyperparameters. The idea was to start from the baseline submission and experiment with parameters before moving on to other architectures, but it seems something went wrong, and I'm not sure what.
I tried debugging on my PC, but everything works correctly on my end (the evaluation speed for the 418 validation images also matches the baseline model).
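In case it helps the debugging, this is roughly how I'd check the two checkpoints for structural differences (a minimal sketch; the layer names and shapes below are made up for illustration, and in practice the dicts would come from `torch.load("epoch_1.pth")` / `torch.load("model.pth")`):

```python
# Hypothetical helper to diff two checkpoint state dicts. In practice you'd
# build the inputs with something like:
#   {k: tuple(v.shape) for k, v in torch.load(path)["state_dict"].items()}
# Here plain name -> shape dicts stand in for real checkpoints.

def diff_state_dicts(base, mine):
    """Return keys unique to each checkpoint and keys whose shapes differ."""
    only_base = sorted(base.keys() - mine.keys())
    only_mine = sorted(mine.keys() - base.keys())
    shape_mismatch = sorted(
        k for k in base.keys() & mine.keys() if base[k] != mine[k]
    )
    return only_base, only_mine, shape_mismatch

# Toy example (made-up layer names and shapes):
baseline = {"backbone.conv1.weight": (64, 3, 7, 7), "head.fc.weight": (80, 1024)}
trained  = {"backbone.conv1.weight": (64, 3, 7, 7), "head.fc.weight": (81, 1024)}
print(diff_state_dicts(baseline, trained))
# -> ([], [], ['head.fc.weight'])
```

If the key sets and shapes match the baseline exactly, the slowdown is presumably not an architecture mismatch but something on the evaluation side.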