Product Matching: Inference failed

dipam · March 1, 2023, 5:10am

Thank you for the insightful and extensive testing.

Indeed, the experiments are run with Kubernetes; however, we explicitly provision on-demand AWS g4dn.xlarge instances (for the GPU nodes) for all submissions. I understand that this may allow for variations in the quality of the CPU or memory that AWS attaches to the nodes. However, this is the only way we can run the competition.

As for the time limit, the Machine Can See organizers have explicitly asked for the 10-minute limit, as they want the models to be fast. So participants have to add provisions in their code to track the time spent, or ensure that the code has enough buffer to allow for the allegedly slower runs to complete.
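
For anyone looking for a starting point, here is a minimal sketch of what tracking the time spent could look like. It is only an illustration under stated assumptions: the `TimeBudget` class, the `process_with_budget` helper, the 60-second buffer, and the idea of switching to a cheaper fallback are all hypothetical and not part of the competition starter kit. Only the 600-second limit comes from the rules discussed above.

```python
import time


class TimeBudget:
    """Tracks wall-clock time against a fixed limit minus a safety buffer."""

    def __init__(self, limit_s=600.0, buffer_s=60.0, clock=time.monotonic):
        # limit_s mirrors the 10-minute limit; buffer_s is an assumed margin
        # meant to absorb slower-than-usual nodes.
        self.limit_s = limit_s
        self.buffer_s = buffer_s
        self._clock = clock
        self._start = clock()

    def elapsed(self):
        """Seconds since the budget was created."""
        return self._clock() - self._start

    def remaining(self):
        """Seconds left before the buffered deadline (never negative)."""
        return max(0.0, self.limit_s - self.buffer_s - self.elapsed())

    def expired(self):
        return self.remaining() == 0.0


def process_with_budget(items, step, budget, fallback):
    """Apply `step` to each item until the budget expires, then use `fallback`."""
    results = []
    for item in items:
        if budget.expired():
            # Out of time: use the cheap path so the run still finishes.
            results.append(fallback(item))
        else:
            results.append(step(item))
    return results
```

The budget is checked between items, so a single very slow `step` call can still overshoot; the buffer should be larger than the slowest step you expect.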

hca97 · March 1, 2023, 2:52pm

Hi @dipam,

I would like to ask if it is possible to get the following information.

bartosz_ludwiczuk · March 11, 2023, 12:51pm

Hi @dipam. Could you send me the errors for this submission?

211540

Also, I can mention that of my last 4 failed submissions, 3 were because of the AIcrowd platform:

2x Build-Env failed when nothing was changed in the dependencies (retrying the next day succeeded)
1x Product Matching validation time-out where, according to the logs, it had not even started

Not sure if AIcrowd has changed something or if these were just errors coming from the cloud provider.

dipam · March 11, 2023, 1:23pm

I’ve added the error in the GitLab comments.

About the failed builds with no changes to dependencies: the only time this has occurred is when the repository has too many models, which kills the Docker build. I’m not sure if that’s also the case here; I’ll check further. If it is indeed the cause, unfortunately for now you’ll have to reduce the overall repository size somehow.

For the one that timed out without even starting, can you give me the submission ID?

hca97 · March 11, 2023, 10:50pm

Hi,

If the git repo is too large, maybe you could ignore the .git folder when building the Docker image.
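
To see whether this would help at all, a quick check like the one below can compare the repository size with and without `.git`. It is a rough sketch: `dir_size_bytes` is a hypothetical helper, and it says nothing about how the AIcrowd build actually assembles its Docker context.

```python
import os


def dir_size_bytes(root, exclude=()):
    """Sum file sizes under `root`, skipping top-level directories named in `exclude`."""
    total = 0
    root_abs = os.path.abspath(root)
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.abspath(dirpath) == root_abs:
            # Prune excluded directories in place so os.walk does not descend into them.
            dirnames[:] = [d for d in dirnames if d not in exclude]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling `dir_size_bytes(".")` and `dir_size_bytes(".", exclude={".git"})` from the repository root gives the two totals in bytes. With a plain `docker build`, a `.dockerignore` file listing `.git` keeps that folder out of the build context; whether the platform's build honors it is something only the organizers can confirm.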

bartosz_ludwiczuk · March 13, 2023, 2:04pm

Here is the list of submissions with weird errors:

211521 (env failed)
211522 (env failed)
211387 (Product Matching validation time-out, but I don’t see logs for loading the model)

Also, could you check what happened here?

211640 (I think the error is on my side)

youssef_nader3 · March 15, 2023, 1:15am

@dipam can you please let me know what went wrong here? It says inference failed and there’s nothing in the logs. Submission: 211733. Thanks!

strekalov · March 15, 2023, 4:46am

@dipam Hello, please help me understand why the submission failed at the Product Matching stage. AIcrowd

dipam · March 16, 2023, 9:18am

@strekalov, I’ve added your error in a GitLab comment.

dipam · March 16, 2023, 9:20am

@youssef_nader3 I’ve added your error in a GitLab comment.

bartosz_ludwiczuk · March 17, 2023, 8:01am

@dipam
Could you add the errors for the following failed inference submissions:

#211884
#211754

dipam · March 17, 2023, 8:15am

@bartosz_ludwiczuk

They timed out.

ilia_denisov · March 24, 2023, 5:49am

@dipam Could you provide more info about submissions #212367 and #212368, please?
Both show “Product Matching: Inference failed”.

Mykola_Lavreniuk · March 24, 2023, 8:53pm

@dipam, could you please check
#212567
#212566
They failed at the “Build Packages And Env” step, although I changed only the NN params.
To me this is very strange…

strekalov · March 25, 2023, 2:26pm

@dipam Can you help me understand why my submission failed, please? AIcrowd

bartosz_ludwiczuk · March 29, 2023, 6:37am

@dipam Could you check submission: 213161?

The diagram shows that everything worked fine, but the status is failed:

youssef_nader3 · April 3, 2023, 11:23pm

@dipam Sorry to bother you again, but can you please let me know why 214218 and 212720 failed?

dipam · April 3, 2023, 11:47pm

@youssef_nader3 Both of these timed out

Mykola_Lavreniuk · April 5, 2023, 2:44pm

@dipam, have you changed some settings of the inference server?
Previously I faced about 1 failed submission per day, and just rerunning helped.
However, today I changed only the NN weights files and nothing else: 1 submission was OK, while 4 others with weights of the same size and the same model (everything the same, just other epochs) failed.
To me this is very strange…
Could you please check it?

#214502

#214483

#214469

If it is a timeout, how can that be when other weights are OK, or when just resubmitting sometimes helps?

dipam · April 5, 2023, 5:33pm

@Mykola_Lavreniuk

All these timed out, they’re just barely above the time limit. The variation in runtime can be due to the slight variation in CPU type that the AWS nodes we provision can have. Hence resubmitting can sometimes help, but for consistency I suggest trying to bring down the compute time.
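
If it helps to find where the time goes, a small per-stage timer along these lines can show which part of the pipeline is worth optimizing first. This is only a sketch: `StageTimer` and the stage names in the usage note are hypothetical, and local timings will not match the evaluation nodes exactly.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per named stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Accumulate so a stage entered several times is summed.
            self.durations[name] = self.durations.get(name, 0.0) + self._clock() - start

    def report(self):
        """Return (stage, seconds) pairs, slowest first."""
        return sorted(self.durations.items(), key=lambda kv: kv[1], reverse=True)
```

Wrapping each phase in `with timer.stage("load_model"):` or `with timer.stage("embed"):` (illustrative names) and printing `timer.report()` at the end gives a rough breakdown to compare between runs.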

I understand that every second might matter; however, the organizer has deemed 10 minutes to be a generous time limit for the kind of solutions they are looking for, hence the constraint.