Product Matching: Inference failed

georgy_surin · February 15, 2023, 8:28am

hello
I have a problem with my submission too. Local evaluation works fine, validation check in submission too but Product Matching: Inference failed. Can you give me some more logs please or another help

hca97 · February 15, 2023, 11:19am

Hi,

I am having the same problem too. The funny thing is it happens to me randomly if I submit the same solution multiple times sometimes it works and sometimes it fails

My submission IDs are;
Passed: 209969
Failed: 209962

Both have the same code just submitted at different times.

Maybe @dipam can have a look? I was initially suspecting a timeout error but then my second submission should not succeed.

dipam · February 15, 2023, 11:34am

@hca97

Submission 209962 timed out, the machine provided is always same, so I’m not sure if there is any other randomness in your code which makes it take more time. The full inference needs to complete in 10 minutes.

hca97 · February 15, 2023, 11:41am

Thank you for your quick reply I will check on it. I have another quick question, in 10 min how many images are we expected to process? On the home page, it says over 40k images but doesn’t specify an exact number.

georgy_surin · February 15, 2023, 12:10pm

@dipam Can you have a look at my submission #209959 ?

dipam · February 15, 2023, 12:22pm

@georgy_surin 209959 is CUDA OOM

dipam · February 15, 2023, 12:25pm

@hca97 The inference dataset has around ~18k images including both query and gallery images.

radhika_chhabra · February 23, 2023, 12:59pm

@dipam Can you please look into this error, we are getting the same error: “no such manifest: docker.io/aicrowd/iglu:210527
Docker image not found, building…” AIcrowd

dipam · February 23, 2023, 1:07pm

@radhika_chhabra Can you check my comment on your latest submissions’s gitlab issue page.

hca97 · February 28, 2023, 10:30pm

Hi @dipam,

I was looking at my code to see if I have any randomness. However, when I try to run it locally multiple times I am getting similar runtime (maybe 100/200 ms is different but it is acceptable).

What I observe is that when I submit my solution the deviations get increased, at least what I see in the product validation step.
For example:
210756: validation step takes 9.2s
210897: validation step takes 7.8s

Are the experiments run on Kubernetes Cluster? If so, I was thinking maybe the resources allocated to the pod/node are not the same. Maybe one pod/node gets better hardware than the other one (heterogeneous kubernetes cluster, container performance difference - Stack Overflow). Then I guess having a 10min run time might be unfair for all the experiments.

Here are some examples that might help us/me to debug, these examples use exactly the same models. However, they are trained on different training configurations or different epochs of the same model.

Example 1:
Passed: 210329, 210321, 209969, 209617
Failed: 210701, 210710

Example 2:
Passed: 209881
Failed: 210897

How much time did the passed examples take? Is there a noticeable runtime difference? For the failed example how many percent of the test dataset is run?

Thank you for your time.

dipam · March 1, 2023, 5:10am

Hi @hca97

Thank you for the insightful and extensive testing.

Indeed the experiments are run with Kubernetes, however we explicitly provision on-demand AWS g4dn.xlarge (for the GPU nodes) for all the submissions. I understand that this may allow for variations in the quality of CPU or memory that AWS has attached to the nodes. However this is the only way we can run the competition.

As for the time limit, the Machine Can See organizers have explicitly asked for the 10 minute limit, as they want the models to fast. So participants have to add provisions in their code to track the time spent, or ensure that the code has enough buffer to allow for the allegedly slower runs to complete.

hca97 · March 1, 2023, 2:52pm

Hi @dipam,

I would like to ask if it is possible to get the following information.

bartosz_ludwiczuk · March 11, 2023, 12:51pm

Hi @dipam. Could you send me the errors for error for this submission?

211540

Also, I can mention that in my last 4 failed submissions, 3 were because of aicrowd platform.

2x Build-Env failed, when nth was changed in dependences (try in next day succeeded)
1x Product MatchingValidation time-out where in logs it not even started

Not sure if AiCrowd has changed sth or if it was just errors coming from Cloud Provider

dipam · March 11, 2023, 1:23pm

I’ve added the error in the gitlab comments.

About the failed builds with no changes to dependencies. The only time this has occurred is that the repository has too many models which kills the Docker build, not sure if it’s also the case here, I’ll check further. If it is indeed the cause, unfortunately for now you’ll have to reduce the overall repository size somehow.

For the one with time-out but it didn’t start, can you give me the submission id?

hca97 · March 11, 2023, 10:50pm

Hi,

If the git repo is too large maybe you ignore the .git folder when building the docker image.

bartosz_ludwiczuk · March 13, 2023, 2:04pm

Here is the list of submissions with weird errors:

211521 (env failed)
211522 (env failed)
211387 (Product Matching validation time-out, but I don’t see logs for loading the model)

Also, could you also check out what happened here?

211640 (I think the error is on my side)

youssef_nader3 · March 15, 2023, 1:15am

@dipam can you please let me know what went wrong here, It says inference failed and there’s nothing in the logs, submission:211733, thanks!

strekalov · March 15, 2023, 4:46am

@dipam Hello, please help me why the submission failed on product matching stage. AIcrowd

dipam · March 16, 2023, 9:18am

@strekalov , I’ve added your error on gitlab comment.

dipam · March 16, 2023, 9:20am

@youssef_nader3 I’ve added your error on gitlab comment.