Submissions failing

Hello, my submissions have started failing seemingly at random. They complete the Music Demixing Validation but then fail on the Music Demixing.

Link to submission AIcrowd

Here are the debug logs AIcrowd

and the admin reference logs https://dashboard.aicrowd.com/explore?orgId=15&left={"datasource":"Logs","queries":[{"refId":"A","expr":"{aicrowd_com_submission_id%3D\"213912\",%20aicrowd_com_logs%3D\"true\",%20container%3D\"main\"}"}],"range":{"from":"now-2d","to":"now"}}

@dipam @mohanty

I don't think this is caused by a submission timeout, because the Music Demixing Validation part would have failed as well.

@dipam @mohanty

My issue is being caused by PyTorch 2.0.0. My code requires PyTorch 1.13.1, and I have it listed in my requirements.txt as torch==1.13.1, but it seems that the Music Demixing Validation and Music Demixing are using different environments?

Please can you tell me how to fix this error?
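One way to confirm a version mismatch like the one described above is to log the installed version at the start of inference and compare it with the pin. The following is only a minimal sketch: the helper names parse_pin and check_pin are hypothetical, it handles only simple name==version lines, and it assumes that whatever the script prints ends up in the debug logs.

```python
from importlib import metadata


def parse_pin(line):
    """Split a simple 'name==version' requirement line into (name, version)."""
    name, _, version = line.strip().partition("==")
    return name.strip(), version.strip()


def check_pin(line, get_version=metadata.version):
    """Compare a pinned requirement with the version installed at runtime.

    Returns (name, pinned, installed, matches). `installed` is None when
    the package cannot be found. `get_version` is injectable for testing.
    """
    name, pinned = parse_pin(line)
    try:
        installed = get_version(name)
    except metadata.PackageNotFoundError:
        return name, pinned, None, False
    # Ignore a local build tag such as '+cu117' when comparing.
    matches = installed.split("+")[0] == pinned
    return name, pinned, installed, matches


if __name__ == "__main__":
    # The pin below is the one quoted in this thread.
    name, pinned, installed, ok = check_pin("torch==1.13.1")
    print(f"{name}: pinned {pinned}, installed {installed}, match={ok}")
```

Running this at the top of both the validation and the main run would show whether the two stages really see different torch versions.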

@kimberley_jensen

The environments for both the validation and main runs are the same. We build one Docker image from your repository and reuse it for both runs, so they will have the same environment.

@dipam

Do you have any suggestions as to why inference in my submissions works for validation but fails for the main run, please? These are the debug logs:

submission: AIcrowd

debug logs: AIcrowd
admin logs: https://dashboard.aicrowd.com/explore?orgId=15&left={"datasource":"Logs","queries":[{"refId":"A","expr":"{aicrowd_com_submission_id%3D\"214465\",%20aicrowd_com_logs%3D\"true\",%20container%3D\"main\"}"}],"range":{"from":"now-2d","to":"now"}}

@dipam @mohanty

Please can you check whether there is an issue with when the inference timer begins on the main run? My inference speed on validation is 2.103x and the allowed time is 1.68x. There is nothing in my code that would cause that much difference; I have run lots of local evaluations and the only difference between evaluations is about 3 seconds.

Is it normal, when going from validation to the main run, for “Preparing the cluster for you PodInitializing” and “PodInitializing” to happen again?

Also, is there a better way to talk about my issue, like on Discord? Getting only one reply every 1-2 days is not good for me when the contest ends in 20 days.

The nodes we provision from AWS can vary in CPU type between runs; this is beyond our control. The only guess I can make about the runtime difference is high CPU usage.
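To see whether the node or CPU load explains a runtime difference, a submission could log basic host details and compare wall-clock time with CPU time around the inference call. This is only a sketch under assumptions: the helpers timed and describe_host are hypothetical names, the sum call stands in for the real inference function, and it assumes printed output reaches the debug logs.

```python
import os
import platform
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


def describe_host():
    """Collect basic facts about the machine the run landed on."""
    return {
        "cpu_count": os.cpu_count(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }


if __name__ == "__main__":
    print(describe_host())
    # Placeholder workload; replace with the real per-track inference call.
    _, wall, cpu = timed(sum, range(1_000_000))
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

Comparing these numbers between the validation and main runs may help separate a slower node from a change in the workload itself. Note that CPU time is summed over all threads in the process, so it can exceed wall-clock time for multi-threaded inference.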

As for PodInitializing, yes, that happens for each stage of the run; it's normal.

Discourse or email is the best channel to contact us. We understand the challenge is in the last stages and will try to reply as soon as possible.