@shivam My submission took more than 90 minutes, but it completed successfully. Will it be counted as the final result? In addition, I found that submission times for the same amount of computation can differ by as much as 2000 seconds. When two submissions run at the same time, the computation time increases. Is a GPU physically shared between multiple submissions when the same user queues several of them?
The elapsed time reported on the issue page is currently wrong and will be fixed soon: it shows the total time from submission to completion, rather than from the start of execution of your code to its end. The timeout, however, is implemented properly and only considers the running time.
We provision a new machine dynamically for each submission, so the elapsed time may have been higher when there were many submissions in the evaluation queue (multiple machines were being provisioned).
@shivam @mohanty I'm getting a CUDA out of memory error when loading my model in PyTorch, even though run.py works on my own T4 and V100. I tested it inside the same container that the Dockerfile builds. I don't know what else to do at this point.
Hash: af5e3e9d5a515b6917e2d39340da51e23b23d878
@mohanty@shivam
We are running into a problem where the environment has not been configured properly for 2 hours.
Could you take a look at it?
submit hash: 00d6a5fb492d8648d5cc1724ce7efcd79b3f532d
@shivam
I have no idea what went wrong in this case: I can't see any error in the log, but the submission failed.
AIcrowd Submission Received #193832 - initial-15
submission_hash : 77fd92f1686f89bb2a0a4a09ab2cb83cce5f3e0c.
If this issue is not updated within a reasonable amount of time, please send email to help@aicrowd.com.
Could you also check my submission? I believe some hosting service is behaving unusually. The code passed the public test set and failed soon after on the private set. I also looked at other participants' submissions from around the same time, and all of them failed.
submission_hash : 540adaa2989b1c62dffc48659400db2cc0a13989.
@wufanyou : The evaluation failed due to a timeout. The increased timeout of 120 minutes should fix this issue. We have re-queued your submission for re-evaluation.
Hi all, in case you feel your submission is running much slower online than on your local setup, it might be a good idea to verify that torch or other relevant packages are installed properly.
Here is an example for torch:
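A minimal sketch of such a check, not necessarily the exact snippet the organizers had in mind. It assumes the usual cause of slow online runs is a CPU-only torch build, and the helper name `torch_cuda_report` is made up for illustration:

```python
import importlib.util


def torch_cuda_report():
    """Summarize whether torch is installed and whether it can see a GPU."""
    # Avoid an ImportError if torch is missing from the environment.
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch

    return {
        "installed": True,
        "version": torch.__version__,
        # None here means a CPU-only build of torch was installed.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        "device_count": torch.cuda.device_count(),
    }


print(torch_cuda_report())
```

If `built_with_cuda` is `None` or `cuda_available` is `False` in the submission logs while it is `True` locally, the model is most likely running on the CPU online.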
If you are unsure how to verify this for your package, please let us know and we can release relevant FAQs.
Hi @shivam and @mohanty. To debug this, I did the following: I'm printing nvidia-smi in my prediction_setup method, right before loading my model, and it seems it gets executed 2 times, which is why I'm getting this CUDA out of memory error.
The first time it loads correctly, but it can't load a second time without releasing the GPU's RAM.
Any idea why it's loading 2x?
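For anyone wanting to reproduce this kind of debugging, here is a rough sketch of the approach described above. Only the method name `prediction_setup` comes from this thread; the `Predictor` class and `_load_model` placeholder are hypothetical. The guard only helps if the evaluator calls the method twice on the same object; it does nothing if two separate processes are started.

```python
import shutil
import subprocess


class Predictor:
    """Hypothetical predictor; only the prediction_setup name is from the thread."""

    def __init__(self):
        self.model = None
        self.setup_calls = 0

    def prediction_setup(self):
        self.setup_calls += 1
        print(f"prediction_setup call #{self.setup_calls}")

        # Log GPU memory usage before loading, if nvidia-smi is on the PATH.
        if shutil.which("nvidia-smi") is not None:
            result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
            print(result.stdout)

        # Guard: skip the load if this object already holds a model.
        if self.model is not None:
            print("model already loaded, skipping second load")
            return
        self.model = self._load_model()

    def _load_model(self):
        # Placeholder standing in for the real (GPU) model load.
        return object()
```

Calling `prediction_setup()` twice on one `Predictor` keeps the first model and prints two call counters, which makes a double invocation easy to spot in the logs.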
We have identified a bug in the evaluation phase that caused models to load twice; it is now fixed.
We have also restarted your latest submission and are monitoring it in case a similar error occurs.