I can confirm that GPUs are available for evaluations if you have set gpu: true
in your aicrowd.json; they were not removed at any point. In case someone is facing issues getting a GPU in their submission, please share your submission ID with us so it can be investigated.
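For reference, the GPU flag is just a boolean field in aicrowd.json; a minimal illustration of the relevant part (the rest of your aicrowd.json stays as it is):

{
  "gpu": true
}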
@shraddhaamohan, in your submission above, i.e. #27829, your assert was assert torch.cuda.is_available()==True,"NO GPU AVAILABLE",
which wasn’t showing the full issue.
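As a side note for future submissions, a slightly more verbose check than a bare assert would surface the relevant versions directly in the evaluation logs; a minimal sketch:

import torch

# Print the versions involved before failing, so the logs show any mismatch
print("torch version:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
if not torch.cuda.is_available():
    raise RuntimeError("NO GPU AVAILABLE")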
I tried debugging it with your submitted code, and this is what was happening:
>>> import torch
>>> torch.backends.cudnn.enabled
True
>>> torch.cuda.is_available()
False
aicrowd@aicrowd-food-recognition-challenge-27829-38f8:~$ nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-cd5d75c4-a9c5-13c5-bd7a-267d82ae4002)
aicrowd@aicrowd-food-recognition-challenge-27829-38f8:~$ nvidia-smi
Tue Dec 17 14:19:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 47C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
We further found that this is happening because the underlying CUDA version we provide to submissions is 10.0, and submissions are evaluated with the docker image “nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04”. Your submission, however, uses a custom Dockerfile based on pytorch/pytorch:1.3-cuda10.1-cudnn7-devel; a PyTorch build targeting CUDA 10.1 cannot run on the CUDA 10.0 driver available on the evaluation nodes, leading to the “no GPU found” assert above.
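A quick way to spot such a mismatch inside the evaluation image (assuming python and nvidia-smi are on the PATH) is to compare the CUDA version PyTorch was built against with the maximum CUDA version the installed driver supports:

python -c "import torch; print(torch.version.cuda)"   # CUDA version the PyTorch build targets
nvidia-smi | grep "CUDA Version"                      # maximum CUDA version the driver supports

If the first is newer than the second, torch.cuda.is_available() will typically return False even though a GPU is present.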
Finally, the diff between your existing and the working Dockerfile is as follows:
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,5 +1,5 @@
-ARG PYTORCH="1.3"
-ARG CUDA="10.1"
+ARG PYTORCH="1.2"
+ARG CUDA="10.0"
 ARG CUDNN="7"

 FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
@@ -17,6 +17,7 @@ RUN conda install cython -y && conda clean --all
 RUN git clone [removed-name] /[removed-name]
 WORKDIR /[removed-name]
+RUN git reset --hard c68890db5910eed4fc8ec2acf4cdf1426cb038e9
 RUN pip install --no-cache-dir -e .
 RUN cd /
The repository you were cloning above was working the last time your docker image was built, i.e. Dec 10, but one of the commits currently in the master branch has broken pip install. We suggest pinning versions in your future submissions so that an inconsistent state doesn’t occur on re-build/re-run.
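For example, pinning the clone to a known-good commit or tag (as the diff above does with git reset --hard) keeps re-builds reproducible; a sketch using the same placeholders as above, where the reference in angle brackets is whatever commit or tag you want to pin to:

RUN git clone [removed-name] /[removed-name]
WORKDIR /[removed-name]
# Pin to a specific commit/tag so later pushes to master can't break the build
RUN git checkout <known-good-commit-or-tag>
RUN pip install --no-cache-dir -e .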
I have shared the new error traceback in your submission’s GitLab issue above (after the GPU assert passed).
tl;dr: I tried running your exact codebase with the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel
base image and the Dockerfile diff above, and it seems to be working fine. Let us know in case there are any follow-up doubts.