I can confirm that GPUs are available for evaluations if you have set gpu: true
in your aicrowd.json; they were not removed at any point. In case someone is facing issues getting a GPU in their submission, please share your submission ID with us so it can be investigated.
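For reference, the GPU flag is just a boolean field in aicrowd.json; a minimal illustration of the relevant part (the rest of your aicrowd.json stays as it is):

{
  "gpu": true
}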
@shraddhaamohan, in your submission above, i.e. #27829, your assert was assert torch.cuda.is_available()==True,"NO GPU AVAILABLE",
which wasn’t showing the full issue.
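As a side note for future submissions, a slightly more verbose check than a bare assert would surface the relevant versions directly in the evaluation logs; a minimal sketch:

import torch

# Print the versions involved before failing, so the logs show any mismatch
print("torch version:", torch.__version__)
print("torch built for CUDA:", torch.version.cuda)
if not torch.cuda.is_available():
    raise RuntimeError("NO GPU AVAILABLE")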
I tried debugging it with your submitted code, and this is what was happening:
>>> import torch
>>> torch.backends.cudnn.enabled
True
>>> torch.cuda.is_available()
False
aicrowd@aicrowd-food-recognition-challenge-27829-38f8:~$ nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-cd5d75c4-a9c5-13c5-bd7a-267d82ae4002)
aicrowd@aicrowd-food-recognition-challenge-27829-38f8:~$ nvidia-smi
Tue Dec 17 14:19:22 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79 Driver Version: 410.79 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 47C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
We further found that this is happening because the underlying CUDA version we provide to submissions is 10.0, and submissions are evaluated with the docker image “nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04”. Your submission, however, uses a custom Dockerfile based on pytorch/pytorch:1.3-cuda10.1-cudnn7-devel; a PyTorch build targeting CUDA 10.1 cannot run on the CUDA 10.0 driver available on the evaluation nodes, leading to the “no GPU found” assert above.
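A quick way to spot such a mismatch inside the evaluation image (assuming python and nvidia-smi are on the PATH) is to compare the CUDA version PyTorch was built against with the maximum CUDA version the installed driver supports:

python -c "import torch; print(torch.version.cuda)"   # CUDA version the PyTorch build targets
nvidia-smi | grep "CUDA Version"                      # maximum CUDA version the driver supports

If the first is newer than the second, torch.cuda.is_available() will typically return False even though a GPU is present.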
Finally, the diff between your existing and the working Dockerfile is as follows:
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,5 +1,5 @@
-ARG PYTORCH="1.3"
-ARG CUDA="10.1"
+ARG PYTORCH="1.2"
+ARG CUDA="10.0"
 ARG CUDNN="7"

 FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel
@@ -17,6 +17,7 @@ RUN conda install cython -y && conda clean --all
 RUN git clone [removed-name] /[removed-name]
 WORKDIR /[removed-name]
+RUN git reset --hard c68890db5910eed4fc8ec2acf4cdf1426cb038e9
 RUN pip install --no-cache-dir -e .
 RUN cd /
The repository you were cloning above was working the last time your docker image was built, i.e. Dec 10, but one of the commits currently in the master branch has broken pip install. We suggest pinning versions in your future submissions so that an inconsistent state doesn’t occur on re-build/re-run.
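For example, pinning the clone to a known-good commit or tag (as the diff above does with git reset --hard) keeps re-builds reproducible; a sketch using the same placeholders as above, where the reference in angle brackets is whatever commit or tag you want to pin to:

RUN git clone [removed-name] /[removed-name]
WORKDIR /[removed-name]
# Pin to a specific commit/tag so later pushes to master can't break the build
RUN git checkout <known-good-commit-or-tag>
RUN pip install --no-cache-dir -e .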
I have shared the new error traceback in your submission’s GitLab issue above (after the GPU assert passed).
tl;dr: I tried running your exact codebase with the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel
base image and the Dockerfile diff above, and it seems to be working fine. Let us know in case there are any follow-up doubts.