[Updated] Customize Dockerfile for both phases

After several rounds of debugging, we finally made a successful online-inference-only submission. For teams that need to build their own Docker image, here is a minimal template:


# The host driver version is 450.172.01, which only supports CUDA 11.0.3 or earlier.
# Check Docker Hub for other base image tags.
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
RUN apt-get update && apt-get upgrade -y && apt-get install -y python3 python3-pip && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip3 install <some package>
RUN ln -s /usr/bin/python3 /usr/bin/python
COPY models/* /models/
# Put all necessary code into utils. Note that the Python version in this
# path may differ, depending on the base image's operating system.
COPY utils /usr/local/lib/python3.6/dist-packages/starter_kit/

# The online submission environment requires running as a non-root user named aicrowd.
# If run as root, the submission passes the public phase but fails the private phase.
# My guess is that the system tries to delete files generated in the public phase
# and fails due to missing permissions.
ENV USER aicrowd
ENV HOME /home/aicrowd
RUN groupadd --gid 1001 aicrowd
RUN useradd --comment "Default user" --create-home --gid 1001 --no-log-init --shell /bin/bash --uid 1001 aicrowd
USER aicrowd
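Before submitting, it can help to sanity-check the image locally (a sketch; `my-submission` is just a placeholder tag, not anything the platform requires):

```
# Build the image from the repository root (where the Dockerfile lives)
docker build -t my-submission .

# The container should run as the non-root user "aicrowd", not root
docker run --rm my-submission whoami

# Verify the copied code is importable with the image's Python
docker run --rm my-submission python -c "import starter_kit"
```

If `whoami` prints anything other than aicrowd, or the import fails, the online build would likely fail in the same way.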


I maintain the repository structure as:

├── aicrowd.json
├── Dockerfile
├── models/
├── README.md
└── utils/run.py


Note: this Dockerfile works for both the public test and private test phases.


Hello, I want to know whether the configuration you showed should be written in a Dockerfile? I really don't know how to start building my environment. I'm looking forward to your reply!

As I said in the note, this configuration only works for the public test phase but will fail during the private test phase. I am also waiting for a reply from the aicrowd team.

Hi, Thank you very much for the example!
Just want to know, where do you put the data file?

For simplicity, I put all data into /models/, whether it is model weights or processed data. You can use your own directory structure.
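As a sketch of that convention (assuming everything lives flat under /models/; `load_artifacts` is a hypothetical helper, not part of the starter kit):

```python
from pathlib import Path

def load_artifacts(models_dir="/models"):
    """List everything the Dockerfile copied into /models/,
    whether it is a model weight or a processed-data file."""
    return sorted(p.name for p in Path(models_dir).iterdir())
```

run.py can then pick files by name without caring whether a given file is a weight or preprocessed data.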

Hi, wufanyou.
I followed your Dockerfile here but got an error. Do you know how I can get the error log? Thank you!

Status: FAILED
Build failed :frowning:
Last response:
{
  "git_uri": "git@gitlab.aicrowd.com:yuan_ling/task_2_query-product_ranking_code_starter_kit.git",
  "git_revision": "submission-14",
  "dockerfile_path": null,
  "context_path": null,
  "image_tag": "aicrowd/submission:192436",
  "mem": "14Gi",
  "cpu": "3000m",
  "base_image": null,
  "node_selector": null,
  "labels": "evaluations-api.aicrowd.com/cluster-id: 2; evaluations-api.aicrowd.com/grader-id: 77; evaluations-api.aicrowd.com/dns-name: runtime-setup; evaluations-api.aicrowd.com/expose-logs: true",
  "build_args": null,
  "cluster_id": 1,
  "id": 4070,
  "queued_at": "2022-07-05T18:59:22.444264",
  "started_at": "2022-07-05T19:00:02.492853",
  "built_at": null,
  "pushed_at": null,
  "cancelled_at": null,
  "failed_at": "2022-07-05T19:00:16.845263",
  "status": "FAILED",
  "build_metadata": "{\"base_image\": \"nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04.\"}"
}

Sorry, I put one extra dot there. There is no dot at the end. You need to check the available tags on Docker Hub:
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04

Check my latest post. This Dockerfile should work with just a few modifications.

Thank you very much for exploring and sharing this. I have some comments that may be helpful for others.

Exactly. I find that the 450 driver can support all CUDA 11.x versions. And if you only use PyTorch, you don't need to install CUDA yourself, since PyTorch bundles its own CUDA runtime (that is why the PyTorch wheel file is so large).


Yes, the documentation says that the 450 drivers support all CUDA 11.x versions, like 11.4. But I tested 11.4, 11.3, and 11.2 locally with a 450 driver first, and only 11.0.3 worked. That is why I suggested 11.0.3 as the highest version.

I tested torch-1.12.0+cu113 and torch-1.12.0+cu116 on my machine with a 450 driver; both of them can use the GPU normally.

In [1]: import torch

In [2]: !nvidia-smi | head -n 4
Wed Jul  6 11:06:06 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+

In [3]: torch.version.cuda
Out[3]: '11.6'

In [4]: torch.zeros((2, 2)).cuda()
Out[4]:
tensor([[0., 0.],
        [0., 0.]], device='cuda:0')

Right. If you work with PyTorch then it should be fine.

I tried to build with the following Dockerfile:

FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
RUN apt-get update && apt-get upgrade -y && apt-get install -y python3 python3-pip && apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip install pandas
RUN ln -s /usr/bin/python3 /usr/bin/python
COPY models/* /models/
COPY utils /usr/local/lib/python3.6/dist-packages/starter_kit/

# The online submission environment requires running as a non-root user named aicrowd.
# If run as root, the submission passes the public phase but fails the private phase,
# presumably because the system tries to delete files generated in the public phase
# and fails due to missing permissions.

ENV USER aicrowd
ENV HOME /home/aicrowd
RUN groupadd --gid 1001 aicrowd
RUN useradd --comment "Default user" --create-home --gid 1001 --no-log-init --shell /bin/bash --uid 1001 aicrowd
USER aicrowd

And my repository structure is the same as yours, but the build fails. Can you help me see what the reason is?

I tested your Dockerfile locally. It says that pip is not found. This is an operating-system-specific problem: on Ubuntu 18.04, the python3-pip package provides the pip3 command by default, so use pip3 to install packages and the problem is solved. I suggest testing locally before submitting.
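Concretely, a fix along these lines should work (a sketch assuming pandas is the only package needed; `--no-cache-dir` just keeps the image smaller):

```
# Ubuntu 18.04's python3-pip installs the command as pip3, not pip.
# Calling it through the interpreter makes it explicit which Python gets the package:
RUN python3 -m pip install --no-cache-dir pandas
```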

Thank you. I'm not very good at using Docker, and I'm struggling to build a local test environment.

I used pip3, but I got this error:

=================
pip install failed.
Trying to setup AIcrowd runtime.
pip install failed.
Trying to setup AIcrowd runtime.
pip install failed.
Trying to setup AIcrowd runtime.
pip install failed.
Trying to setup AIcrowd runtime.

I didn't change aicrowd.json; are there some values that need to be set?

Hi,
when you test locally, does this command work?
COPY utils /usr/local/lib/python3.6/dist-packages/starter_kit/
it says:
COPY failed: file not found in build context or excluded by .dockerignore: stat usr/local/lib/python3.7/dist-packages/starter_kit/: file does not exist
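One likely cause, as a hedged guess: the source path in COPY is resolved relative to the build context (the directory passed to `docker build`), not to the container filesystem, so this error usually means the `utils` directory is missing next to the Dockerfile or is excluded by a `.dockerignore` file. Also note the destination path hard-codes a Python version, which must match the base image:

```
# <src> is relative to the build context; this fails if ./utils does not
# exist beside the Dockerfile, or if .dockerignore excludes it.
# <dest> must match the Python version actually installed in the base image.
COPY utils /usr/local/lib/python3.6/dist-packages/starter_kit/
```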

I got the same error; were you able to solve it? Thanks :woozy_face:
I got this error 90 minutes after submission.

This command is operating-system specific. Which base image did you use?

I did not meet this problem.