Suggestions for the competition

It is clear by now that the biggest difficulty for competitors is the submission process.

Although we can validate locally, and even on GCP, many people have still run into difficulties.
This has nothing to do with Reinforcement Learning itself, but I feel that a competition where only a few people know how to get a working submission is not a good competition.

Here are some results from the tests I ran over the past few days, which consumed my submission quota.
The image currently used by AIcrowd is:

nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04

A similar environment can be set up in GCP Compute:

My system language is Chinese, but that doesn't matter. Pay attention to choosing CUDA 9.0 instead of the default CUDA 10. Although this does not affect the results when running inside Docker, some contestants may need to export the configuration of the original environment for non-Docker use.

(Your GPU quota needs to be officially approved by GCP first.)
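For reference, here is a minimal sketch of creating such an instance from the command line with gcloud; the instance name, zone, machine type, and GPU model below are placeholder assumptions, so adjust them to whatever your approved quota allows:

gcloud compute instances create otc-test \
  --zone=us-west1-b \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --image-family=ubuntu-1604-lts \
  --image-project=ubuntu-os-cloud \
  --maintenance-policy=TERMINATE \
  --boot-disk-size=100GB

(GPU instances require --maintenance-policy=TERMINATE, since they cannot be live-migrated.)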

This will be closer to AIcrowd's execution environment. When evaluating with

docker run \
  --env OTC_EVALUATION_ENABLED=true \
  --network=host \
  -it obstacle_tower_challenge:latest ./run.sh

you should add the flag

--runtime=nvidia

This lets the container use the NVIDIA driver, so CUDA works inside Docker automatically.
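Putting it together, the evaluation command from above becomes:

docker run \
  --runtime=nvidia \
  --env OTC_EVALUATION_ENABLED=true \
  --network=host \
  -it obstacle_tower_challenge:latest ./run.sh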

In addition, GCP can pre-configure the NVIDIA Docker runtime for you. If you are setting up in a non-GCP environment, you will need to install it yourself.
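A rough sketch of installing the runtime on Ubuntu, assuming the NVIDIA driver is already present and following the nvidia-docker project's published instructions (double-check against the project's current README before running):

# Add the nvidia-docker apt repository
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the runtime and restart the Docker daemon
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Quick check: this should print the GPU table from inside a container
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi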

Another special reminder:
The newline sequence in a DOS file is CRLF (^M^J)
The newline in a Unix file is LF (^J)
So in some cases you will run into bash errors because a configuration file was edited in a Windows environment; use Vim or a similar editor to convert the line endings.
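For example, here is a quick sketch of two common ways to convert, using run.sh as the example file (dos2unix may need to be installed separately):

# In Vim: force the Unix file format, then save
vim run.sh
:set fileformat=unix
:wq

# Or use the dos2unix utility
dos2unix run.sh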

By the way, could the administrators give me some more submission quota? :rofl::rofl:


Hi @ChenKuanSun,

Thanks for your insightful analysis!

But I am afraid this is not a CUDA/GPU issue: I myself ran a benchmark earlier with numerous submissions that were hogging the GPU, and all of them succeeded without error.

And the same evaluation setup is running a few other quite intense private competitions, where we haven't seen any issue like this.

This seems to be due to some nuance in the communication between the evaluation binary and the submitted code. Another issue we noticed is that many of the images were just huge, with a few layers as big as 20 GB, and simply pulling those images from the Docker registry was taking way too much time. But that was also fixed yesterday (correct me if I am wrong @shivam).
So seeing these errors again is indeed troubling :confused:

I will try to do a bit more debugging today and see how it goes. The nvidia runtime is already present in the cluster, so it is definitely not the cause of the issue.

I meant that if participants want to simulate the evaluation environment's configuration themselves, they can use these methods.

By the way, can you please help me check why my CKPT file ends up corrupted when the repo is cloned for the build?
The correct size should be 4-5 GB.

I think I found the issue!!
Our git setup was misconfigured by mistake :confused:

The general policy is that on gitlab.aicrowd.com you should not be able to check in a file larger than 20 MB directly, as that bloats the git history and causes a whole array of issues downstream, including the bloating of the images.

I just modified the server to ensure that no file larger than 20 MB can be directly checked into a repository.

To check in larger model weights, you will have to use git-lfs. More on that here: https://about.gitlab.com/2017/01/30/getting-started-with-git-lfs-tutorial/
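For example, a minimal sketch of adding a checkpoint through git-lfs (the file name and tracking pattern here are placeholders for your own files):

# One-time setup for the repository
git lfs install
git lfs track "*.ckpt"
git add .gitattributes

# Commit and push the weights through LFS as usual
git add model.ckpt
git commit -m "Add model checkpoint via git-lfs"
git push origin master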

In the meantime, can you delete your current repository (take a backup, of course ;)), then fork the starter kit again, add your code files, add your checkpoints via git-lfs, and try submitting again?

OK, I will try again~~~~~~

And could you help me increase my quota, since this doesn't seem to have been caused by a problem with my submission?
:rofl::rofl: