Starter kit stuck in "pending" state for a day

Hi there, I am having trouble with my pod being stuck in the “pending” state. My submission was just the starter kit without any modifications.

Here’s a link to the issue: https://gitlab.aicrowd.com/felixlaumon/obstacle-tower-challenge/issues/1 and here’s what aicrowd-bot says:

---------------------------------------------
##### Pod Health :heart:
---------------------------------------------
- *Pod State* : `Pending`
- *Pod Scheduling Time* :clock1: : `26150 secs`


- *Containers Waiting* :large_orange_diamond: : `redis-server`,`aicrowd-subcontractor`

Thank you!

Hi @felixlaumon,
I have responded on the relevant issue.
We had some issues with the compute cluster used for the evaluation, and this submission fell through the cracks when doing some maintenance.
This will be requeued.

Update : The evaluation has been successfully completed now.

Hi @mohanty, it appears my new submission is stuck at pending evaluation. The error message this time seems to be a timeout with GitLab.

https://gitlab.aicrowd.com/felixlaumon/obstacle-tower-challenge/issues/3?_ga=2.107717063.490548295.1550777787-1678832906.1550777787

Another question: do you know if failed evaluations count towards the 100-submission quota?

Requeued the submission.

Hi @mohanty, I think I have isolated the problem to enabling the GPU during evaluation. I made a new submission (issue #10) based on the obstacle-tower-challenge master from GitHub; the only modification is setting `gpu: true` in `aicrowd.json`.

Is the GPU not supported during evaluation, or is this a bug? Let me know if you need more details to reproduce this issue.

Thank you!

Hi @felixlaumon,

Yes, we also found some issues with the GPU configuration on the cluster. We have since fixed them and requeued both of your submissions that were stuck.

Sorry for the inconvenience.

Cheers,
Mohanty

@mohanty That’s good news. But it seems I have used up today’s quota just testing out the submission :sob: Is it possible to reset my quota for today? Thanks!

Sorry, I saw the message only now.
Can you try again now?

I just tried now. I am still getting this error:

 Submission failed : The participant has no submission slots remaining for today.

@felixlaumon: I deleted all your failed submissions. It should work now!

Hi @mohanty, unfortunately my latest submission from last week is still failing. Can you please take a look at the issue https://gitlab.aicrowd.com/felixlaumon/obstacle-tower-challenge/issues/18?

It works on my local machine, but the submission somehow failed. Thank you!

@felixlaumon: I pasted the logs. But it looks like it’s the timeout exception again (even though I see that you have a 10-minute timeout set in your code). This might need a closer look from @arthurj, @harperj and @anhad.

@mohanty So I have finally successfully made my first submission (that is not a random agent)! :tada:

The one important change I made was to defer importing `RainbowAgent` and tensorflow. See this commit: https://gitlab.aicrowd.com/felixlaumon/obstacle-tower-challenge/commit/4979d7e65de6012e92405542a1ab73ac6ea16cb4. The evaluation ran successfully after this change.
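For anyone hitting the same problem, here is a minimal sketch of the deferred-import pattern (the module names are stand-ins, not the actual starter-kit code; `json` below is just a cheap placeholder for tensorflow / `RainbowAgent`):

```python
# Hypothetical sketch: move heavy imports from module level into the
# function that needs them, so importing run.py itself is near-instant
# and the env/agent handshake can happen before any framework loads.

def create_agent():
    # Stand-in for `import tensorflow` / `from ... import RainbowAgent`,
    # each of which takes a few seconds at module level.
    import json  # cheap placeholder used only for this illustration
    return json.dumps({"agent": "RainbowAgent", "status": "ready"})

# Nothing heavy has been imported at module load time; the cost is
# paid only on the first call.
agent_info = create_agent()
print(agent_info)
```

The trade-off is that the first call to `create_agent()` is slow instead of the module import, which is exactly what works around the startup race.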

I suspect that you guys might have a race condition when the agent (`run.sh`) and the environment (`env.sh`) are launched in docker. Importing `RainbowAgent` and tensorflow usually takes a bit of time (a few seconds), which might cause `env.sh` to try to listen on the port before the env is ready in `run.py`.

I can replicate this issue locally, as I always have to wait for the “Start training by pressing the Play button in the Unity Editor” message to show up in `run.py` before I launch `env.sh`. Otherwise the environment will time out.

You can probably replicate this issue by adding `time.sleep(10)` at the very beginning of `run.py`.

While deferring imports works for me for now, it is not ideal, so it would be great if you could look into this issue. Please let me know if there is any further information you’d like me to provide.

Thank you!


@felixlaumon: It would also be great if you can send in a pull request by adding a note about this to the gcp_training doc in the starter kit.

Cheers,
Mohanty