Submission issues

Hello @jyotish

Submission #76265 seems to be stuck at some initial stage; it has been on “evaluation initiated” for 8 hours but never started.

Hello @dipam_chakraborty

We re-queued the submission for evaluation.

Hello @jyotish

Thanks for re-queuing #76265.

Can you please also help me with #76311? I believe that if #76265 runs fine, so should #76311, but it failed with OOM after exactly 14 iterations for coinrun, bigfish, and miner, and after 24 for caterpillar. In the pod info GPU memory section there are 4 items for #76265 and 5 items for #76311. I think this is somehow related to the evaluation run, as the difference between the two submissions shouldn’t cause OOM; if anything, #76311 should take slightly less memory.

Hello @dipam_chakraborty

Every element shown on the graph comes from a different node. Ideally, you would have 8 elements on the graph: 4 for training and 4 for rollouts. Sometimes you end up with more elements because we use preemptible nodes, and every new node also shows up there. For failed submissions, you might see fewer than 8 elements because the jobs exited before the metrics could be scraped for that job.

Hello @jyotish

Sorry to keep bothering you about this issue, but I’m certain #76348 and #76311 should not have hit OOM; local runs are taking up only 14.3 GB (including the evaluation worker). Just like in #76311, coinrun and miner in #76348 hit OOM after exactly 14 iterations, while bigfish didn’t; then the pod got reset, the new machine resumed from iteration 75, and it hit OOM after 75 + 14 = 89 iterations. Is the memory usage really that machine-dependent? Was something new introduced that is causing OOM errors after exactly 14 iterations?
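(A quick way to watch this locally, assuming nvidia-smi is available; just a generic sketch, the interval is arbitrary:)

# poll used/total GPU memory every 5 seconds while the local run is going
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5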

Hello @dipam_chakraborty

The GPU memory usage is not machine-dependent, but it does vary a lot depending on the version of tensorflow/pytorch used.

“local runs are taking up only 14.3 GB (including the evaluation worker)”

In that case, could you try replicating your local software stack during the evaluations (pip freeze / conda export)? Please let me know if I can help you with that.
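For example, running the following from inside the local environment captures the exact package versions (the output file names here are just placeholders; use whatever your submission setup expects):

pip freeze > requirements.txt
# or, to capture the whole conda environment instead:
conda env export > environment.yml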

Hello @jyotish

On my local machine I have a conda environment with Python 3.7, and I only install the requirements.txt listed below with pip.

ray[rllib]==0.8.5
procgen==0.10.1
torch==1.3.1
torchvision==0.4.2
mlflow==1.8.0
boto3==1.13.10

I haven’t tried pip freeze / conda export yet; I will surely try it on the next submission. If you have any best-practice advice regarding that, please let me know.

Hello @jyotish

Can you please clarify what happened with submission #77475? Three environments ran fine but a lot of the log data is missing, and coinrun failed immediately; its log file is empty.

Hello @dipam_chakraborty

We had an issue with the metrics scraper, because of which the pod metrics were not displayed. This is now fixed and should work as expected for new submissions. We are looking into the issue with the submission you mentioned and will post updates on that submission’s issue page.

Hello @jyotish

Submission #78454 failed after all steps completed. Could this be due to a network issue?