AWS instance setup

Hi everyone,

I’m trying to set up an AWS instance with Round 1 available hardware:

vCPUs 8
GPU 16 GB Tesla V100

It seems that p3.2xlarge has 8 vCPUs, the correct GPU, and 61 GiB of memory, so about 65 GB.

However, when I run the default script, I face two issues:

  1. There are only 2 CPUs available (not 8).

  2. I get something like “failed to put object 4e8e6bbb00a431564d81fd5d010000c801000000 in object store because it is full. Object size is 302591732 bytes. Waiting 16000ms for space to free up…” until it crashes (full trace)

I’ve been playing a bit with RAY_MEMORY_LIMIT and RAY_STORE_MEMORY, setting them to 55M and 80M or to very big values, but I still get the same problem. My current guess is that the “Memory” here does not refer to RAM but to something else?

Really curious if someone has encountered similar problems.

Hi @mtrazzi,

  1. Please update the values for the RAM/CPU you want to use for your run. I assume you are getting 2 CPUs instead of 8 because of the default value here. In hindsight, I think we shouldn’t keep a default value in the starter kit and should let rllib automatically detect the available resources.
  2. RAY_MEMORY_LIMIT and RAY_STORE_MEMORY are the “total” memory reserved, not a per-timestep amount, so you probably want to reserve something on the order of GBs rather than MBs and try again (ideally start with 32GB+ given your system configuration, then reduce or increase based on your use case).
    In case this doesn’t solve your issue:
    (a) My initial hunch is that it is related to lru_evict, in case you have enabled it in your script.
    (b) If not, please let us know about any major change you made to the memory-related configuration, so we can help debug accordingly.
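
To make the sizing advice above concrete, here is a minimal sketch of how the three settings could be derived for a 61 GiB, 8 vCPU machine. It assumes the variables take plain byte counts passed as strings; the helper `suggest_ray_env` and the 50% / 25% fractions are hypothetical choices for illustration, not values from the starter kit.

```python
import os

GIB = 1024 ** 3  # bytes per gibibyte


def suggest_ray_env(total_ram_gib, memory_fraction=0.5, store_fraction=0.25, cpus=None):
    """Build candidate env-var values (hypothetical helper, not part of the starter kit).

    memory_fraction and store_fraction are illustrative guesses; tune them
    up or down depending on how the run behaves.
    """
    if cpus is None:
        # Fall back to whatever the machine reports instead of a fixed default.
        cpus = os.cpu_count() or 1
    total_bytes = int(total_ram_gib * GIB)
    return {
        "RAY_CPUS": str(cpus),
        "RAY_MEMORY_LIMIT": str(int(total_bytes * memory_fraction)),
        "RAY_STORE_MEMORY": str(int(total_bytes * store_fraction)),
    }


# p3.2xlarge: 61 GiB of RAM, 8 vCPUs
env = suggest_ray_env(61, cpus=8)
for name, value in env.items():
    print(f"export {name}={value}")
```

With these fractions the memory limit comes out at roughly 32.7 GB, in line with the "start with 32GB+" suggestion, and both values are several orders of magnitude above settings in the MB range.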

Adding to @shivam’s response, we use the following values during the evaluation:

RAY_MEMORY_LIMIT: "60129542144"
RAY_STORE_MEMORY: "30000000000"
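
For readers wondering how these numbers relate to the machine's RAM, the short sketch below converts them to GiB and GB, assuming both values are byte counts. The `describe` helper is hypothetical, and the object size is the one quoted in the error message earlier in the thread.

```python
GIB = 1024 ** 3  # binary gigabyte (what AWS lists as "GiB")
GB = 1000 ** 3   # decimal gigabyte


def describe(n_bytes):
    """Format a byte count in both GiB and GB (hypothetical helper)."""
    return f"{n_bytes / GIB:.2f} GiB ({n_bytes / GB:.2f} GB)"


evaluation_values = {
    "RAY_MEMORY_LIMIT": "60129542144",
    "RAY_STORE_MEMORY": "30000000000",
}
instance_ram_bytes = 61 * GIB  # p3.2xlarge

for name, raw in evaluation_values.items():
    n_bytes = int(raw)
    # Both settings must fit inside the physical RAM of the instance.
    assert n_bytes <= instance_ram_bytes
    print(f"{name}: {describe(n_bytes)}")

# Size of the object that could not be stored, from the error message.
print(f"failed object: {describe(302591732)}")
```

This prints 56.00 GiB (60.13 GB) for the memory limit, 27.94 GiB (30.00 GB) for the object store, and 0.28 GiB (0.30 GB) for the failed object, so the limit sits a few GiB below the 61 GiB of the instance.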

Hi @shivam,

With RAY_CPUS=8 and @jyotish’s config it works, thanks!