Are the specs for the machine that the evaluations are run on available anywhere?

It is hard to optimize the speed without knowing this. In particular, code changes that have given a 2x speedup on different local machines do not seem to have impacted the submission run time.

Hello @jon_grantham

The evaluations run on AWS EC2 instances. The resources available are as follows:

GPU enabled flag in aicrowd.json | vCPUs | Memory | GPU       | AWS instance type
true                             | 4     | 16 GB  | NVIDIA T4 | g4dn.xlarge
false                            | 4     | 16 GB  | -         | m5.xlarge

Note:
During the evaluations, you will receive a proxy NetHack env object instead of the actual environment. This proxy object talks to the actual NetHack env over the network and returns the values as needed. We do this to prevent participants from tampering with the env, but it also adds overhead. Based on our benchmarks, a single env should give a throughput of roughly 1500-2000 steps/second. Using something like a batched env increases the throughput.
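For example, here is a minimal sketch of one way to overlap that per-step round trip by stepping several envs from a thread pool (the thread-based batching here is just an illustration, not an official batched-env API):

# Illustrative sketch: hide the per-step network round trip by stepping
# several proxy envs concurrently. Each step is dominated by network I/O,
# so threads overlap well despite the GIL.
from concurrent.futures import ThreadPoolExecutor

import aicrowd_gym

NUM_ENVS = 8
envs = [aicrowd_gym.make("NetHackChallenge-v0") for _ in range(NUM_ENVS)]
for env in envs:
    env.reset()

def step_one(env):
    # Step with a random action; reset when the episode ends.
    _, _, done, _ = env.step(env.action_space.sample())
    if done:
        env.reset()

with ThreadPoolExecutor(max_workers=NUM_ENVS) as pool:
    for _ in range(1000):
        # One synchronized "batch step" across all envs.
        list(pool.map(step_one, envs))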

Hope this helps and please feel free to reach out to us for any help.


Thank you. That helps a lot.

When you say, “a single env should roughly give a throughput of 1500-2000 steps/second” – is that without considering time spent in the agent? How much of that is network latency? I.e., if we have an agent capable of performing faster than that, does all of the agent compute time get hidden under the latency?


Hello @jon_grantham

is that without considering time spent in the agent?

Yes, it is without considering the time spent in the agent.

How much of that is network latency? I.e., if we have an agent capable of performing faster than that, does all of the agent compute time get hidden under the latency?

This includes the network latency, processing delay, and everything else needed to send a request and receive the response. Any latency/compute time in the agent is added on top of this.

For example, if you have something like

import aicrowd_gym
from tqdm import trange

# The proxy env behaves like a regular gym env; every step is a network
# round trip to the real NetHack env. tqdm's it/s readout shows the throughput.
env = aicrowd_gym.make("NetHackChallenge-v0")
env.reset()
for _ in trange(1000000):
    _, _, done, _ = env.step(1)
    if done:
        env.reset()

This should give you a throughput of 1500-2000 iterations per second during the evaluation.

Hello @jyotish,

If the throughput is in the 1500-2000 range, the maximum average number of steps is only 1500-2000 * 0.5 * 3600 / 128 ≈ 21k-28k per game (30 minutes of wall time spread across 128 games).
Also note that the example you gave doesn’t really exercise the environment’s step delay, because env.step(1) (1 is CompassDirection.E) becomes a no-op after a few steps, once the character hits a wall (the turn counter stops ticking after that).

28k steps per game is an extremely tight limit if one aims for ascension (we do!). Assuming ascension takes about 200k turns, roughly equivalent to 400k steps, we would have to drop at least 93% of all games early, since only 28k / 400k ≈ 7% of games can be played to completion. We already average >40k steps, and we have to resort to hacks to maximize the median score, such as quitting once the current game exceeds the median (judging from the leaderboard, team Panic does the same).
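To make the arithmetic explicit, a quick back-of-the-envelope script (all numbers taken from the discussion above):

# Step budget under the current limits: 30 minutes of wall time, 128 games.
time_budget_s = 0.5 * 3600
games = 128
for steps_per_s in (1500, 2000):
    budget = steps_per_s * time_budget_s / games
    print(f"{steps_per_s} steps/s -> {budget:,.0f} steps per game")
# 1500 steps/s -> 21,094 steps per game
# 2000 steps/s -> 28,125 steps per game

# If an ascension run needs ~400k steps but the average budget is ~28k,
# at most ~7% of games can be played to ascension; the rest must be dropped.
print(f"max ascension fraction: {28_125 / 400_000:.1%}")  # 7.0%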

Of course, the assumption that the agent takes no time to execute is not realistic. In our case the environment accounts for ~15% of the total execution time (measured locally, so there is no communication delay).

The goal of the competition is to develop the best agent, but right now it is more of a performance optimization problem. Citing the challenge motivation: “The only restriction is on the compute and runtime during evaluation, though these will be set to very generous limits to support a wide range of possible implementations”. Currently the compute limit is far from “very generous”; we believe the time limit should be increased by at least 10x.

So we encourage you to increase the time limit as much as possible to give participants a better chance of beating the game. With the current limit ascension is extremely unlikely, but it becomes quite possible if the limit is increased.


Hi All!

Firstly thanks for this discussion and everyone’s contribution to it. It’s been very valuable for us organizers!

Our solution to the above problem is not to change the rules (30 minutes will still be the maximum each agent gets for acting), but rather to ensure that network latency does not ‘run down the clock’ on game time, or get counted (as it clearly shouldn’t).

As such, AIcrowd has implemented a solution that should allow a random model to take more than 20,000 sequential steps per second in the environment. Under the previous calculations this allows at least 20,000 * 0.5 * 3600 / 128 ≈ 280k steps per episode for a random agent.

I should also note that in the final evaluation we run 4096 episodes over 24 hours (instead of 512 in 2 hours), further increasing the maximum number of steps you can achieve: 280k * (512 / 4) / (4096 / 48) ≈ 420k per episode.
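To restate that calculation as a quick sanity check (the per-episode wall-time figures are derived from the episode counts and durations above):

# Wall time available per episode, regular evaluation vs final evaluation.
regular_s_per_ep = 2 * 3600 / 512     # 512 episodes in 2 hrs -> ~14.06 s each
final_s_per_ep = 24 * 3600 / 4096     # 4096 episodes in 24 hrs -> ~21.09 s each
scale = final_s_per_ep / regular_s_per_ep   # = 1.5
print(f"{280_000 * scale:,.0f} steps per episode")  # 420,000 at 20k steps/s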

I’m sure @dipam can fill in with more specific details on timing later. I hope this clears up the matter and many thanks to AIcrowd for making it possible!


One more note: it’s worth highlighting that the overhead here came from communicating “chattily” over the network, something that could probably also have been mitigated by moving to batched environments if you were previously operating serially. Thankfully, AIcrowd’s solution requires no refactoring!