Round 2 evaluation details

Hello! I just joined the 2020 Flatland challenge and I’m very excited to see where RL takes us.

I have a couple questions about the evaluation:

  • Why are environments grouped into tests?
  • From here, I infer that the only difference between levels in the same test is the malfunction interval, and that tests differ by all the other parameters. Is this correct?
  • While evaluating locally, it seems that the evaluator service picks a random environment from all the environment tests. How does that work in the online evaluation, if the list of tests is infinite?
  • Is the score computed on all evaluated environments, independent from which test they come from?

Thank you!


Hey @slopez!

  • Why are environments grouped into tests?

All the environments in the same test have the same parameters: height, width, number of agents… the only difference is the malfunction rate. The performance of your submission is evaluated one Test at a time, and on average 25% of the trains need to reach their destination for you to move on to the next Test. We’re interested in seeing up to what environment size a submission can still perform well enough.

  • From here, I infer that the only difference between levels in the same test is the malfunction interval, and that tests differ by all the other parameters. Is this correct?

Correct! Well, not quite all the other parameters change: e.g. max_rails_in_city, malfunction_duration etc. are constant across all the levels.

  • While evaluating locally, it seems that the evaluator service picks a random environment from all the environment tests. How does that work in the online evaluation, if the list of tests is infinite?

From 🚂 Here comes Round 2!:

Note: Now that the environments are evaluated in order (from small to large), you should test your submissions locally under the same conditions. You can use the --shuffle flag when calling the evaluator to get consistent behavior:

flatland-evaluator --shuffle False

  • Is the score computed on all evaluated environments, independent from which test they come from?

The final score is the sum of the normalized return across all the evaluated environments, yes. So 0.5 points in an environment in Test_0 is worth as much as 0.5 points in an environment in Test_30.


Alright, thanks for the very clear answers!

Hi @MasterScrat, I am not sure if I understood correctly: the different levels in one test have the same railway settings and same number of agents. The difference is the malfunction rate. Does it mean within one test,

  • the railway networks (maps) are the same
  • the initial positions and target positions for agents are the same
  • agents will malfunction at different times and for different durations

Another concern of mine is about the timesteps. When I evaluate locally, there is “Evaluation finished in *** timesteps…”. Does each environment (level) still have a timestep limit? Or is the score calculated based on the done agents and the timesteps? Besides, how do you calculate the total reward on the leaderboard? Is it the sum of the normalized reward in each environment?

Many thanks!


Hey @beibei,

the different levels in one test have the same railway settings and same number of agents. The difference is the malfunction rate.

Correct!

Does it mean within one test,

  • the railway networks (maps) are the same
  • the initial positions and target positions for agents are the same

No, the railway networks and initial positions and targets are different for every level, even within the same test.

The parameters within one test are fixed (except for the malfunction rate), but each environment is still procedurally generated from these parameters, which results in different maps for each environment.

  • agents will malfunction at different times and for different durations

The rate of malfunction changes between the different environments within the same test. The maximum rate of malfunction (per agent) is max_mf_rate = 1.0 / min_malfunction_interval = 1.0 / 250.

You can see in more detail how the malfunction rate changes within a test here: https://flatland.aicrowd.com/getting-started/environment-configurations.html#round-2

The malfunction time range is malfunction_duration = [20,50] for all the environments in all the tests (sampled uniformly).
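Putting those numbers together, here is a minimal sketch of how a per-step malfunction draw could look. This is not Flatland’s actual implementation; it just assumes the rate is a per-step breakdown probability and the duration is sampled uniformly from [20, 50], as described above.

```python
import random

MALFUNCTION_DURATION = (20, 50)  # constant across all tests, per the above

def sample_malfunction(malfunction_rate, rng=random):
    """Illustrative only: return a breakdown duration for this step,
    or 0 if the agent keeps working. `malfunction_rate` is assumed
    to be the per-step probability of breaking down."""
    if rng.random() < malfunction_rate:
        return rng.randint(*MALFUNCTION_DURATION)  # inclusive bounds
    return 0

# Highest rate used in Round 2: 1.0 / min_malfunction_interval
max_mf_rate = 1.0 / 250
duration = sample_malfunction(max_mf_rate)  # usually 0, occasionally 20..50
```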

Another concern of mine is about the timesteps. When I evaluate locally, there is “Evaluation finished in *** timesteps…”. Does each environment (level) still have a timestep limit? Or is the score calculated based on the done agents and the timesteps? Besides, how do you calculate the total reward on the leaderboard? Is it the sum of the normalized reward in each environment?

Each environment does have its own timestep limit, as in Round 1, which you can get from self.env._max_episode_steps. It is defined as int(4 * 2 * (env.width + env.height + num_agents / num_cities)) (see https://gitlab.aicrowd.com/flatland/flatland/blob/master/flatland/envs/schedule_generators.py#L188).
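For reference, the quoted formula is easy to check by hand. Here is a hypothetical standalone helper (the real value always comes from env._max_episode_steps inside Flatland):

```python
def max_episode_steps(width, height, num_agents, num_cities):
    # Same expression as in the schedule generator linked above.
    return int(4 * 2 * (width + height + num_agents / num_cities))

# e.g. a 25x25 map with 5 agents spread over 2 cities:
print(max_episode_steps(25, 25, 5, 2))  # 420
```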

The score is calculated based on the done agents and the timesteps. We use the same normalized reward as in Round 1, but add 1.0 to make it between 0.0 and 1.0:

normalized_reward = 1.0 + sum_of_rewards / (self.env._max_episode_steps * self.env.get_num_agents())
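As a sanity check, that formula can be written as a small helper (hypothetical function name; sum_of_rewards is the accumulated per-step penalty over the episode):

```python
def normalized_reward(sum_of_rewards, max_episode_steps, num_agents):
    # 1.0 + (negative return) / (worst-case penalty) -> value in [0.0, 1.0]
    return 1.0 + sum_of_rewards / (max_episode_steps * num_agents)

# Worst case: 5 agents each pay -1 for all 420 steps -> score 0.0
print(normalized_reward(-420 * 5, 420, 5))  # 0.0
# Best case: no penalties at all -> score 1.0
print(normalized_reward(0, 420, 5))  # 1.0
```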

And then indeed the total reward that counts for the leaderboard is the sum of the normalized reward for each environment.

You have more details here: https://flatland.aicrowd.com/getting-started/prize-and-metrics.html

And in the Round 2 announcement post: 🚂 Here comes Round 2!

And in the Round 2 environment configuration page: https://flatland.aicrowd.com/getting-started/environment-configurations.html#round-2