Rewards function calculation bug


Hi FLATland team!

I want to report a bug in the rewards function (latest commit, 05.10).

I found that the system does not apply any score penalty (reward) to an agent that has not started moving and formally does not exist on the map. This means that an agent can wait at its start point for some steps for free.

Please add rewards calculation for non-spawned agents, as if they were waiting.


Hi @vetrov_andrew

Thank you for pointing this out. We actually realized this ourselves when training the agents.
Our suggestion would be to penalize the agents yourself for waiting. In future releases we will have more complex schedules where it is OK for agents to wait before entering the environment.

We suggest the following two solutions:

  1. Penalize waiting agents by looking at the info returned by the environment: if env.agents[a].status == 0, then all_rewards[a] -= 1.
  2. Use other algorithms to decide which agent should enter when, and only use RL for agents in the environment.
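
A minimal sketch of the first option, assuming (as in the condition above) that a status of 0 marks an agent that has not yet entered the map. The helper name `penalize_waiting_agents`, the penalty size, and the stand-in agent objects are hypothetical and only there to make the example self-contained; in practice you would pass `env.agents` and the rewards returned by the environment's step.

```python
from types import SimpleNamespace

# Assumption: status value 0 marks an agent that has not yet entered the map,
# as in the condition quoted above.
NOT_SPAWNED_STATUS = 0
WAIT_PENALTY = 1  # hypothetical penalty per waiting step


def penalize_waiting_agents(agents, all_rewards, penalty=WAIT_PENALTY):
    """Return a copy of all_rewards with `penalty` subtracted for every
    agent that is still waiting to enter the environment."""
    adjusted = dict(all_rewards)
    for handle, agent in enumerate(agents):
        if agent.status == NOT_SPAWNED_STATUS:
            adjusted[handle] = adjusted.get(handle, 0) - penalty
    return adjusted


# Stand-in agents for illustration; in practice this would be env.agents.
agents = [SimpleNamespace(status=0), SimpleNamespace(status=1)]
step_rewards = {0: 0, 1: -1}
adjusted_rewards = penalize_waiting_agents(agents, step_rewards)
```

The function returns a new dict instead of modifying the rewards in place, so the original per-step rewards stay available for logging or comparison.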

Does this help? Otherwise we can discuss further if this needs to be implemented and push it in the next update.

Best regards,



Well, actually I wanted to clarify how the final score of our submissions is calculated.
I guess that you use this reward function to measure the optimality of a solution in Round 1, since I found these lines in the script:

    if done['__all__']:
        print("Reward : ", sum(list(all_rewards.values())))

So, if you still use this approach (and you definitely do), the score is calculated incorrectly.

For example, I can order some agents not to move and never enter the environment. In this case they contribute nothing to the total penalty, so the final score comes out better than it should (which of course is incorrect). I can describe this in more detail if you want.

Thus, there is a bug in the score calculation, which can be fixed by changing the default rewards function or by some other means.

Sorry for any misunderstanding.
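
To illustrate with made-up numbers, here is a simplified sketch, not the real scoring code. It assumes each agent gets -1 per step spent in the environment and nothing while it is not spawned; the function `episode_score` and the step counts are hypothetical.

```python
def episode_score(steps_in_env, steps_waiting, count_waiting):
    """Sum of rewards over all agents, assuming -1 per step in the
    environment and, if count_waiting is True, -1 per step spent waiting
    outside it."""
    total = 0
    for moving, waiting in zip(steps_in_env, steps_waiting):
        total -= moving
        if count_waiting:
            total -= waiting
    return total


# Hypothetical 20-step episode: agents 0 and 1 travel for 10 steps each,
# agent 2 is ordered never to enter the environment.
steps_in_env = [10, 10, 0]
steps_waiting = [0, 0, 20]

score_current = episode_score(steps_in_env, steps_waiting, count_waiting=False)
score_with_waiting = episode_score(steps_in_env, steps_waiting, count_waiting=True)
```

Under these assumptions the held-back agent adds nothing to `score_current` (-20), while counting its waiting steps gives -40.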


Thanks @vetrov_andrew for clarifying your bug report. I agree with your observation and will adjust this for the upcoming schedules.
This means any agent will be penalized for departing with a delay. By setting all starting times to 0, we basically introduce a negative reward for not entering.
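
As a rough sketch of the idea, not the actual implementation: the function `departure_penalty`, its parameters, and the penalty of -1 per step are all assumptions made for illustration.

```python
def departure_penalty(current_step, start_time, has_entered):
    """Return -1 for a step at or after start_time on which the agent has
    still not entered the environment, and 0 otherwise."""
    if has_entered or current_step < start_time:
        return 0
    return -1


# With start_time = 0, every step spent outside the environment is penalized.
total_zero_start = sum(
    departure_penalty(t, start_time=0, has_entered=False) for t in range(5)
)

# With a later scheduled start, waiting before that time is free.
total_late_start = sum(
    departure_penalty(t, start_time=3, has_entered=False) for t in range(5)
)
```

With all starting times at 0, an agent that stays out for five steps collects -5; with a scheduled start at step 3 it would only collect -2 over the same five steps.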

I’m happy you brought this to our attention.

Best regards,