I have some more relatively straightforward questions about the submissions.
After reading the rules, my understanding is that it is perfectly OK to use operations research methods instead of reinforcement learning to find the solution. Is that correct?
The time-step limit for each round is set to 1.5 * (height + width); however, from the starter kit it is not clear whether there is an actual time limit for each round's computations. If there is, how is it calculated, or what is its general value?
Is it possible for participants to modify their run.py script? For example, say I wanted to instantiate an object in the script that does the actual decision making rather than the controller function. Is that possible? The initial layout of the run.py script seems somewhat constraining to me.
Thank you very much for your answers and clarification.
Yes, you can use any algorithm you like. Keep in mind that the computational complexity will increase vastly in round 2, as we allow for different train speeds.
Currently you have an 8-hour computational time limit to solve 1000 environments. Also, if no action is performed by your controller for 15 minutes, the submission scoring will be aborted. (@mohanty anything more to add here?) (Update from @mohanty: you will have access to 3 CPU cores and 8 GB of RAM. No GPUs are available at this point in time.)
Yes, you can modify your run.py script. For example, you can pre-compute all the actions you want to take (e.g. as a list), and your controller simply provides the appropriate action to the environment at each step (see the sketch below). The environment was built with reinforcement learning in mind, so an action is needed at every step. Therefore OR approaches need to do a little hacking to turn their results into lists of actions.
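To illustrate this, here is a minimal sketch (not the official starter-kit code) of replaying a pre-computed plan from run.py. `plan_all_actions` is a hypothetical planner, and the "do nothing" fallback action value is only indicative:

```python
# Sketch: replay pre-computed actions from run.py (illustrative only).
# `plan_all_actions` is a hypothetical OR planner returning one joint-action
# dict per time step, e.g. [{agent_id: action, ...}, ...].
precomputed = plan_all_actions(env)
current_step = 0

def my_controller(observation, number_of_agents):
    """Return the pre-computed joint action for the current time step."""
    global current_step
    if current_step < len(precomputed):
        action = precomputed[current_step]
    else:
        # Plan exhausted: fall back to "do nothing" (assumed to be action 0).
        action = {agent: 0 for agent in range(number_of_agents)}
    current_step += 1
    return action
```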
Each submission has 4 CPUs and 16 GB of RAM, currently no GPU. (@mohanty please update if we have changes here.)
The score is the mean percentage of agents who solved their own objective (arrived at their destination). E.g. if half the agents arrive at their target in time, the score is 0.5. Because there might be several submissions with the same score, we also compute the mean reward, which is just the mean reward over all agents and all episodes (reward = -1 for each step, reward = 0 for an agent at its target, reward = +1 for all agents if all agents reach their targets). A rough sketch of this computation follows below.
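This is an unofficial sketch of the aggregation described above; the data layout and names are assumptions, not the evaluator's actual code:

```python
# Sketch: compute the primary score and the tie-breaking mean reward.
# `episodes` is assumed to be a list of (num_arrived, num_agents, rewards)
# tuples, where `rewards` holds every per-agent, per-step reward.
def compute_scores(episodes):
    # Primary score: fraction of agents at their target, averaged over episodes.
    mean_done = sum(arrived / total for arrived, total, _ in episodes) / len(episodes)
    # Tie-breaker: mean reward over all agents and all episodes.
    all_rewards = [r for _, _, rewards in episodes for r in rewards]
    mean_reward = sum(all_rewards) / len(all_rewards)
    return mean_done, mean_reward
```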
The scores will not be combined. Only scores from round 2 will be considered for the prizes.
For round 1, the max number of trains is set to 50, at an environment size of 100x100.
If the scores of round 1 and round 2 won't be combined, what's the point of round 1? Why not start with round 2 directly (and have some percentage of environments where all agents have the same speed, i.e. round 1 type environments; this percentage could even be 0 if it doesn't represent a sufficiently important case)?
Thanks for your replies. There will be an announcement later today and an update to the rules to clarify how submission scoring works and how prizes are awarded.
Where is it mentioned that the max number of allowed time steps is 1.5*(width+height)? Last time I only read that such a constant exists, but I didn’t see its value mentioned (so I assumed it’s hidden and maybe even different for each test case). If it’s indeed fixed at 1.5 for every test case (can anyone confirm?), I would like to use that in my solution.
You are correct with your assumption about the max number of steps allowed in Round 1.
The rail environment will terminate an episode at step = 1.5 * (width + height).
ATTENTION: If you plan to use the max number of steps per episode in your code, be sure to make it variable as this is likely to change for the more complex Round 2.
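Here is a small sketch of deriving the Round 1 step limit from the formula above, keeping the 1.5 factor as a parameter since it may change in Round 2 (the function name is illustrative):

```python
def max_episode_steps(width, height, factor=1.5):
    # Round 1 rule: the episode terminates at step = factor * (width + height).
    return int(factor * (width + height))

# e.g. a 100x100 environment terminates after 300 steps in Round 1
assert max_episode_steps(100, 100) == 300
```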
The score is the mean percentage of agents who solved their own objective
Does it mean that you calculate arrived_trains / len(trains) per episode and then take the mean over all episodes?
Or do you calculate the global number of arrived trains and divide it by the global number of trains?
The mean percentage of agents done is calculated as you expected: the number of arrived trains divided by the total number of trains per episode, and then we take the mean over all episodes.
If we calculated one global ratio over all episodes at once (pooling all agents together), we would bias the score towards the results of the larger envs. In the current setting the bias is towards smaller envs, where a few agents have more influence on the mean of agents finished.
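A tiny sketch contrasting the two aggregation schemes; the episode numbers are made up, just to show the difference:

```python
# (arrived, total) per episode: a small env at 50% done, a large env at 40% done.
episodes = [(1, 2), (40, 100)]

# Evaluator's scheme: per-episode fraction, then mean over episodes.
per_episode_mean = sum(a / t for a, t in episodes) / len(episodes)        # (0.5 + 0.4) / 2 = 0.45

# Alternative scheme: pool all agents into one global ratio (weights large envs more).
global_ratio = sum(a for a, _ in episodes) / sum(t for _, t in episodes)  # 41 / 102 ≈ 0.402

print(per_episode_mean, global_ratio)
```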