We thank you for your continued feedback on improving this challenge, and we hope to keep making it better!
Today, we are excited to announce some important changes to the evaluator!
1. GPU-enabled submissions
You can now use a GPU in your submissions! Just set
`"gpu": true` in your submission configuration.
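For reference, a minimal sketch of the flag (assuming the usual `aicrowd.json` submission config; check the starter kit for the exact file name and schema):

```json
{
  "challenge": "dr-derks-mutant-battlegrounds",
  "gpu": true
}
```

Note that JSON requires lowercase `true`.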
2. Access to build and run logs
If your submission fails for some reason, you no longer need to wait for someone to provide logs. You can check them yourself!
You can access the logs from the Evaluation Status issue comment on GitLab directly.
3. Updates to random item selection and to gym-derk
[in response to Random items generation question]
Now, your agents are less likely to end up with no items. The item-related changes are documented here: http://docs.gym.derkgame.com/#items
[We have decided against letting agents choose items as we want them to generalize over them]
The change in random item selection seems to have had some unintended consequences. Submissions trained and submitted on previous versions of the environment can no longer reproduce their leaderboard scores, and for new submissions the distribution of scores has changed.
For example, my submission developed during the warm-up round (and resubmitted for the first round) consistently scored ~2.6 (https://www.aicrowd.com/challenges/dr-derks-mutant-battlegrounds/submissions/116018). However, when resubmitted now, it scores around ~1.6 (https://www.aicrowd.com/challenges/dr-derks-mutant-battlegrounds/submissions/121655).
Even submissions trained and evaluated on the new environment typically receive lower scores than before, so the distribution of scores has shifted.
I am not complaining about that change. I think it was a good one. I am just saying the competition environment changed in the middle of the competition. The overview page for the competition says that round 1 ends Feb 15. Given the change in environment, if round 2 is not ready, wouldn’t it be a good idea to start a new round with a new leaderboard?
Can confirm, we ran into the same issues.
It also feels like submitting the same submission twice gives different results (though it could be an issue on my side).
I also have a suggestion: games should be symmetric. That is, if you generate 1000 random item combinations versus another 1000 combinations, the sides should then be swapped and the same combinations played again from the opposite side.
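A rough sketch of that mirroring idea (the `play_game` function here is a hypothetical stand-in for one evaluation episode, not the real gym-derk API):

```python
import random

def play_game(agent_items, opponent_items):
    # Hypothetical stand-in: returns the agent's score for one game.
    # In the real evaluator this would run a full gym-derk episode.
    return random.gauss(len(agent_items) - len(opponent_items), 1.0)

def symmetric_eval(n_pairs, rng):
    """Play each random item matchup twice, swapping which side
    gets which item set, so item-assignment luck cancels on average."""
    scores = []
    for _ in range(n_pairs):
        items_a = rng.sample(range(10), k=rng.randint(0, 3))
        items_b = rng.sample(range(10), k=rng.randint(0, 3))
        scores.append(play_game(items_a, items_b))  # agent gets set A
        scores.append(play_game(items_b, items_a))  # agent gets set B
    return sum(scores) / len(scores)

mean_score = symmetric_eval(64, random.Random(0))
```

With the real environment in place of `play_game`, the pairwise average removes the advantage of one side happening to draw better items.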
Yes, if you submit the same submission twice you get different results (usually they are similar). I would expect this given the random item generation; however, I would also expect that averaging over many games would negate the effect.
Maybe 128 games is not enough to average over. Maybe we should try more?
Another possible solution would be to generate a secret random set of items and keep it fixed during evaluation. There may be other sources of randomness, though.
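Fixing the item schedule could be as simple as seeding the generator once, server-side, with a seed kept private (a sketch; the real item generator in gym-derk works differently, and the seed and pool sizes here are invented):

```python
import random

SECRET_SEED = 1234  # hypothetical; held privately by the evaluator

def draw_fixed_item_sets(n_games, n_item_slots=3, item_pool=range(20)):
    """Draw the same schedule of item sets for every submission
    by seeding a dedicated RNG once per evaluation."""
    rng = random.Random(SECRET_SEED)
    return [rng.sample(list(item_pool), k=n_item_slots)
            for _ in range(n_games)]

# Every call reproduces the identical schedule, so all submissions
# are evaluated on the same (secret) item draws.
first = draw_fixed_item_sets(128)
second = draw_fixed_item_sets(128)
assert first == second
```

This removes item-draw variance between submissions while keeping the draws unpredictable to participants, as long as the seed stays secret.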
Having a symmetric evaluation is a good idea too, especially if the secret set of items is kept fixed during evaluation. However, I would also expect many random trials to have a similar effect.
Some of the score variation may be on our end, as mentioned. I might test this by fixing a random set of items.