The change in random item selection seems to have had some unintended consequences. Submissions trained and submitted on previous versions of the environment can no longer reproduce their leaderboard scores. And even submissions trained and evaluated on the new environment typically receive lower scores than before, so the distribution of scores has changed.
I am not complaining about that change; I think it was a good one. I am just pointing out that the competition environment changed in the middle of the competition. The overview page for the competition says that round 1 ends Feb 15. Given the change in environment, if round 2 is not ready, wouldn’t it be a good idea to start a new round with a fresh leaderboard?
Can confirm. We ran into the same issues.
Also, it feels like if I make the same submission twice I get different results (though it could be an issue on my side too).
I also have a suggestion: games should be symmetric. That is, if you generate 1000 random combinations vs. another 1000 combinations, the sides should be swapped and the same combinations played again from the opposite sides.
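Here is a rough sketch of what I mean, in Python. The environment's actual API is not public here, so `generate_items` and `play_game` are hypothetical stand-ins; the point is just that each item set gets played from both sides, so any luck baked into a particular item draw cancels out:

```python
import random

def generate_items(rng):
    # Hypothetical stand-in for the environment's random item generation.
    return [rng.uniform(0, 1) for _ in range(10)]

def play_game(side_a, side_b, items, rng):
    # Hypothetical stand-in for one game: side_a's margin over side_b on
    # this item set, plus noise for the environment's other randomness.
    return side_a(items) - side_b(items) + rng.gauss(0, 0.1)

def symmetric_eval(agent, opponent, n_pairs=1000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        items = generate_items(rng)
        # Play the same item set twice with the sides swapped, so an
        # advantage built into this particular draw affects both agents.
        total += play_game(agent, opponent, items, rng)
        total -= play_game(opponent, agent, items, rng)
    return total / (2 * n_pairs)

# Toy example: two "agents" that value an item set differently.
greedy = lambda items: sum(sorted(items)[-3:])
mean_based = lambda items: 3 * sum(items) / len(items)
print(symmetric_eval(greedy, mean_based))
```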
Yes, if you submit the same submission twice you get different results (usually they are similar). I would expect this given the random item generation. However, I would also expect averaging over many games to mostly cancel this effect out.
Maybe 128 games is not enough to average over. Maybe we should try more?
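For a rough sense of scale: if the per-game score has some standard deviation sigma (unknown for this environment; 1.0 below is just an illustrative guess), the noise on the averaged score only falls off as sigma / sqrt(n), so quadrupling the game count merely halves it:

```python
import math

sigma = 1.0  # assumed per-game score spread, purely illustrative
for n in (128, 512, 2048):
    # Standard error of the mean over n independent games.
    print(f"n={n:5d}  standard error ~ {sigma / math.sqrt(n):.3f}")
```

So going from 128 to 512 games helps, but fixing the item set attacks the variance more directly.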
Another possible solution would be to generate a secret random set of items and keep it fixed during evaluation. There may be other sources of randomness, though.
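A minimal sketch of how that could work, assuming the host draws one secret seed and reuses it for every evaluation (`SECRET_SEED` and `fixed_eval_items` are hypothetical names, not the competition's actual machinery):

```python
import random

SECRET_SEED = 1234567  # hypothetical; chosen once by the host, never published

def fixed_eval_items(n_games, n_items=10):
    # Re-seeding with the same secret seed reproduces the identical
    # sequence of item sets for every submission being scored.
    rng = random.Random(SECRET_SEED)
    return [[rng.uniform(0, 1) for _ in range(n_items)] for _ in range(n_games)]

# Every submission is evaluated on this same list, so remaining score
# differences come from the agents themselves (plus any other randomness
# still inside the environment).
evaluation_sets = fixed_eval_items(n_games=128)
```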
Having a symmetric evaluation is a good idea too, especially if the secret set of items is kept fixed during evaluation. However, I also expect many random trials to have a similar effect.