TL;DR: We are recalculating scores using the correct normalization. We don’t expect any significant changes in the ranking, only in the numerical values.
Dear Participants - We are amending the evaluation score shown on the leaderboard. Unfortunately, we noticed that our score computation used a different normalization from that of Melting Pot 2.0. We don’t expect the ranking of any submission to change, only the numerical scores. The Melting Pot 2.0 tech report normalizes the return for each scenario so that the baseline with the lowest focal per-capita return scores zero and the baseline with the highest scores one. The normalization currently reported instead used the exploiter return from Melting Pot 2.0 as the upper bound and random performance as the lower bound. For most scenarios these bounds are the same, but for some they are not.
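For anyone who wants to sanity-check their own numbers, here is a minimal sketch of the min-max normalization described above. The function name, argument names, and baseline values are hypothetical and for illustration only; this is not the actual evaluation code, and the real baseline returns come from the Melting Pot 2.0 tech report.

```python
from typing import Sequence


def normalize_focal_return(raw_return: float,
                           baseline_returns: Sequence[float]) -> float:
    """Min-max normalize a focal per-capita return for one scenario.

    The lowest baseline return maps to 0 and the highest maps to 1.
    Returns outside the baseline range map below 0 or above 1.
    """
    lowest = min(baseline_returns)
    highest = max(baseline_returns)
    if highest == lowest:
        # Assumption: a scenario with identical baselines cannot be normalized.
        raise ValueError("baseline returns must not all be equal")
    return (raw_return - lowest) / (highest - lowest)


# Made-up baseline returns for a single scenario (illustrative values only).
example_baselines = [10.0, 25.0, 40.0]
example_score = normalize_focal_return(25.0, example_baselines)
```

Under this sketch, a submission that matches the weakest baseline scores 0 on that scenario and one that matches the strongest baseline scores 1.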
Going forward, we are updating the leaderboard score to use the normalization from Melting Pot 2.0. We will also use this updated normalization for the generalization score displayed from Nov 1.
Please note: This change will appear only as a scale factor on the overall score. Your performance on some scenarios may decrease slightly, but by the same scale for all teams, so we do not expect any significant change in the current leaderboard ranking. We will use the new score going forward for the development and generalization phases as well as the final evaluation.