Question on Scoring

The top team on the Track 1 leaderboard currently has a perfect score of 1.0 on all 4 games, and I imagine more teams will reach this score soon. I understand that the final evaluation will include hidden test cases/scenarios, but how can we tell whether our agents are improving relative to other teams if we all have perfect scores of 1.0? Perhaps the current evaluation cases should be made more difficult so we can better compare different teams' progress?