Question on Scoring

ben_swain · December 31, 2025, 12:48am

The top team on the Track 1 leaderboard currently has a perfect score of 1.0 for all 4 games. I imagine more teams will soon reach this score. I understand that the final evaluation will include hidden test cases/scenarios, but how can we determine whether our agents are improving relative to other teams if we all have perfect scores of 1.0? Perhaps the current evaluation cases should be more difficult so we can better compare different teams' progress?

zi_hong_chen · December 31, 2025, 3:42am

Same question. I believe the current evaluation is no longer suitable for this competition.

RickySong · December 31, 2025, 4:41am

It makes sense.
There are tie-breaking criteria, but we have no visibility into how other teams measure up on them.
A perfect score will soon be within reach for other teams too, which means the tie-breaking criteria will become more important.

In addition, for Track 2 (unlimited LLM size), we do not know the parameter counts of commercial LLMs. (For example, which is bigger: ChatGPT or Gemini?) Therefore, it seems necessary to have a metric that is easier to understand and shows how much we have achieved compared to other teams.

Tie-breaking criteria
If two or more teams are tied on the primary evaluation metric, rank them by the following criteria in order (earlier items take precedence):

1. Lower model complexity, measured as Aggregate Total Parameters (ATP). ATP is the sum of total parameters across all distinct models used during the final official evaluation. “Total parameters” include all active and frozen weights, embeddings, adapters, and LoRA modules. For Mixture-of-Experts, count all experts (total parameters), not only the activated experts.
2. Lower mean LLM inference calls per evaluation episode (fewer is better). Measured as: total inference calls made during the official evaluation ÷ number of evaluation episodes.
3. Shorter mean prompt length (tokens) per evaluation episode (fewer is better). Measured as: total tokens sent to models during the official evaluation ÷ number of evaluation episodes. Tokens are counted by the Competition Organizer-designated tokenizer.
4. Earlier final-submission timestamp, if still tied.
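
To make the ordering above concrete, here is a rough Python sketch of how a ranking with these tie-breakers could be computed. It is only an illustration, not the organizers' evaluation code: the `TeamResult` class, its field names, and the sample numbers are all made up, and it assumes that teams count as tied only when their primary scores are exactly equal.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TeamResult:
    # Hypothetical record; the rules do not define a data format.
    name: str
    score: float            # primary evaluation metric (higher is better)
    model_params: dict      # distinct model name -> total parameter count
    inference_calls: int    # total LLM calls during the official evaluation
    prompt_tokens: int      # total tokens sent to models
    episodes: int           # number of evaluation episodes
    submitted_at: datetime  # final-submission timestamp

    @property
    def atp(self) -> int:
        # Aggregate Total Parameters: sum over all distinct models used.
        return sum(self.model_params.values())

    @property
    def mean_calls(self) -> float:
        return self.inference_calls / self.episodes

    @property
    def mean_prompt_tokens(self) -> float:
        return self.prompt_tokens / self.episodes


def ranking_key(t: TeamResult):
    # Tuples compare element by element, so earlier criteria take precedence.
    # The score is negated because higher is better; the rest are lower-is-better.
    return (-t.score, t.atp, t.mean_calls, t.mean_prompt_tokens, t.submitted_at)


def rank(teams):
    return sorted(teams, key=ranking_key)


# Made-up sample data: three teams tied at 1.0 and one below.
teams = [
    TeamResult("A", 1.0, {"m": 7_000_000_000}, 400, 80_000, 100, datetime(2026, 1, 20)),
    TeamResult("B", 1.0, {"m": 1_000_000_000}, 900, 120_000, 100, datetime(2026, 1, 21)),
    TeamResult("C", 1.0, {"m": 1_000_000_000}, 300, 150_000, 100, datetime(2026, 1, 22)),
    TeamResult("D", 0.9, {"m": 100}, 1, 1, 100, datetime(2026, 1, 1)),
]
order = [t.name for t in rank(teams)]
print(order)
```

In this sample, B and C tie on ATP, so C ranks ahead on fewer mean inference calls even though its prompts are longer and it submitted later. D's tiny model does not help it, because tie-breakers only apply among teams with the same primary score.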

ben_swain · December 31, 2025, 5:05am

One issue I see with the tie-breaking criteria is that it is fairly simple to create rule-based scripts that achieve perfect scores in the games. There’s probably already code for this online, so what is preventing teams from relying primarily on rule-based scripts to do most of the work, with a minimal LLM included just to ensure the team has the optimal tie-breaking solution?

Since there’s nothing in the rules disallowing this, it is likely that the winning solutions will work this way.

RickySong · December 31, 2025, 6:08am

I think that when conducting qualitative analysis evaluations, scores will likely be lower for cases where the LLM was not utilized effectively. However, since there are no publicly disclosed criteria for evaluating the LLM prompts or their level of activation, it is currently unclear how this will be determined.

In my opinion, it is not just about achieving a high score; there needs to be a standard for evaluating how well the LLM (or prompt) was utilized.

ben_swain · January 1, 2026, 9:54pm

I think we need clarification on this. I’ll pause work on this competition until then.

howon_lee · January 6, 2026, 9:47pm

Hi everyone — thanks for the question.

Short answer: the live leaderboard scoring is provisional and is not the same evaluation used to determine final winners.

Phase 1 (Live Leaderboard) uses lightweight, real-time evaluation to help teams track progress during the competition. Leaderboard positions are provisional.
Phase 2 (Final Evaluation) happens after the final submission deadline. Teams submit full reproducible packages, and final rankings and winners are determined solely from this reproducible evaluation.
Tie-breakers (model size, inference calls, prompt length, etc.) are applied only in the final official evaluation, not on the live leaderboard.

This is explicitly stated in the rules (e.g., “live leaderboard positions are provisional” and “tie-breakers are applied only to the final official evaluation”).

We’ll walk through the evaluation flow and answer questions in more detail during the Townhall on Jan 9, 11:00 AM KST:

Feel free to post follow-up questions ahead of time or bring them to the session.