The Orak Game Agent Challenge 2025 has come to a close. Over the course of the challenge, 497 participants across 117 teams took part, collectively producing 685 submissions and steadily improving agent performance throughout the competition.
Orak is an open benchmark designed to test agentic LLM systems in real games. Participants submitted MCP-connected agents capable of consuming textual and visual state across several environments, including Super Mario, Pokémon Red, StarCraft II, and 2048.
Final Evaluation
Final standings were computed as a weighted average across four environments:
- Pokémon: 0.30
- StarCraft II: 0.30
- Super Mario: 0.15
- 2048: 0.15
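The weighted scoring above can be sketched as a small Python function (the function name and dictionary layout are mine; the weights and example scores come from this post):

```python
# Per-environment weights used for the final ranking.
WEIGHTS = {"pokemon": 0.30, "sc2": 0.30, "mario": 0.15, "2048": 0.15}

def final_weighted_score(scores: dict[str, float]) -> float:
    """Combine per-game scores (each on a 0-1 scale) into the final score.

    Missing games count as 0.0, matching a zeroed or absent result.
    """
    return sum(WEIGHTS[game] * scores.get(game, 0.0) for game in WEIGHTS)

# Example: the Track 1 winner's per-game scores from the table below.
score = final_weighted_score(
    {"2048": 0.860, "mario": 0.186, "pokemon": 0.095, "sc2": 0.000}
)
print(round(score, 3))  # → 0.185
```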
The final evaluation includes hidden test cases designed to probe generalisation, so final scores are typically lower than those observed on the live leaderboard.
LLM Usage Threshold
The challenge evaluates LLM-powered agents. During the final evaluation, individual game scores were treated as non-qualifying (zeroed) if language model usage fell below a minimum threshold.
This ensures the rankings reflect meaningful LLM-driven decision making, rather than approaches where classical solvers or rule-based controllers dominate the agent’s behavior.
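The zeroing rule can be sketched as follows; note that the threshold value and all names here are illustrative placeholders, since the official cutoff is not stated in this post:

```python
# Hypothetical sketch of the LLM-usage qualification rule.
# MIN_LLM_USAGE is an assumed placeholder, NOT the official threshold.
MIN_LLM_USAGE = 0.5

def qualified_score(score: float, llm_usage_ratio: float) -> float:
    """Return the game score, or 0.0 if LLM-driven decisions were too rare.

    llm_usage_ratio is the fraction of agent decisions attributable to
    the language model rather than classical solvers or scripted rules.
    """
    return score if llm_usage_ratio >= MIN_LLM_USAGE else 0.0
```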
Integrity Review
All submissions were reviewed against the published competition rules and clarifications to ensure that results reflect generalisable agent behavior.
Disqualification decisions were based on one or more of the following categories:
- **Hidden-test overfitting via hardcoding:** Submissions containing game-specific routes, coordinates, or scripted behaviours tied directly to the public evaluation environment.
- **Disallowed action interfaces:** Creation of new high-level actions beyond the predefined functions provided by the environment.
- **Tool restriction bypass:** Use of external tools or services beyond what is permitted under the competition rules.
- **Reproducibility or verification failure:** Submissions that could not be reliably reproduced or verified using the required code, artifacts, and logs.
Evaluation Updates Applied Before Finalising Results
To ensure fairness and consistent scoring across teams, the organisers applied two fixes to the Pokémon evaluation before confirming final results: score normalisation was put on a consistent 0–1 scale, and reset handling was improved so that milestone counters are cleared between episodes.
All final standings reflect results after these corrections were applied.
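The two Pokémon fixes can be sketched roughly as below; the class and function names are illustrative, not taken from the actual evaluation code:

```python
def normalise(raw_score: float, max_score: float) -> float:
    """Map a raw milestone score onto a consistent 0-1 scale, clamped."""
    if max_score <= 0:
        return 0.0
    return min(max(raw_score / max_score, 0.0), 1.0)

class MilestoneTracker:
    """Illustrative tracker whose counters are cleared between episodes."""

    def __init__(self) -> None:
        self.reached: set[str] = set()

    def record(self, milestone: str) -> None:
        self.reached.add(milestone)

    def reset(self) -> None:
        # Without this, milestones from one episode would leak into the
        # next and inflate scores -- the bug the update addressed.
        self.reached.clear()
```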
Winners
The final standings are based on the weighted evaluation described above.
Track 1: Lightweight (SLM ≤10B parameters)
| Rank | Team | 2048 | Mario | Pokémon | SC2 | Final Weighted Score |
|---|---|---|---|---|---|---|
| 1 | a-great-toe (yucheon, hwanggeumhwan, yujin_kim, kgb) | 0.860 | 0.186 | 0.095 | 0.000 | 0.185 |
| 2 | artist | 0.000* | 0.000* | 0.000 | 0.333 | 0.100 |
| 3 | Actrix | 0.181 | 0.236 | 0.000 | 0.000 | 0.063 |
Track 2: Open (No parameter limit)
| Rank | Team | 2048 | Mario | Pokémon | SC2 | Final Weighted Score |
|---|---|---|---|---|---|---|
| 1 | emaeon | 0.020 | 0.000* | 0.143 | 1.000 | 0.346 |
| 2 | RickySong | 0.000* | 0.218 | 0.286 | 0.333 | 0.218 |
| 3 | olawale_ibrahim | 0.001 | 0.177 | 0.000 | 0.333 | 0.127 |
\* Score zeroed due to LLM usage below the required threshold in that game.
Thank you to everyone who participated and contributed submissions throughout the challenge. We appreciate the experimentation, engineering effort, and persistence that went into building agents capable of operating across diverse game environments.
We will be reaching out to the winning teams shortly regarding prize distribution and will also share follow-up insights from the challenge with the community.