We’re inviting you to the Orak Game Agent Challenge Townhall. Join us for a walkthrough of the challenge, a deep dive into the problem statement and evaluation, and a live Q&A with the organisers.
Townhall Details
Date & Time: Friday, 9 January 2026 | 11:00 AM KST
Can’t make it live? No problem. We’ll share the recording after the event. You can also comment your questions on this post, and we’ll address as many as we can during the townhall.
I am currently participating in Track 1 of the Orak Challenge.
I have a few specific inquiries regarding the constraints on agent implementation. Could you please clarify the following points?
Internal Operations: Are we permitted to perform tasks such as pre-processing data, post-processing outputs, or enabling tool usage? Do we have full freedom to implement the agent’s internal logic beyond these examples?
LLM Call Frequency: Is there a strict rule that one act call must correspond to exactly one LLM call? Is it permissible to make multiple LLM calls or no calls at all within a single act step?
Model Usage: Is it mandatory to use a single 8B LLM to play all 4 games?
Timeout Settings: Is it allowed to adjust the timeout limit to accommodate computing power limitations?
In short, I want to confirm that there are no limitations on the agent's internal logic or processes; a sketch of the kind of design I have in mind follows below.
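To make the question concrete, here is a rough sketch of the kind of act() implementation I have in mind, covering pre-processing, post-processing, and zero or multiple LLM calls per step. The class and method names are hypothetical, not the official Orak interface:

```python
# Hypothetical sketch of the internal logic asked about above; Agent, act(),
# and the llm callable are illustrative, not the official Orak interface.

class Agent:
    def __init__(self, llm):
        self.llm = llm        # e.g. a wrapper around a local 8B model
        self.history = []     # internal memory kept across act() calls

    def preprocess(self, observation: str) -> str:
        # Example pre-processing: trim the raw game state to what we need.
        return observation.strip()

    def postprocess(self, raw_output: str) -> str:
        # Example post-processing: map free-form LLM text to an action token.
        tokens = raw_output.split()
        return tokens[0].lower() if tokens else "noop"

    def act(self, observation: str) -> str:
        state = self.preprocess(observation)
        # Zero LLM calls: a rule-based shortcut when the decision is trivial.
        if "game_over" in state:
            return "restart"
        # Multiple LLM calls: draft a plan, then pick the concrete action.
        plan = self.llm(f"Plan the next moves given: {state}")
        raw = self.llm(f"Given the plan '{plan}', output one action.")
        action = self.postprocess(raw)
        self.history.append((state, action))
        return action
```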
May we ask what the Super Mario hidden test will look like? Will it use a different level of Super Mario, or will it be based on the level currently used on the public leaderboard with slight changes? If the hidden test uses a different level, converting the game screen into a textual game state may break, since some elements may not be detected and converted correctly. Also, is it possible that Mario's position (X coordinate) is sometimes reported incorrectly?
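To make the last point concrete, this is the kind of sanity check I am considering running on the converted game state; the function, its arguments, and the max_speed threshold are all hypothetical:

```python
# Plausibility check for the reported X coordinate; 10 units per step is an
# assumed maximum movement speed, not a value from the challenge.

def check_mario_x(prev_x, curr_x, max_speed=10):
    """Flag frames where Mario's reported X coordinate jumps implausibly."""
    if prev_x is not None and abs(curr_x - prev_x) > max_speed:
        return False  # likely a mis-detected position
    return True
```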
The criteria state that if two or more teams are tied on the primary evaluation metric, they are ranked by the criteria below. Does this mean we can use a classification ML model for games like 2048 or Super Mario instead of an LLM, since the tie-breakers reward a lower mean number of LLM inference calls per evaluation episode? (A worked example of the two metrics follows after the quoted criteria.)
– Lower model complexity — measured as Aggregate Total Parameters (ATP). ATP is the sum of total parameters across all distinct models used during the final official evaluation. “Total parameters” include all active and frozen weights, embeddings, adapters, and LoRA modules. For Mixture-of-Experts, count all experts (total parameters), not only the activated experts.
– Lower mean LLM inference calls per evaluation episode (fewer is better). Measured as: total inference calls made during the official evaluation ÷ number of evaluation episodes.
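For concreteness, this is how I understand these two tie-breakers would be computed; every model name and count below is made up for illustration:

```python
# Illustrative arithmetic for the two tie-breakers quoted above.

# Aggregate Total Parameters (ATP): sum of total parameters of every
# distinct model used in the final evaluation; for MoE, count ALL experts.
models = {
    "base_llm_8b": 8_000_000_000,        # all weights, incl. embeddings
    "lora_adapter": 40_000_000,          # adapters and LoRA modules count too
    "moe_helper_2x7b": 14_000_000_000,   # both experts, not only the active one
}
atp = sum(models.values())               # 22_040_000_000

# Mean LLM inference calls per evaluation episode.
total_inference_calls = 1_200
num_episodes = 20
mean_calls = total_inference_calls / num_episodes  # 60.0
```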
Is it allowed to use Python for analysis and decision-making instead of the LLM, so as to minimize LLM inference calls? The discussion mentions that tool usage may be considered cheating if it explicitly exploits a known solution to the games, for example using web search to retrieve established solutions. So, can we analyse the game state in Python and decide with rule-based logic instead of letting the LLM decide?
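To be concrete, this is roughly the kind of rule-based decision logic I mean, sketched for 2048; the board encoding and the simulate() helper are hypothetical, not from the Orak environment:

```python
# Rule-based 2048 move selection with no LLM call; simulate(board, move)
# is an assumed helper returning the resulting board, or None if illegal.

def count_empty(board):
    return sum(cell == 0 for row in board for cell in row)

def choose_move(board, simulate):
    best_move, best_score = None, -1
    for move in ("up", "down", "left", "right"):
        result = simulate(board, move)
        if result is None:
            continue
        score = count_empty(result)  # heuristic: keep the board open
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```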
Currently, the gRPC server counts each action request as a step, not each response. As a result, even if the agent has not responded 200 times, sending multiple actions from a use_tool action makes the server count each one as a step and conclude that the 200-step limit has been reached. For example, when executing the continue_dialog action, the client counts it as one step, but the server counts each low-level action (a) that advances the dialog box as a separate step. I changed this to client-side step counting, and I'm wondering whether that is legal.
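For reference, this is roughly what my client-side change looks like; the class, stub, and expand() names are illustrative, not the actual gRPC stubs:

```python
# Client-side step counting: one high-level action = one step, regardless of
# how many low-level requests it expands into. All names are hypothetical.

MAX_STEPS = 200

def expand(action):
    # Assumed expansion of a composite action into low-level requests,
    # e.g. continue_dialog -> repeated "a" presses.
    return ["a"] * 3 if action == "continue_dialog" else [action]

class GameClient:
    def __init__(self, stub):
        self.stub = stub
        self.steps = 0  # counted once per high-level action

    def act(self, action):
        if self.steps >= MAX_STEPS:
            raise RuntimeError("episode step budget exhausted")
        for low_level in expand(action):
            # Under server-side counting, each request here costs a step;
            # under client-side counting, the whole loop costs one.
            self.stub.SendAction(low_level)
        self.steps += 1
```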
I’m curious about how much information is allowed in the prompt. The current rules seem ambiguous.
I have a quick question regarding model design for the competition.
To improve the performance of a small language model (SLM), is it allowed to use retrieval-augmented generation (RAG) or similar augmentation methods during generation?
If RAG is allowed, could you also let me know whether using a separate embedding model (e.g., OpenAI API or local embedding model) or performing local file I/O (e.g., loading retrieved documents) would be permitted?
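To illustrate what I am asking about, here is a minimal local RAG sketch; the toy hashing "embedding" stands in for a real embedding model, and the strategy_notes directory and its *.txt files are illustrative:

```python
# Minimal local RAG: embed documents and query, retrieve top-k by cosine
# similarity, using only local file I/O. The hashing embedding below is a
# toy placeholder for a real embedding model.

import math
from pathlib import Path

def embed(text, dim=256):
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, doc_dir="strategy_notes", k=2):
    # Local file I/O: load pre-written strategy notes from disk.
    docs = [p.read_text() for p in Path(doc_dir).glob("*.txt")]
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# The retrieved snippets would then be prepended to the SLM prompt:
# prompt = "\n".join(retrieve(game_state)) + "\n" + game_state
```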
As an additional question, are there any approaches or limitations for solving the multi-game setting that were not explored in the Orak paper?