We’re inviting you to the Orak Game Agent Challenge Townhall. Join us for a walkthrough of the challenge, a deep dive into the problem statement and evaluation, and a live Q&A with the organisers.
Townhall Details
Date & Time: Friday, 9 January 2026 | 11:00 AM KST
Can’t make it live? No problem. We’ll share the recording after the event. You can also comment your questions on this post, and we’ll address as many as we can during the townhall.
I am currently participating in Track 1 of the Orak Challenge.
I have a few specific inquiries regarding the constraints on agent implementation. Could you please clarify the following points?
Internal Operations: Are we permitted to perform tasks such as pre-processing data, post-processing outputs, or enabling tool usage? Do we have full freedom to implement the agent’s internal logic beyond these examples?
LLM Call Frequency: Is there a strict rule that one act call must correspond to exactly one LLM call? Is it permissible to make multiple LLM calls or no calls at all within a single act step?
Model Usage: Is it mandatory to use a single 8B LLM to play all 4 games?
Timeout Settings: Is it allowed to adjust the timeout limit to accommodate computing power limitations?
In short, I want to confirm that there are no limitations on the agent's internal logic or processes; a sketch of the kind of design I have in mind follows below.
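To make the question concrete, here is a rough sketch of the kind of act() implementation I have in mind, covering pre-processing, post-processing, and zero or multiple LLM calls per step. The class and method names are hypothetical, not the official Orak interface:

```python
# Hypothetical sketch of the internal logic asked about above; Agent, act(),
# and the llm callable are illustrative, not the official Orak interface.

class Agent:
    def __init__(self, llm):
        self.llm = llm        # e.g. a wrapper around a local 8B model
        self.history = []     # internal memory kept across act() calls

    def preprocess(self, observation: str) -> str:
        # Example pre-processing: trim the raw game state to what we need.
        return observation.strip()

    def postprocess(self, raw_output: str) -> str:
        # Example post-processing: map free-form LLM text to an action token.
        tokens = raw_output.split()
        return tokens[0].lower() if tokens else "noop"

    def act(self, observation: str) -> str:
        state = self.preprocess(observation)
        # Zero LLM calls: a rule-based shortcut when the decision is trivial.
        if "game_over" in state:
            return "restart"
        # Multiple LLM calls: draft a plan, then pick the concrete action.
        plan = self.llm(f"Plan the next moves given: {state}")
        raw = self.llm(f"Given the plan '{plan}', output one action.")
        action = self.postprocess(raw)
        self.history.append((state, action))
        return action
```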
May we ask what the Super Mario hidden test will look like? Will it use a different level of Super Mario, or will it be based on the level currently used on the public leaderboard with slight changes? If the hidden test uses a different level, converting the game screen into a textual game state may break, since some elements may not be detected and converted correctly. Also, is it possible that Mario's position (X coordinate) is sometimes reported incorrectly?
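To make the last point concrete, this is the kind of sanity check I am considering running on the converted game state; the function, its arguments, and the max_speed threshold are all hypothetical:

```python
# Plausibility check for the reported X coordinate; 10 units per step is an
# assumed maximum movement speed, not a value from the challenge.

def check_mario_x(prev_x, curr_x, max_speed=10):
    """Flag frames where Mario's reported X coordinate jumps implausibly."""
    if prev_x is not None and abs(curr_x - prev_x) > max_speed:
        return False  # likely a mis-detected position
    return True
```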
The criteria state that if two or more teams are tied on the primary evaluation metric, they are ranked by the criteria below. Does this mean we can use a classification ML model for games like 2048 or Super Mario instead of an LLM, since the tie-breakers reward a lower mean number of LLM inference calls per evaluation episode? (A worked example of the two metrics follows after the quoted criteria.)
– Lower model complexity — measured as Aggregate Total Parameters (ATP). ATP is the sum of total parameters across all distinct models used during the final official evaluation. “Total parameters” include all active and frozen weights, embeddings, adapters, and LoRA modules. For Mixture-of-Experts, count all experts (total parameters), not only the activated experts.
– Lower mean LLM inference calls per evaluation episode (fewer is better). Measured as: total inference calls made during the official evaluation ÷ number of evaluation episodes.
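For concreteness, this is how I understand these two tie-breakers would be computed; every model name and count below is made up for illustration:

```python
# Illustrative arithmetic for the two tie-breakers quoted above.

# Aggregate Total Parameters (ATP): sum of total parameters of every
# distinct model used in the final evaluation; for MoE, count ALL experts.
models = {
    "base_llm_8b": 8_000_000_000,        # all weights, incl. embeddings
    "lora_adapter": 40_000_000,          # adapters and LoRA modules count too
    "moe_helper_2x7b": 14_000_000_000,   # both experts, not only the active one
}
atp = sum(models.values())               # 22_040_000_000

# Mean LLM inference calls per evaluation episode.
total_inference_calls = 1_200
num_episodes = 20
mean_calls = total_inference_calls / num_episodes  # 60.0
```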
Is it allowed to use Python for analysis and decision-making instead of the LLM, so as to minimize LLM inference calls? The discussion mentions that tool usage may be considered cheating if it explicitly exploits a known solution to the games, for example using web search to retrieve established solutions. So, can we analyse the game state in Python and decide with rule-based logic instead of letting the LLM decide?
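To be concrete, this is roughly the kind of rule-based decision logic I mean, sketched for 2048; the board encoding and the simulate() helper are hypothetical, not from the Orak environment:

```python
# Rule-based 2048 move selection with no LLM call; simulate(board, move)
# is an assumed helper returning the resulting board, or None if illegal.

def count_empty(board):
    return sum(cell == 0 for row in board for cell in row)

def choose_move(board, simulate):
    best_move, best_score = None, -1
    for move in ("up", "down", "left", "right"):
        result = simulate(board, move)
        if result is None:
            continue
        score = count_empty(result)  # heuristic: keep the board open
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```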
Currently, the gRPC server counts each action request as a step, not each response. As a result, even if the agent has not responded 200 times, sending multiple actions from a use_tool action makes the server count each one as a step and conclude that the 200-step limit has been reached. For example, when executing the continue_dialog action, the client counts it as one step, but the server counts each low-level action (a) that advances the dialog box as a separate step. I changed this to client-side step counting, and I'm wondering whether that is legal.
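For reference, this is roughly what my client-side change looks like; the class, stub, and expand() names are illustrative, not the actual gRPC stubs:

```python
# Client-side step counting: one high-level action = one step, regardless of
# how many low-level requests it expands into. All names are hypothetical.

MAX_STEPS = 200

def expand(action):
    # Assumed expansion of a composite action into low-level requests,
    # e.g. continue_dialog -> repeated "a" presses.
    return ["a"] * 3 if action == "continue_dialog" else [action]

class GameClient:
    def __init__(self, stub):
        self.stub = stub
        self.steps = 0  # counted once per high-level action

    def act(self, action):
        if self.steps >= MAX_STEPS:
            raise RuntimeError("episode step budget exhausted")
        for low_level in expand(action):
            # Under server-side counting, each request here costs a step;
            # under client-side counting, the whole loop costs one.
            self.stub.SendAction(low_level)
        self.steps += 1
```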
I’m curious about how much information is allowed in the prompt. The current rules seem ambiguous.
I have a quick question regarding model design for the competition.
To improve the performance of a small language model (SLM), is it allowed to use retrieval-augmented generation (RAG) or similar augmentation methods during generation?
If RAG is allowed, could you also let me know whether using a separate embedding model (e.g., OpenAI API or local embedding model) or performing local file I/O (e.g., loading retrieved documents) would be permitted?
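To illustrate what I am asking about, here is a minimal local RAG sketch; the toy hashing "embedding" stands in for a real embedding model, and the strategy_notes directory and its *.txt files are illustrative:

```python
# Minimal local RAG: embed documents and query, retrieve top-k by cosine
# similarity, using only local file I/O. The hashing embedding below is a
# toy placeholder for a real embedding model.

import math
from pathlib import Path

def embed(text, dim=256):
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, doc_dir="strategy_notes", k=2):
    # Local file I/O: load pre-written strategy notes from disk.
    docs = [p.read_text() for p in Path(doc_dir).glob("*.txt")]
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

# The retrieved snippets would then be prepended to the SLM prompt:
# prompt = "\n".join(retrieve(game_state)) + "\n" + game_state
```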
As an additional question, are there any approaches or limitations for solving the multi-game setting that were not explored in the Orak paper?