Hello everyone,
We’re inviting you to the Orak Game Agent Challenge Townhall. Join us for a walkthrough of the challenge, a deep dive into the problem statement and evaluation, and a live Q&A with the organisers.
Townhall Details
What to Expect
- Challenge walkthrough
- Deep dive into the problem statement and evaluation
- Live Q&A with the panel
Panellists
- Dongmin Park, AI Researcher, KRAFTON
- Beongjun Choi, Ph.D., Deep Learning Researcher / Technical PM, KRAFTON
- Jonghyun Lee, AI Research Scientist, KRAFTON AI
- Jaewoo Ahn, Ph.D. Candidate, Vision & Learning Lab, SNU
- Minkyu Kim, MS Student, KAIST AI
- Howon Lee, Technical Product Manager (AI R&D)
Can’t make it live? No problem. We’ll share the recording after the event. You can also comment your questions on this post, and we’ll address as many as we can during the townhall.
Looking forward to seeing you there,
Team AICrowd
Hello,
I am currently participating in Track 1 of the Orak Challenge.
I have a few specific inquiries regarding the constraints on agent implementation. Could you please clarify the following points?
- Internal Operations: Are we permitted to perform tasks such as pre-processing data, post-processing outputs, or enabling tool usage? Do we have full freedom to implement the agent’s internal logic beyond these examples?
- LLM Call Frequency: Is there a strict rule that one `act` call must correspond to exactly one LLM call? Is it permissible to make multiple LLM calls, or none at all, within a single `act` step?
- Model Usage: Is it mandatory to use a single 8B LLM to play all 4 games?
- Timeout Settings: Is it allowed to adjust the timeout limit to accommodate computing power limitations?
I want to make sure there are no limitations on the internal logic or processes; a sketch of the kind of freedom we have in mind is below.
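To make the first two points concrete, here is a minimal sketch of the agent structure we are asking about. Every name in it (the `Agent` class, the `act` signature, `llm.generate`, the action strings) is a hypothetical placeholder, not the official challenge API:

```python
# Hypothetical sketch only -- none of these names come from the
# official challenge interface.

class Agent:
    def __init__(self, llm):
        self.llm = llm      # e.g. a single 8B model, if that is required
        self.memory = []    # internal state carried across act() calls

    def act(self, game_state: str) -> str:
        # Pre-processing: normalize the raw textual game state.
        features = game_state.strip().lower()
        self.memory.append(features)

        # Zero LLM calls: a rule-based shortcut for trivial situations.
        if "no enemies" in features:
            return "move_right"

        # Multiple LLM calls within one act() step: propose, then verify.
        proposal = self.llm.generate(f"State: {features}\nBest action?")
        verdict = self.llm.generate(f"Is '{proposal}' safe? Answer yes/no.")

        # Post-processing: fall back to a safe default if the check fails.
        return proposal if verdict.strip().lower().startswith("yes") else "wait"
```

Is this style of agent (pre-/post-processing, internal memory, a variable number of LLM calls per step) permitted?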
Thank you.
Hi dear organizers and panelists,
Our team has a few questions:
- May we ask how the Super Mario hidden test will be constructed? Will it use a different level of the game, or the level currently used on the public leaderboard with slight modifications? If it uses a different level, the conversion of the game screen to a textual game state may break, because some elements may not be correctly detected and converted. Also, is it possible that Mario's position (X coordinate) is sometimes reported incorrectly?
- May we ask how the Pokémon Red hidden test will be constructed? Will it change the task content, introduce new tasks, or only modify map details? If the task content changes or new tasks are added, we would likely need to adjust our prompt to tell the LLM what to do.
- The criteria state that if two or more teams are tied on the primary evaluation metric, they are ranked by the criteria below. Does this mean we could use a classification ML model for games like 2048 or Super Mario instead of an LLM, since the tiebreakers reward a lower mean number of LLM inference calls per evaluation episode? (A small worked example of our reading follows the quoted criteria.)
– Lower model complexity — measured as Aggregate Total Parameters (ATP). ATP is the sum of total parameters across all distinct models used during the final official evaluation. “Total parameters” include all active and frozen weights, embeddings, adapters, and LoRA modules. For Mixture-of-Experts, count all experts (total parameters), not only the activated experts.
– Lower mean LLM inference calls per evaluation episode (fewer is better). Measured as: total inference calls made during the official evaluation ÷ number of evaluation episodes.
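For clarity, here is a worked example of how we understand these two tiebreakers. All numbers are invented for illustration; nothing here comes from the organizers:

```python
# Worked example of the two tiebreakers as we read them.
# All numbers below are invented for illustration only.

# Aggregate Total Parameters (ATP): sum of total parameters across all
# distinct models used in the final evaluation, counting active and
# frozen weights, embeddings, adapters, and LoRA modules. For MoE
# models, all experts would count, not only the activated ones.
base_llm   = 8_000_000_000   # one 8B LLM (all weights)
lora       =    20_000_000   # a LoRA adapter still counts toward ATP
classifier =     5_000_000   # a small non-LLM model also counts
atp = base_llm + lora + classifier   # 8_025_000_000

# Mean LLM inference calls per evaluation episode (fewer is better):
# total inference calls during the official evaluation / episodes.
total_calls = 1_200
episodes = 40
mean_calls = total_calls / episodes  # 30.0 calls per episode

print(f"ATP = {atp:,}; mean calls/episode = {mean_calls}")
```

Is this the intended way to count, in particular that every distinct model (LLM or not) contributes to ATP?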
- Is it allowed to use Python to do the analysis and make decisions, instead of the LLM, in order to minimize LLM inference calls? The discussion mentions that tool usage may be considered cheating if it explicitly exploits a known solution to the games, for example using web search to retrieve established solutions. Given that, can we analyze the game state in Python and decide with rule-based logic instead of letting the LLM decide? (A sketch of what we mean follows.)
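As an illustration, here is a rough sketch of a purely rule-based decision step for 2048 with no LLM call. The board encoding and action names are our assumptions, not the challenge interface:

```python
# Hypothetical rule-based 2048 move selection: greedy one-step
# lookahead on merge value, no LLM involved. Board encoding and
# action names are our own assumptions.

def rule_based_move(board: list[list[int]]) -> str:
    """Pick a 2048 move by scoring the merges each direction allows."""
    def merge_score(line: list[int]) -> int:
        # Sum the values produced by merging equal adjacent tiles.
        tiles = [v for v in line if v]
        score, i = 0, 0
        while i < len(tiles) - 1:
            if tiles[i] == tiles[i + 1]:
                score += tiles[i] * 2
                i += 2
            else:
                i += 1
        return score

    rows = board
    cols = [list(c) for c in zip(*board)]
    scores = {
        "left":  sum(merge_score(r) for r in rows),
        "right": sum(merge_score(r[::-1]) for r in rows),
        "up":    sum(merge_score(c) for c in cols),
        "down":  sum(merge_score(c[::-1]) for c in cols),
    }
    return max(scores, key=scores.get)

# Example: "left" wins here because the two 2s in the top row merge.
print(rule_based_move([[2, 2, 0, 0],
                       [4, 0, 0, 0],
                       [0, 0, 0, 0],
                       [0, 0, 0, 0]]))
```

Would this kind of non-LLM decision logic be within the rules, given that it does not retrieve any established solution from external sources?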
Thank you for your help with the issues above.