Announcement: Clarifications on Prompts, Fine-tuning, Tool Usage, and Hidden Test Cases

aicrowd_team · January 13, 2026, 1:59pm

We would like to share clarifications and updates regarding prompts, fine-tuning datasets, tool usage, and hidden test cases. Some points below clarify or update discussions from the Town Hall.

1) Prompts and Fine-tuning Datasets

There are no restrictions on prompts or fine-tuning datasets.

Based on participant feedback, we recognize that some prompt-related constraints discussed during the Town Hall caused confusion and appeared inconsistent with the original competition rules.
To address this, we confirm that no additional constraints are imposed on prompt design or fine-tuning datasets.
Participants are free to design prompts and fine-tune their models in any manner they choose.

2) Tool Call Usage

Only a calculator tool is allowed.

Tool calls are restricted to the use of a calculator only.
The use of external tools such as web search, external APIs, or other third-party services is not permitted.
RAG (retrieval-augmented generation) and internal memory mechanisms are considered part of the agent’s internal architecture and are not classified as tool calls. Their use is allowed.
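
As an illustration of what an arithmetic-only calculator tool could look like, here is a minimal sketch. It assumes that "calculator" means evaluating plain arithmetic expressions; the function name `calculate` and the exact operator set are hypothetical choices, not part of the official rules. The expression is parsed with Python's `ast` module and only whitelisted numeric nodes are evaluated, so function calls, attribute access, and names are rejected.

```python
import ast
import operator
from typing import Union

Number = Union[int, float]

# Whitelisted arithmetic operators (an assumed set); anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def calculate(expression: str) -> Number:
    """Evaluate a pure arithmetic expression; raise ValueError otherwise."""

    def _eval(node: ast.AST) -> Number:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Accept only int/float literals (bool and str constants are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError as exc:
        raise ValueError("not a valid arithmetic expression") from exc
    return _eval(tree)
```

Under these assumptions, `calculate("2 + 3 * 4")` evaluates normally, while something like `calculate("__import__('os')")` raises `ValueError` because call nodes are not on the whitelist.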

3) Hidden Test Cases and Game-specific Details

This section clarifies how hidden test cases may differ from the live evaluation environments.

2048

The board size may be extended to an arbitrary N×M grid.
This corrects an earlier response from the Town Hall Q&A. We appreciate your understanding.
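
Since the board may be N×M, any internal move simulation that hard-codes a 4×4 grid could break on hidden cases. The sketch below shows one shape-agnostic way to simulate a move. It assumes the standard 2048 merge rule (each tile merges at most once per move) and a list-of-lists board with 0 for empty cells; the names `Grid`, `merge_row_left`, and `move` are hypothetical and not taken from the competition environment.

```python
from typing import List

Grid = List[List[int]]  # assumed encoding: 0 marks an empty cell


def merge_row_left(row: List[int]) -> List[int]:
    """Slide tiles left and merge equal neighbours once (standard 2048 rule)."""
    tiles = [value for value in row if value != 0]
    merged: List[int] = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merge the pair into one tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    # Pad with empties so the row keeps its original length.
    return merged + [0] * (len(row) - len(merged))


def move(grid: Grid, direction: str) -> Grid:
    """Return the board after a move; works for any rectangular N x M grid."""
    if direction == "left":
        return [merge_row_left(row) for row in grid]
    if direction == "right":
        return [merge_row_left(row[::-1])[::-1] for row in grid]
    # For vertical moves, transpose so each column can be handled as a row.
    columns = [list(col) for col in zip(*grid)]
    if direction == "up":
        moved = [merge_row_left(col) for col in columns]
    elif direction == "down":
        moved = [merge_row_left(col[::-1])[::-1] for col in columns]
    else:
        raise ValueError(f"unknown direction: {direction}")
    return [list(row) for row in zip(*moved)]  # transpose back
```

Because the row and column lengths are read from the grid itself, the same code handles a 2×5 board and a 4×4 board without changes.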

Super Mario

The map layout may change in hidden test cases.

StarCraft II (SC2)

One or more of the following may change in hidden test cases:
- bot_race
- bot_difficulty
- bot_build
- map_idx
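
One way to check robustness against these variations locally is to sample combinations of the four parameters and evaluate the agent on each. The sketch below only builds the configurations. The parameter names come from the list above, but every value in `SEARCH_SPACE` is a placeholder chosen for illustration, and `sample_configs` is a hypothetical helper, not part of the competition code.

```python
import itertools
import random
from typing import Dict, List, Sequence

# Placeholder values for illustration only; substitute whatever your local
# SC2 environment actually accepts for each parameter.
SEARCH_SPACE: Dict[str, Sequence] = {
    "bot_race": ["Zerg", "Terran", "Protoss"],
    "bot_difficulty": [1, 2, 3],
    "bot_build": [0, 1, 2],
    "map_idx": [0, 1, 2],
}


def sample_configs(space: Dict[str, Sequence], k: int, seed: int = 0) -> List[dict]:
    """Return up to k distinct parameter combinations, sampled reproducibly."""
    keys = list(space)
    all_configs = [
        dict(zip(keys, values))
        for values in itertools.product(*(space[key] for key in keys))
    ]
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return rng.sample(all_configs, min(k, len(all_configs)))
```

Each returned dictionary can then be passed to whatever launcher you use for local evaluation.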

Pokémon

The existing seven milestones will not change.
However, the following aspects may vary:
- Map state coordinates
- Names of Pokémon, NPCs, and skills

Step definition
- One agent action corresponds to one step.
- The maximum number of steps remains 200.

Action constraints
- Creating new high-level actions beyond the provided predefined functions is not allowed.
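
The step and action constraints above can be sketched as a simple episode loop. This is only an illustration: the action names in `ALLOWED_ACTIONS` are hypothetical stand-ins for the provided predefined functions, `run_episode`, `policy`, and `env_step` are assumed interfaces, and how the real environment treats an invalid action is not specified here.

```python
from typing import Callable, Optional, Tuple

MAX_STEPS = 200  # step limit stated in the announcement

# Hypothetical names standing in for the provided predefined functions.
ALLOWED_ACTIONS = frozenset({"move_to", "interact", "use_skill"})


def run_episode(
    policy: Callable[[Optional[object]], str],
    env_step: Callable[[str], Tuple[object, bool]],
    max_steps: int = MAX_STEPS,
) -> int:
    """Run one episode and return the number of steps consumed."""
    steps = 0
    done = False
    observation = None
    while steps < max_steps and not done:
        action = policy(observation)
        # Reject anything outside the predefined action set.
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"action not in predefined set: {action}")
        observation, done = env_step(action)
        steps += 1  # one agent action corresponds to one step
    return steps
```

Counting one step per emitted action makes it easy to see how much of the 200-step budget a multi-stage plan would consume before it is submitted.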

We hope these clarifications reduce ambiguity and help participants focus on building robust and generalizable agents.
For further questions, please use the discussion channels.

Thank you for your participation.

ChoiSoojin · January 17, 2026, 8:48am

I trained an RL agent on context extracted from the Orak server, recorded the trajectories in log files, then distilled knowledge from them into part of a RAG prompt. Is this considered valid, or is it a “usage of walkthroughs” violation? It seems it could be considered a form of memory, which, by the way, is the same mechanism as in the LLM itself: LLMs were trained on text corpora that include such “Pokémon walkthroughs”, so a restriction on the usage of walkthroughs may be ambiguous.

howon_lee · January 20, 2026, 6:53am

Hi Soojin,

We discussed this internally and decided not to impose extra restrictions on RAG, prompts, or internal memory (including knowledge distilled from trajectory logs), since the “walkthrough” constraint mentioned in the Town Hall can be ambiguous in practice.

Instead, we’ll assess generalization via hidden test cases: methods that overfit to known scenarios won’t help if they don’t transfer to unseen evaluations.

heatz123 · February 2, 2026, 1:10am

Hi @howon_lee, does the “calculator” tool mean basic arithmetic only, or does it include running code (e.g., a Python interpreter)?

If we implement code execution or environment simulation inside the agent itself (as code-as-policies style approaches), is that allowed, or would that be considered equivalent to using a disallowed tool?

Thank you.

howon_lee · February 4, 2026, 7:49am

A Python interpreter (i.e., a general “run code” tool) is not allowed — tool usage is restricted to the calculator only.
Internal code execution inside your agent (e.g., planning/search, rule-based policies, internal simulation/rollouts) can be allowed as part of the agent implementation.
However, if your agent code includes anything that effectively bypasses the tool restrictions (e.g., embedding a general-purpose interpreter, calling external services / network access, hidden tool-like execution), it may be treated as a rules violation and could lead to disqualification.