- Regarding participation eligibility, is my understanding correct?
  - Phase 1: All teams can participate.
  - Phase 2: Only teams that successfully submit in Phase 1 can participate.
  - Final Round: Only the top 10 teams in Phase 2, ranked by automatic evaluation, can participate.
- How is the final submission selected? Can we choose a submission other than our best-scoring one on the leaderboard?
- Is there no length limit for the final evaluation? (The limit is 75 tokens for automatic evaluation.)
  Full responses are manually checked for hallucinations.
- How is the generation of the first token detected?
  A 10-second timeout starts after the first token is generated.
- How is time per sample measured in the batch generation pipeline?
  Only answer texts generated within 30 seconds are considered.
- If we exceed the time limit, will we be disqualified immediately, or will only that sample be counted as wrong (or missing)?
- Is a missing answer required to be an exact match to “I don’t know,” or are similar responses acceptable in manual evaluation? Which of the following statements is correct?
  - Missing (e.g., “I don’t know”, “I’m sorry I can’t find …”) → Score: 0.0
  - All missing answers should return a standard response: “I don’t know.”
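
Until these points are clarified, a participant could guard against the quoted limits locally. The sketch below is one possible reading, not the organizers' measurement method: it assumes the 10-second window is counted from the arrival of the first token, that the 30-second limit is counted from the start of the call, and that a timed-out sample should fall back to the standard missing answer. The names `answer_with_budget`, `POST_FIRST_TOKEN_TIMEOUT_S`, `TOTAL_TIMEOUT_S`, and `FALLBACK_ANSWER` are hypothetical, and the exact fallback string is itself one of the open questions above.

```python
import time
from typing import Callable, Iterable, List, Optional

# Limits quoted in the questions above. How the organizers actually
# measure them is NOT confirmed; these are assumptions.
POST_FIRST_TOKEN_TIMEOUT_S = 10.0  # assumed: counted from the first token
TOTAL_TIMEOUT_S = 30.0             # assumed: counted from the start of the call
# Assumed wording; whether an exact match is required is an open question.
FALLBACK_ANSWER = "I don't know"


def answer_with_budget(
    token_stream: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Join streamed tokens, or return the fallback if a budget is exceeded.

    The checks run only when a token arrives, so a generator that blocks
    forever is not interrupted by this function.
    """
    start = clock()
    first_token_at: Optional[float] = None
    tokens: List[str] = []
    for token in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # the 10-second window opens here
        if now - first_token_at > POST_FIRST_TOKEN_TIMEOUT_S:
            return FALLBACK_ANSWER
        if now - start > TOTAL_TIMEOUT_S:
            return FALLBACK_ANSWER
        tokens.append(token)
    text = "".join(tokens).strip()
    # An empty generation is also treated as a missing answer.
    return text if text else FALLBACK_ANSWER
```

The clock is injectable so that the timing logic can be checked without real delays; in a real pipeline `token_stream` would be whatever streaming iterator the model exposes.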