Clarification about the evaluation process

  1. Regarding participation eligibility, is my understanding correct?
  • Phase 1: All teams can participate
  • Phase 2: Only teams that successfully submit in Phase 1 can participate
  • Final Round: Only the top 10 teams in Phase 2, based on automatic evaluation, can participate
  2. How is the final submission selected? Can we choose a submission other than the best one on the leaderboard?

  3. Is there no length limit for the final evaluation? (The limit is 75 tokens for automatic evaluation)

    Full responses are manually checked for hallucinations.

  4. How is the generation of the first token detected?

    A 10-second timeout starts after the first token is generated.

  5. How is time per sample measured in the batch generation pipeline?

    Only answer texts generated within 30 seconds are considered.

  6. If we exceed the time limit, will we be disqualified immediately, or will that sample just be counted as wrong (or missing)?

  7. Is a missing answer required to be an exact match to “I don’t know,” or are similar responses acceptable in manual evaluation? Which of the following statements is correct?

    Missing (e.g., “I don’t know”, “I’m sorry I can’t find …”) → Score: 0.0

    All missing answers should return a standard response: “I don’t know.”


@yilun_jin8 @mohanty
Can you check these questions?

@yilun_jin8
Any updates? If there are any questions you can’t answer, please just let me know.
Thanks.

@yilun_jin8

To be honest, I don’t really understand why you replied, made changes, then deleted your response and are now staying silent about this post.
If a certain question can’t be answered, that’s totally fine. Please just let me know.

Hi.

I agree that your questions are highly pertinent and should be answered as soon as possible.

However, I honestly cannot answer these questions myself and have to raise them with the organizers from Meta. We have summarized your questions and raised them to the organizers multiple times, but we have not received any reply yet.

I understand your anxiety about the ambiguity of the rules, and I will do my best to get answers from the organizers. The answers to your questions would benefit all participants.

Yilun.

I see, thank you for the clarification.
I appreciate you reaching out to the organizers. I will proceed on the assumption that we may not receive a response.

Thanks!

Hi @Camaro
We got responses to some of your questions.

  1. Eligibility: Your interpretations are correct. Specifically, the ‘top 10 teams’ include the top 10 teams on both the ‘all’ and the ‘ego’ leaderboards.

  2. Final Submission Selection: By default, we select the best submission based on automatic evaluation results. Participants can request to use a different submission for the final (human) evaluation. Most likely, we will send a spreadsheet to all ‘top 10’ teams and ask them to select.

  3. Length Limit: For the final (human) evaluation, we will truncate responses in the same way as in the automatic evaluation.

  4. Timeout after first token: This is not finalized yet.

  5. Time per sample: This is not finalized yet.

  6. Outcome of time limit exceed: This is not finalized yet.

  7. Missing answers: These must be exactly ‘I don’t know’ (see the sketch below).

Hopefully these answers help. They will also be added to the rules.
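
To make the “missing answer” rule concrete, here is a minimal sketch of how a submission could post-process its output so that missing answers match the required literal string. This is not official evaluation code; the function name `finalize_response` and the overall structure are assumptions for illustration, and the 75-token truncation mentioned in question 3 is applied by the evaluator, not by this code.

```python
# Minimal sketch (not official evaluation code): make sure "missing"
# answers are returned as the exact literal string required by the rules.
MISSING_ANSWER = "I don't know"  # must match verbatim (answer 7)

def finalize_response(raw_answer: str | None) -> str:
    """Return the model's answer, or exactly "I don't know" when there is
    no usable answer. Truncation to 75 tokens (question 3) is applied by
    the evaluator, not here."""
    if raw_answer is None or not raw_answer.strip():
        return MISSING_ANSWER
    return raw_answer.strip()
```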

Thanks for the clarification!

Hi,

We have received some additional clarifications regarding the time limit.

We will abandon the ‘10s after first token’ and ‘30s timeout’ rules. Instead, we will apply a simpler limit of 10s per turn. Under batch generation, each batch will be given 10s * batch_size.

Violating the time limits will result in failure of the whole submission.
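
For illustration, here is a minimal sketch of how a team might self-check this budget locally. It assumes a hypothetical `generate_batch` callable and a wall-clock interpretation of the limit; it is not the official evaluator.

```python
import time

# Minimal sketch of a local self-check for the per-batch time budget.
# Assumptions (not official evaluator code): a hypothetical generate_batch()
# callable and a wall-clock interpretation of the limit.
PER_TURN_BUDGET_S = 10.0  # 10s per turn, as stated above

def run_batch_with_budget(generate_batch, batch):
    budget = PER_TURN_BUDGET_S * len(batch)  # 10s * batch_size per batch
    start = time.monotonic()
    answers = generate_batch(batch)          # model inference for the whole batch
    elapsed = time.monotonic() - start
    if elapsed > budget:
        # In the real evaluation, exceeding the limit fails the whole submission;
        # here we just raise so the violation is caught during local testing.
        raise RuntimeError(f"Batch took {elapsed:.1f}s, budget was {budget:.1f}s")
    return answers
```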
