Hello Participants,
Thank you for your efforts and valuable contributions to advancing dialogue systems through this challenge. This note outlines the final evaluation process and explains how the final scores were determined. You can find the leaderboard, along with the top three winners for each track, on the challenge page.
Evaluation Process
Task 1
Automatic Evaluation
- Final Score: Average of Function Score and BLEURT Score.
- Function Score: Measures how well the submission matches the gold reference in terms of function names, argument names, and argument values. For argument values where exact matches are not required, semantic similarity is used; if the similarity exceeds a threshold, the value is counted as a match.
- BLEURT Score: A reference-based metric with strong correlation to human evaluation (validated in CPDC2023). For this challenge, we confirmed the correlation again using results from the Warmup Round and Round 1.
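The Task 1 scoring scheme above can be sketched as follows. This is an illustrative sketch only: the call/argument structure, the `similarity` callback, and the 0.8 threshold are assumptions for demonstration, not the organizers' actual implementation.

```python
def function_score(pred, gold, similarity=None, threshold=0.8):
    """Score one predicted function call against the gold reference.

    Checks the function name, each gold argument name, and each argument
    value. `similarity` is any semantic-similarity function returning a
    value in [0, 1]; values above `threshold` count as matches for
    arguments where exact matching is not required (hypothetical design).
    """
    checks = [pred.get("name") == gold.get("name")]  # function name match
    for arg, gold_val in gold.get("args", {}).items():
        pred_val = pred.get("args", {}).get(arg)
        if pred_val is None:
            checks.append(False)           # argument name missing
        elif pred_val == gold_val:
            checks.append(True)            # exact value match
        elif similarity is not None:
            checks.append(similarity(pred_val, gold_val) >= threshold)
        else:
            checks.append(False)
    return sum(checks) / len(checks)

def task1_final_score(func_score, bleurt_score):
    """Final Task 1 score: simple average of the two metrics."""
    return (func_score + bleurt_score) / 2
```

For example, a call with the correct name and one of two argument values correct would score 2/3 on the Function Score under this sketch.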
Notes
- No human evaluation is conducted for Task 1.
- In case of a tie, the submission with the higher Function score is ranked higher.
Task 2
Automatic Evaluation
- Final Score: Average of CPDC Score and BLEURT Score. Since the CPDC Score ranges from 1 to 5, it is normalized to 0–1 before averaging.
- CPDC Score: An LLM-based evaluation method (see CPDC2023 paper for details). We confirmed high correlation with human evaluation. Due to the extensive knowledge involved in this challenge, the original CPDC prompt was modified.
- BLEURT Score: Same as in Task 1. There was a concern that relying on a single metric would weaken its correlation with human judgment as participants repeatedly optimize their models against it. To address this, we combine BLEURT (reference-based) with CPDC (LLM-based) for a more stable metric.
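The normalization and averaging above can be sketched as below. The announcement only states that the 1–5 CPDC scale is mapped to 0–1; the linear (min–max) mapping here is our assumption.

```python
def normalize_cpdc(cpdc_score):
    """Map a CPDC score on the 1-5 scale to [0, 1].

    Assumes a linear min-max rescaling; the actual mapping used by the
    organizers is not specified beyond the target range.
    """
    return (cpdc_score - 1) / 4

def task2_final_score(cpdc_score, bleurt_score):
    """Final Task 2 score: average of normalized CPDC and BLEURT."""
    return (normalize_cpdc(cpdc_score) + bleurt_score) / 2
```

Under this sketch, a mid-scale CPDC score of 3 normalizes to 0.5 before being averaged with BLEURT.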
Human Evaluation
- The top eight teams based on automatic evaluation in Task 2 and Task 3 were selected for human evaluation.
- Method: Two aspects are assessed: Response Quality and Knowledge Consistency.
- Evaluators compare dialogue samples (7–10 turns) between a pair of teams. For each dialogue, each team is assigned a win, lose, or tie. The comparisons form a round-robin tournament across the eight teams.
- Response Quality (Response Rank): Judged by three evaluators. Criteria include alignment with dialogue history, fluency, naturalness, human-likeness, and consistency with persona traits.
- Knowledge Consistency (Knowledge Rank): Judged by two evaluators. Criteria include consistency with persona and world information, avoidance of fabrication, and accuracy regarding items, quests, time, and weather.
- Results are aggregated per aspect: one outcome from the three Response Quality evaluators and another from the two Knowledge Consistency evaluators, rather than a single win/lose/tie from all five evaluators combined.
Notes
- In case of a tie, the submission with the higher CPDC score is ranked higher.
- If Task 2 ranks are tied, the team with the higher Knowledge Rank is ranked higher.
- If the Response or Knowledge Rank is tied, the team with the higher number of ties is ranked higher.
Task 3
Automatic Evaluation
- Final Score: Average of Task 1 and Task 2 automatic scores.
Human Evaluation
- The top eight teams based on automatic evaluation in Task 2 and Task 3 were selected for human evaluation.
- Final Score: Sum of ranks from Task 1 and Task 2.
- When Task 2 is evaluated by humans, we do not average its result with Task 1's automatic score. Instead, Task 1 is ranked by its automatic score, and that rank is added to the human-evaluation rank from Task 2.
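The rank combination described above can be sketched as follows. The team names, scores, and ranks are made-up examples; the sketch assumes higher automatic scores are better and that the Task 1 score breaks ties on the summed rank, consistent with the tie-breaking rule for Task 3.

```python
def rank_from_scores(scores):
    """Rank teams by automatic score, highest first (rank 1 is best)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {team: i + 1 for i, team in enumerate(order)}

def task3_ranking(task1_scores, task2_human_ranks):
    """Order teams by the sum of Task 1 rank and Task 2 human rank.

    Ties on the summed rank are broken by the higher Task 1 score
    (hypothetical tie-break implementation).
    """
    t1_ranks = rank_from_scores(task1_scores)
    return sorted(
        task1_scores,
        key=lambda t: (t1_ranks[t] + task2_human_ranks[t], -task1_scores[t]),
    )
```

For example, a team ranked 2nd on Task 1 and 1st in Task 2's human evaluation (sum 3) would place ahead of a team ranked 1st and 3rd (sum 4).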
Notes
- In case of a tie, the submission with the higher Task 1 score is ranked higher.
References
- BLEURT Score: https://aclanthology.org/2020.acl-main.704/
- CPDC Score: https://arxiv.org/abs/2406.11228
Submit to Wordplay Workshop @ EMNLP 2025
We also invite you to take the next step: turn your CPDC solution into a paper, submit it to the Wordplay Workshop @ EMNLP 2025, and showcase your work to the broader NLP, RL, and interactive narrative research communities.
Key Dates
- Submission deadline: September 12, 2025, 23:59 (AoE)
- Author notification: September 26, 2025
- Workshop date: November 9, 2025
- Submission link: OpenReview
Guidelines
- Keyword: Include "CPDC" in your submission form
- Team ID: Mention your CPDC team ID in the report
- Page Limit: 4–8 pages (excluding references/supplementary materials), formatted to ACL style
- Review: Light review; accepted reports will appear on the workshop website (non-archival) and not in EMNLP proceedings
Workshop site: /call_for_papers
We strongly encourage you to document and share your system design, approaches, and insights. This is a great opportunity to contribute to dialogue research and gain visibility within the EMNLP community.