Hello Participants,
Thank you for your efforts and valuable contributions to advancing dialogue systems through this challenge. This note outlines the final evaluation process and explains how the final scores were determined. You can find the leaderboard, along with the top three winners for each track, on the challenge page.
Evaluation Process
Task 1
Automatic Evaluation
- Final Score: Average of Function Score and BLEURT Score.
- Function Score: Measures how well the submission matches the gold reference in terms of function names, argument names, and argument values. For argument values where exact matches are not required, semantic similarity is used; if the similarity exceeds a threshold, the value is counted as a match.
- BLEURT Score: A reference-based metric with strong correlation to human evaluation (validated in CPDC2023). For this challenge, we confirmed the correlation again using results from the Warmup Round and Round 1.
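The Task 1 scoring scheme above can be sketched as follows. This is an illustrative sketch only: the call/argument structure, the `similarity` callback, and the 0.8 threshold are assumptions for demonstration, not the organizers' actual implementation.

```python
def function_score(pred, gold, similarity=None, threshold=0.8):
    """Score one predicted function call against the gold reference.

    Checks the function name, each gold argument name, and each argument
    value. `similarity` is any semantic-similarity function returning a
    value in [0, 1]; values above `threshold` count as matches for
    arguments where exact matching is not required (hypothetical design).
    """
    checks = [pred.get("name") == gold.get("name")]  # function name match
    for arg, gold_val in gold.get("args", {}).items():
        pred_val = pred.get("args", {}).get(arg)
        if pred_val is None:
            checks.append(False)           # argument name missing
        elif pred_val == gold_val:
            checks.append(True)            # exact value match
        elif similarity is not None:
            checks.append(similarity(pred_val, gold_val) >= threshold)
        else:
            checks.append(False)
    return sum(checks) / len(checks)

def task1_final_score(func_score, bleurt_score):
    """Final Task 1 score: simple average of the two metrics."""
    return (func_score + bleurt_score) / 2
```

For example, a call with the correct name and one of two argument values correct would score 2/3 on the Function Score under this sketch.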
Notes
- No human evaluation is conducted for Task 1.
- In case of a tie, the submission with the higher Function score is ranked higher.
Task 2
Automatic Evaluation
- Final Score: Average of CPDC Score and BLEURT Score. Since the CPDC Score ranges from 1 to 5, it is normalized to 0–1 before averaging.
- CPDC Score: An LLM-based evaluation method (see CPDC2023 paper for details). We confirmed high correlation with human evaluation. Due to the extensive knowledge involved in this challenge, the original CPDC prompt was modified.
- BLEURT Score: Same as in Task 1. There was a concern that relying on a single metric would weaken its correlation with human judgment as participants repeatedly optimize their models against it. To address this, we combine BLEURT (reference-based) with CPDC (LLM-based) for a more stable metric.
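The normalization and averaging above can be sketched as below. The announcement only states that the 1–5 CPDC scale is mapped to 0–1; the linear (min–max) mapping here is our assumption.

```python
def normalize_cpdc(cpdc_score):
    """Map a CPDC score on the 1-5 scale to [0, 1].

    Assumes a linear min-max rescaling; the actual mapping used by the
    organizers is not specified beyond the target range.
    """
    return (cpdc_score - 1) / 4

def task2_final_score(cpdc_score, bleurt_score):
    """Final Task 2 score: average of normalized CPDC and BLEURT."""
    return (normalize_cpdc(cpdc_score) + bleurt_score) / 2
```

Under this sketch, a mid-scale CPDC score of 3 normalizes to 0.5 before being averaged with BLEURT.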
Human Evaluation
- The top eight teams based on automatic evaluation in Task 2 and Task 3 were selected for human evaluation.
- Method: Two aspects are assessed: Response Quality and Knowledge Consistency.
- Evaluators compare dialogue samples (7–10 turns) between a pair of teams. For each dialogue, each team is assigned a win, lose, or tie. The comparisons form a round-robin tournament across the eight teams.
- Response Quality (Response Rank): Judged by three evaluators. Criteria include alignment with dialogue history, fluency, naturalness, human-likeness, and consistency with persona traits.
- Knowledge Consistency (Knowledge Rank): Judged by two evaluators. Criteria include consistency with persona and world information, avoidance of fabrication, and accuracy regarding items, quests, time, and weather.
- Results are aggregated per aspect: one outcome from the three Response Quality evaluators and another from the two Knowledge Consistency evaluators, rather than a single win/lose/tie from all five evaluators combined.
Notes
- In case of a tie, the submission with the higher CPDC score is ranked higher.
- If Task 2 ranks are tied, the team with the higher Knowledge Rank is ranked higher.
- If the Response or Knowledge Rank is tied, the team with the higher number of ties is ranked higher.
Task 3
Automatic Evaluation
- Final Score: Average of Task 1 and Task 2 automatic scores.
Human Evaluation
- The top eight teams based on automatic evaluation in Task 2 and Task 3 were selected for human evaluation.
- Final Score: Sum of ranks from Task 1 and Task 2.
- When Task 2 is evaluated by humans, we do not average its result with Task 1's automatic score. Instead, Task 1 is ranked by its automatic score, and that rank is added to the human-evaluation rank from Task 2.
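The rank combination described above can be sketched as follows. The team names, scores, and ranks are made-up examples; the sketch assumes higher automatic scores are better and that the Task 1 score breaks ties on the summed rank, consistent with the tie-breaking rule for Task 3.

```python
def rank_from_scores(scores):
    """Rank teams by automatic score, highest first (rank 1 is best)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {team: i + 1 for i, team in enumerate(order)}

def task3_ranking(task1_scores, task2_human_ranks):
    """Order teams by the sum of Task 1 rank and Task 2 human rank.

    Ties on the summed rank are broken by the higher Task 1 score
    (hypothetical tie-break implementation).
    """
    t1_ranks = rank_from_scores(task1_scores)
    return sorted(
        task1_scores,
        key=lambda t: (t1_ranks[t] + task2_human_ranks[t], -task1_scores[t]),
    )
```

For example, a team ranked 2nd on Task 1 and 1st in Task 2's human evaluation (sum 3) would place ahead of a team ranked 1st and 3rd (sum 4).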
Notes
- In case of a tie, the submission with the higher Task 1 score is ranked higher.
References
- BLEURT Score: https://aclanthology.org/2020.acl-main.704/
- CPDC Score: https://arxiv.org/abs/2406.11228
Submit to Wordplay Workshop @ EMNLP 2025
We also invite you to take the next step: turn your CPDC solution into a paper, submit it to the Wordplay Workshop @ EMNLP 2025, and showcase your work to the broader NLP, RL, and interactive narrative research communities.
Key Dates
- Submission deadline: September 12, 2025, 23:59 (AoE)
- Author notification: September 26, 2025
- Workshop date: November 9, 2025
- Submission link: OpenReview
Guidelines
- Keyword: Include "CPDC" in your submission form
- Team ID: Mention your CPDC team ID in the report
- Page Limit: 4–8 pages (excluding references/supplementary materials), formatted to ACL style
- Review: Light review; accepted reports will appear on the workshop website (non-archival) and not in EMNLP proceedings
Workshop site: /call_for_papers
We strongly encourage you to document and share your system design, approaches, and insights. This is a great opportunity to contribute to dialogue research and gain visibility within the EMNLP community.