Final Evaluation Process & Team Scores

Hello Participants,

We want to thank all participants for their efforts and contributions. This post provides information about the final evaluation process and shares the final scores.

Aggregation:

For the grand prize and special awards, we aggregated performance across Task 1, Task 2, and Task 3 without applying any weighting. Each interaction, regardless of task, was treated equally in the final scoring process. This approach ensures a fair comparison across different solution strategies.
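As a rough illustration of what "no weighting" means here, the sketch below pools every interaction from all three tasks and takes one mean, then contrasts that with averaging the per-task means. The function names and the per-interaction values are made up for illustration and are not taken from the actual evaluation code or data.

```python
from typing import Dict, List


def pooled_score(per_task_scores: Dict[str, List[float]]) -> float:
    """Mean over all interactions pooled across tasks (each interaction counts once)."""
    all_scores = [s for scores in per_task_scores.values() for s in scores]
    if not all_scores:
        raise ValueError("no interactions to aggregate")
    return sum(all_scores) / len(all_scores)


def mean_of_task_means(per_task_scores: Dict[str, List[float]]) -> float:
    """Alternative shown only for contrast: average the per-task means (each task counts once)."""
    means = [sum(s) / len(s) for s in per_task_scores.values() if s]
    if not means:
        raise ValueError("no tasks to aggregate")
    return sum(means) / len(means)


# Hypothetical per-interaction scores; the real tasks have many more interactions.
example = {
    "task1": [1.0, 0.0, 0.0, 0.0],
    "task2": [1.0, 1.0],
    "task3": [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
}

print(f"pooled over interactions: {pooled_score(example):.3f}")
print(f"mean of task means:       {mean_of_task_means(example):.3f}")
```

With pooling, a task that has more interactions contributes proportionally more to the final number, which is the behavior described above.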

Missing Rates:

We reviewed the missing rates of all submissions selected for final evaluation. The missing rates for the winning solutions were within acceptable limits. As a result, no top-performing teams were disqualified due to missing data.

Task Scores

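| Task | Team | Submission ID | Score |
|---|---|---|---|
| Task 1 | Dianping-Trust-Safety | 289765 | 12.8% |
| Task 1 | db3 | 289693 | 8.4% |
| Task 1 | cruise | 289794 | 6.7% |
| Task 2 | Team_NVIDIA | 289355 | 23.3% |
| Task 2 | db3 | 289788 | 22.1% |
| Task 2 | AcroYAMALEX | 289902 | 21.4% |
| Task 3 | db3 | 289655 | 36.8% |
| Task 3 | BlackPearl | 288641 | 30.9% |
| Task 3 | Dianping-Trust-Safety | 288234 | 29.7% |
| All egocentric images | db3 | 289693, 289788, 289655 | 21.0% |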

Question Type Scores

| Question Type | Team | Submission IDs (Task 1, Task 2, Task 3) | Score |
|---|---|---|---|
| Simple | NEC_AI_ROCKETS | 289393, 288443, 289337 | 15.9% |
| Multi-hop | otonadake | 289524, 286760, 286785 | 5.9% |
| Comparison and Aggregation | gogoogo | 288599, 287800, 288715 | 3.3% |
| Reasoning | otonadake | 289524, 286760, 286785 | 10.3% |

Team Meta CRAG-MM

@snehananavati @yilun_jin @Jiaqi
Thank you for sharing the detailed results.
This is tereka, a team member of AcroYAMALEX.

My submission ID 289902 is a Task 3 submission ID.
So I checked the other teams: 289355 (Team_NVIDIA) is Task 3, and 289655 (db3) is Task 2.

Did you swap the Task 2 and Task 3 results?

Could the organizers confirm whether the Task 2 and Task 3 results might have been mixed up?

For Task 3, commit ID 288641 seems to be our team's submission (Dianping-Trust-Safety).

@tereka @l0wang

Can you clarify what you are referring to as task2 and task3?
We did notice that teams were confusing Task 2 and Task 3 when submitting the Google form.

Task 2 is multi-source augmentation
Task 3 is multi-turn QA

Tasks 2 and 3 have different numbers of interactions, so it's unlikely that the two were swapped, since the total counts matched.

@l0wang thanks for catching this. The submission_id was wrong, but the scores and rankings are correct.

This is the correct mapping. We will update the table soon.
BlackPearl, 288234 → 30.9%
Dianping-Trust-Safety, 288641 → 29.7%

Thank you for the reply.
I understand; I had just mixed up the Task 2 and Task 3 names.
