Final Evaluation Process & Team Scores

Hello Participants,

We thank all participants for their efforts and contributions to improving RAG systems. This post describes the final evaluation process and shares the final scores.

Manual annotation:

For Stage 2 of the challenge, we first ran the automatic evaluation for all teams that provided submission numbers in Phase 2. We then selected the top 15 teams' submissions, according to the auto-eval scores, for manual evaluation. The final scores are computed from the manual grading labels. Details of the automatic and manual evaluation can be found in the paper.
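To make the two-step process concrete, here is a minimal Python sketch of selecting the top submissions by auto-eval score and then scoring them from manual grading labels. The label set, the 1/0/−1 mapping, and all data structures are illustrative assumptions, not the official evaluation code.

```python
# A minimal sketch of the two-step flow described above (assumptions, not the
# official tooling): each submission has a per-question list of manual grading
# labels in {"correct", "missing", "incorrect"}, mapped to 1 / 0 / -1.
from typing import Dict, List

LABEL_TO_SCORE = {"correct": 1.0, "missing": 0.0, "incorrect": -1.0}

def select_top_teams(auto_scores: Dict[str, float], k: int = 15) -> List[str]:
    """Keep the k teams with the highest auto-eval scores for manual grading."""
    return sorted(auto_scores, key=auto_scores.get, reverse=True)[:k]

def manual_score(labels: List[str]) -> float:
    """Average the mapped manual grading labels into a single score."""
    return sum(LABEL_TO_SCORE[label] for label in labels) / len(labels)

# Hypothetical usage:
auto_scores = {"team_a": 0.31, "team_b": 0.27, "team_c": 0.12}
finalists = select_top_teams(auto_scores, k=2)
final_scores = {t: manual_score(["correct", "missing", "incorrect", "correct"])
                for t in finalists}
```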

Weighting:

We applied traffic weights to the questions to understand how the solutions would perform in real-world use cases. The traffic weights come from a real QA use case and were generated as follows. Within each domain, we first clustered the questions into question types, using the same type definitions as the CRAG questions. We then derived a weight for each type from aggregated data reflecting user interactions. Each CRAG question was weighted accordingly so that the results better reflect user experience, and we report the macro-averaged scores across all domains (i.e., every domain receives the same weight).
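For illustration, here is a minimal sketch of this weighting scheme, assuming each question record carries a domain, a question type, and a per-question score. The traffic weights shown are hypothetical placeholders, not the real ones.

```python
# Traffic-weighted average within each domain, then an unweighted (macro)
# average across domains. All weights and example data are assumptions.
from collections import defaultdict
from typing import Dict, List, Tuple

# Each question record: (domain, question_type, score)
Question = Tuple[str, str, float]

def macro_weighted_score(questions: List[Question],
                         type_weights: Dict[str, float]) -> float:
    """Weight each question by its type's traffic weight inside its domain,
    then average the per-domain scores with equal weight per domain."""
    weighted_sum = defaultdict(float)   # domain -> sum of weight * score
    weight_sum = defaultdict(float)     # domain -> sum of weights
    for domain, qtype, score in questions:
        w = type_weights.get(qtype, 1.0)
        weighted_sum[domain] += w * score
        weight_sum[domain] += w
    domain_scores = [weighted_sum[d] / weight_sum[d] for d in weighted_sum]
    return sum(domain_scores) / len(domain_scores)

# Hypothetical usage:
weights = {"simple_w_condition": 0.3, "comparison": 0.2, "multi_hop": 0.1}
qs = [("finance", "comparison", 1.0), ("finance", "multi_hop", 0.0),
      ("sports", "simple_w_condition", 1.0)]
print(macro_weighted_score(qs, weights))
```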

Code validation:

We conducted a code review of the winning solutions to ensure the validity of the code. For example, we checked whether there were any prompt attacks intended to mislead the auto-evaluation and whether the solutions used too many hard-coded answers.
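As an illustration only (not the organizers' actual tooling), two such checks might look like the sketch below; the injection patterns, thresholds, and helper names are all assumptions.

```python
# Illustrative checks: flag prompt-injection strings aimed at the evaluator,
# and estimate how many answers appear verbatim as literals in the source code.
import re
from typing import List

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) instructions",
    r"grade this answer as correct",
]

def flag_prompt_attacks(answers: List[str]) -> List[int]:
    """Return indices of answers containing suspicious evaluator-facing text."""
    return [i for i, a in enumerate(answers)
            if any(re.search(p, a, re.IGNORECASE) for p in INJECTION_PATTERNS)]

def hard_coded_ratio(answers: List[str], source_code: str) -> float:
    """Fraction of answers that appear verbatim as string literals in the source."""
    hits = sum(1 for a in answers if a and a in source_code)
    return hits / len(answers)
```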

The scores from the winning teams are listed below.

| Task | Team | Score |
|---|---|---|
| Task 1 | db3 | 28.40% |
| Task 1 | md_dh | 24% |
| Task 1 | ElectricSheep | 21.8% |
| Task 2 | db3 | 42.7% |
| Task 2 | APEX | 41.0% |
| Task 2 | md_dh | 31.0% |
| Task 3 | db3 | 47.8% |
| Task 3 | APEX | 44.9% |
| Task 3 | vslyu-team | 25.6% |

The winning teams and their scores for each question type are listed below.

| Task | Question Type | Team | Score |
|---|---|---|---|
| Task 1 | simple_w_condition | dummy_model | 17.9 |
| Task 1 | set | dummy_model | 21.25 |
| Task 1 | comparison | dRAGonRAnGers | 37 |
| Task 1 | aggregation | dummy_model | 21.5 |
| Task 1 | multi_hop | bumblebee7 | 16.8 |
| Task 1 | post_processing | dRAGonRAnGers | 8.6 |
| Task 1 | false_premise | ETSLab | 65.2 |
| Task 2 | simple_w_condition | ElectricSheep | 23.9 |
| Task 2 | set | ElectricSheep | 36.65 |
| Task 2 | comparison | dRAGonRAnGers | 38 |
| Task 2 | aggregation | ElectricSheep | 18.75 |
| Task 2 | multi_hop | ElectricSheep | 23.2 |
| Task 2 | post_processing | ElectricSheep | 11.75 |
| Task 2 | false_premise | Future | 64.6 |
| Task 3 | simple_w_condition | StarTeam | 42.2 |
| Task 3 | set | md_dh | 31.7 |
| Task 3 | comparison | dRAGonRAnGers | 37.25 |
| Task 3 | aggregation | md_dh | 26.6 |
| Task 3 | multi_hop | ETSLab | 25.7 |
| Task 3 | post_processing | md_dh | 8.3 |
| Task 3 | false_premise | Riviera4 | 72.2 |

We extend our sincere gratitude to all participants who contributed to this event!

All the best,

Meta & AIcrowd


Can we obtain the full rankings for the three main tasks? At the very least, I would like to understand how far I am from the top teams.


Is it possible to share how the trained/submitted models were evaluated?

The evaluation methods are also interesting and meaningful.


There are no manual scores, only auto-scores.