Hello Participants,
We thank all participants for their efforts and contributions to improving RAG systems. This post describes the final evaluation process and shares the final scores.
Manual annotation:
For stage 2 of this challenge, we first conducted automatic evaluation for all teams that provided submission numbers in Phase 2. We then selected the top 15 teams' submissions, ranked by auto-eval score, for manual evaluation. The final scores are computed from the manual grading labels. Details of both the automatic and the manual evaluation can be found in the paper.
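As a rough illustration, here is a minimal Python sketch of turning per-question grading labels into a single team score. The label set and values (perfect = 1, acceptable = 0.5, missing = 0, incorrect = -1) follow the scoring scheme described in the CRAG paper; treat the exact mapping as an assumption rather than the official grading code:

```python
# Assumed label-to-value mapping, based on the CRAG paper's scoring scheme.
LABEL_VALUES = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}

def team_score(labels: list[str]) -> float:
    """Average per-question label values into a single score in [-1, 1]."""
    return sum(LABEL_VALUES[label] for label in labels) / len(labels)

# Example: one perfect, one missing, one incorrect, one perfect answer.
print(f"{team_score(['perfect', 'missing', 'incorrect', 'perfect']):.2%}")  # 25.00%
```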
Weighting:
We applied traffic weights to the questions to understand how the solutions would perform in real-world use cases. The traffic weights come from a real QA use case and are generated as follows. Within each domain, we first clustered the questions into question types, using the same type definitions as the CRAG questions. We then derived a weight for each type from aggregated data reflective of user interactions. We applied these weights to the CRAG questions so that the results reflect user experience, and we report the macro-averaged scores across all domains (i.e., every domain receives equal weight), as sketched below.
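For concreteness, this is a minimal sketch of the traffic-weighted macro average. The question-type weights and per-domain scores below are hypothetical placeholders; the real weights are derived from user-interaction data and are not published here:

```python
# Per-domain scores keyed by question type (values are illustrative only).
scores = {
    "finance": {"simple_w_condition": 0.30, "comparison": 0.25},
    "sports":  {"simple_w_condition": 0.40, "comparison": 0.20},
}
# Hypothetical traffic weights per question type.
weights = {"simple_w_condition": 0.7, "comparison": 0.3}

def domain_score(type_scores: dict[str, float]) -> float:
    """Weight each question type by its traffic share within the domain."""
    total_w = sum(weights[t] for t in type_scores)
    return sum(weights[t] * s for t, s in type_scores.items()) / total_w

# Macro average: every domain contributes equally, regardless of its size.
macro = sum(domain_score(s) for s in scores.values()) / len(scores)
print(f"{macro:.2%}")  # 31.25% for the illustrative numbers above
```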
Code validation:
We conducted a code review of the winning solutions to ensure the validity of the code. For example, we checked for prompt attacks designed to mislead the auto-evaluation and for excessive use of hard-coded answers.
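As an illustration of the kind of check involved, the sketch below flags submissions containing many long string literals, a rough proxy for hard-coded answers. The file path, regex, and threshold are all hypothetical and do not describe the actual review tooling:

```python
import re

# Long string literals are a rough proxy for hard-coded answers; the 20-char
# cutoff and the 50-literal threshold below are hypothetical.
LONG_LITERAL = re.compile(r'["\'][^"\'\n]{20,}["\']')

def count_long_literals(source: str) -> int:
    """Count long string literals in a submission's source code."""
    return len(LONG_LITERAL.findall(source))

source = open("submission/model.py", encoding="utf-8").read()  # hypothetical path
if count_long_literals(source) > 50:
    print("Flag for manual review: possible hard-coded answers.")
```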
The scores of the overall task winners are listed below.
| Task | Team | Score |
|---|---|---|
| Task 1 | db3 | 28.4% |
| Task 1 | md_dh | 24.0% |
| Task 1 | ElectricSheep | 21.8% |
| Task 2 | db3 | 42.7% |
| Task 2 | APEX | 41.0% |
| Task 2 | md_dh | 31.0% |
| Task 3 | db3 | 47.8% |
| Task 3 | APEX | 44.9% |
| Task 3 | vslyu-team | 25.6% |
The scores of the per-question-type winners are listed below.
| Task | Question Type | Team | Score |
|---|---|---|---|
| Task 1 | simple_w_condition | dummy_model | 17.9 |
| Task 1 | set | dummy_model | 21.25 |
| Task 1 | comparison | dRAGonRAnGers | 37 |
| Task 1 | aggregation | dummy_model | 21.5 |
| Task 1 | multi_hop | bumblebee7 | 16.8 |
| Task 1 | post_processing | dRAGonRAnGers | 8.6 |
| Task 1 | false_premise | ETSLab | 65.2 |
| Task 2 | simple_w_condition | ElectricSheep | 23.9 |
| Task 2 | set | ElectricSheep | 36.65 |
| Task 2 | comparison | dRAGonRAnGers | 38 |
| Task 2 | aggregation | ElectricSheep | 18.75 |
| Task 2 | multi_hop | ElectricSheep | 23.2 |
| Task 2 | post_processing | ElectricSheep | 11.75 |
| Task 2 | false_premise | Future | 64.6 |
| Task 3 | simple_w_condition | StarTeam | 42.2 |
| Task 3 | set | md_dh | 31.7 |
| Task 3 | comparison | dRAGonRAnGers | 37.25 |
| Task 3 | aggregation | md_dh | 26.6 |
| Task 3 | multi_hop | ETSLab | 25.7 |
| Task 3 | post_processing | md_dh | 8.3 |
| Task 3 | false_premise | Riviera4 | 72.2 |
We extend our sincere gratitude to all participants who contributed to this event!
All the best,
Meta & AIcrowd