Final Evaluation Process & Team Scores

Hello Participants,

We want to thank all participants for their efforts and contributions to improving RAG systems. This post describes the final evaluation process and shares the final scores.

Manual annotation:

For Phase 2 of the challenge, we first ran automatic evaluation on the submissions of all teams that provided submission numbers. We then selected the top 15 teams’ submissions, according to the auto-evaluation scores, for manual evaluation. The final scores are computed from the manual grading labels. Details of the automatic and manual evaluation can be found in the paper.
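As a rough illustration only (the paper has the authoritative definitions), here is a minimal Python sketch of how a score could be computed from manual grading labels, assuming the point values described in the CRAG paper (perfect = 1, acceptable = 0.5, missing = 0, incorrect = -1). The function and label names are ours, not the official evaluation code.

```python
# Minimal sketch, not the official evaluation code.
# Assumes the point values from the CRAG paper: perfect = 1, acceptable = 0.5,
# missing = 0, incorrect = -1.

POINTS = {"perfect": 1.0, "acceptable": 0.5, "missing": 0.0, "incorrect": -1.0}

def score_from_labels(labels):
    """Average the per-question points over a list of manual grading labels."""
    return sum(POINTS[label] for label in labels) / len(labels)

# Example: 5 perfect, 1 acceptable, 2 missing, 2 incorrect answers
# -> (5*1 + 0.5 + 0 - 2) / 10 = 0.35
print(score_from_labels(["perfect"] * 5 + ["acceptable"] + ["missing"] * 2 + ["incorrect"] * 2))
```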

Weighting:

We applied traffic weights to the questions to understand how the solutions would perform in real-world use cases. The traffic weights come from a real QA use case and were generated as follows. Within each domain, we first clustered the questions into question types using the same definitions as the CRAG question types. We then derived a weight for each type based on aggregated data reflective of real user interactions. We applied these weights to the CRAG questions so that the results better reflect user experience, and we report the macro-averaged scores across all domains (i.e., giving every domain equal weight).
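As an illustration of this aggregation, here is a minimal Python sketch. The domain names and traffic weight values in it are hypothetical placeholders, not the real weights derived from user data.

```python
from collections import defaultdict

# Hypothetical placeholder weights; the real traffic weights are derived
# from aggregated user-interaction data and are not published here.
TRAFFIC_WEIGHTS = {
    "finance": {"simple_w_condition": 0.5, "comparison": 0.2, "aggregation": 0.3},
    "sports":  {"simple_w_condition": 0.4, "comparison": 0.4, "aggregation": 0.2},
}

def macro_average(per_question_scores):
    """per_question_scores: iterable of (domain, question_type, score) tuples.

    Each question is weighted by the traffic weight of its type within its
    domain; the per-domain weighted averages are then macro-averaged with
    equal weight per domain.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # domain -> [weighted score sum, weight sum]
    for domain, qtype, score in per_question_scores:
        weight = TRAFFIC_WEIGHTS[domain][qtype]
        totals[domain][0] += weight * score
        totals[domain][1] += weight
    domain_averages = [s / w for s, w in totals.values()]
    return sum(domain_averages) / len(domain_averages)
```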

Code validation:

We conducted a code review of the winning solutions to ensure their validity. For example, we checked whether any prompt attacks were used to mislead the auto-evaluation and whether the solutions relied on too many hard-coded answers.
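As a purely hypothetical illustration of one such check (the actual review was manual), a script along these lines could flag submissions containing many literal gold-answer strings for closer inspection:

```python
# Hypothetical heuristic sketch, not the actual (manual) review process:
# count how often known gold-answer strings appear verbatim in a submission's
# source files, to flag possible hard-coded answers for closer inspection.
from pathlib import Path

def count_hardcoded_answers(src_dir, gold_answers):
    hits = 0
    for path in Path(src_dir).rglob("*.py"):
        text = path.read_text(errors="ignore").lower()
        hits += sum(text.count(answer.lower()) for answer in gold_answers)
    return hits

# Example (hypothetical paths and answers):
# print(count_hardcoded_answers("submission/", ["example answer one", "example answer two"]))
```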

The winning teams’ scores for each task are listed below.

Task     Team            Score
Task 1   db3             28.40%
Task 1   md_dh           24%
Task 1   ElectricSheep   21.8%
Task 2   db3             42.7%
Task 2   APEX            41.0%
Task 2   md_dh           31.0%
Task 3   db3             47.8%
Task 3   APEX            44.9%
Task 3   vslyu-team      25.6%

The best score for each question type, along with the team that achieved it, is listed below.

Task     Question Type        Team            Score
Task 1   simple_w_condition   dummy_model     17.9
Task 1   set                  dummy_model     21.25
Task 1   comparison           dRAGonRAnGers   37
Task 1   aggregation          dummy_model     21.5
Task 1   multi_hop            bumblebee7      16.8
Task 1   post_processing      dRAGonRAnGers   8.6
Task 1   false_premise        ETSLab          65.2
Task 2   simple_w_condition   ElectricSheep   23.9
Task 2   set                  ElectricSheep   36.65
Task 2   comparison           dRAGonRAnGers   38
Task 2   aggregation          ElectricSheep   18.75
Task 2   multi_hop            ElectricSheep   23.2
Task 2   post_processing      ElectricSheep   11.75
Task 2   false_premise        Future          64.6
Task 3   simple_w_condition   StarTeam        42.2
Task 3   set                  md_dh           31.7
Task 3   comparison           dRAGonRAnGers   37.25
Task 3   aggregation          md_dh           26.6
Task 3   multi_hop            ETSLab          25.7
Task 3   post_processing      md_dh           8.3
Task 3   false_premise        Riviera4        72.2

We extend our sincere gratitude to all participants who contributed to this event!

All the best,

Meta & AIcrowd


Can we obtain the full rankings for the three main tasks? At the very least, I would like to understand how far I am from the top teams.


Is it possible to share how the trained/submitted models were evaluated?

The evaluation methods are also interesting and meaningful.


There are no manual scores, only auto-scores.