You’re on vacation, strolling through ancient sites as your smart glasses share their history. Later, at a local restaurant, they translate the menu, helping you order with confidence. As the day winds down, you head back to the parking lot—no searching, no stress—your glasses pull up an image reminder of exactly where you parked.
Wearable devices are revolutionizing how we communicate, work, and experience the world. But to be truly valuable in everyday life, they must provide relevant, accurate, and reliable information tailored to users’ needs.
Introducing the Meta CRAG-MM Challenge
Comprehensive RAG Benchmark for Multi-modal, Multi-turn Question Answering
Why CRAG-MM Matters
Vision-Language Large Models (VLLMs) have made significant strides, powering visual question answering (VQA) systems at the heart of smart glasses. Yet, they still suffer from hallucinations—particularly when responding to long-tail or complex queries that require integrating multiple capabilities: recognition, OCR, external knowledge, and generation.
Retrieval-Augmented Generation (RAG) offers a promising solution. Multi-modal RAG (MM-RAG) systems synthesize information from both images and questions, retrieve from external sources, and generate grounded answers. But evaluating such systems remains a challenge—until now.
What is CRAG-MM?
CRAG-MM is a visual question-answering benchmark built for the real-world complexities of wearable devices. It includes:
- A diverse collection of 5,000 images, including 3,000 egocentric images captured via Ray-Ban Meta smart glasses.
- Coverage across 14 domains and 4 question types, ranging from direct image-based queries to complex reasoning and multi-hop questions.
- Both single-turn and multi-turn conversations for a complete evaluation of MM-RAG solutions.
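To make the dataset composition above concrete, one benchmark entry could be modeled roughly as follows. This is a hypothetical sketch; the class and field names are illustrative only, not the official CRAG-MM data schema:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    question: str          # user query for this conversation turn
    answer: str            # ground-truth answer for evaluation

@dataclass
class CragMMExample:
    image_path: str        # normal or egocentric (smart-glasses) photo
    domain: str            # one of the 14 domains, e.g. "food"
    question_type: str     # "simple-recognition", "simple-knowledge",
                           # "multi-hop", or "comparison-reasoning"
    turns: list[Turn] = field(default_factory=list)  # one turn = single-turn QA

# A single-turn, image-only example:
example = CragMMExample(
    image_path="photos/menu.jpg",
    domain="food",
    question_type="simple-recognition",
    turns=[Turn("What dish is shown on this menu?", "paella")],
)
```

Multi-turn conversations would simply carry more than one `Turn`, letting later questions refer back to earlier answers.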
Challenge Tasks
We define four key question types:
- Simple Recognition: Answered directly from the image (e.g., “What brand is this milk?”).
- Simple Knowledge: Requires external sources (e.g., “What’s the price of this sofa on Amazon?”).
- Multi-hop: Requires chaining facts (e.g., “What other films has this director made?”).
- Comparison & Reasoning: Involves comparison or inference (e.g., “Is this cheaper on Amazon?”, “Can this dryer be used in Europe?”).
Three tasks are designed to evaluate different capabilities:
- Task 1: Single-turn QA using image-KG retrieval
- Task 2: Single-turn QA with added web retrieval
- Task 3: Multi-turn QA evaluating conversational depth
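To illustrate how the three tasks differ in the information available to a system, here is a minimal sketch of what a participant's answer functions might look like. The interfaces and retrieval helpers are hypothetical, not the official challenge API, and `generate` is a stand-in for a real VLLM call:

```python
from typing import Callable, List

# Hypothetical retrieval helpers; the real challenge provides its own
# image-KG and web search interfaces.
KGSearch = Callable[[bytes], List[str]]   # image -> knowledge-graph facts
WebSearch = Callable[[str], List[str]]    # text query -> web snippets

def generate(question: str, context: List[str]) -> str:
    """Stand-in for a VLLM call: here, just return the top evidence."""
    return context[0] if context else "I don't know"

def answer_task1(image: bytes, question: str, kg_search: KGSearch) -> str:
    """Task 1: single-turn QA grounded only in image-KG retrieval."""
    facts = kg_search(image)
    return generate(question, context=facts)

def answer_task2(image: bytes, question: str,
                 kg_search: KGSearch, web_search: WebSearch) -> str:
    """Task 2: single-turn QA with web retrieval added on top of the KG."""
    context = kg_search(image) + web_search(question)
    return generate(question, context=context)

def answer_task3(image: bytes, turns: List[str],
                 kg_search: KGSearch, web_search: WebSearch) -> List[str]:
    """Task 3: multi-turn QA; each answer may depend on earlier turns."""
    history: List[str] = []
    answers: List[str] = []
    for q in turns:
        context = kg_search(image) + web_search(q) + history
        a = generate(q, context=context)
        history.append(f"Q: {q} A: {a}")  # carry conversation state forward
        answers.append(a)
    return answers
```

The key structural difference is what each task feeds into generation: KG facts only (Task 1), KG plus web evidence (Task 2), and KG plus web evidence plus conversation history (Task 3).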
KDD Cup 2025
The Meta CRAG-MM Challenge is part of KDD Cup 2025, the premier data mining competition hosted by ACM SIGKDD.
Location: Toronto, Canada
Dates: 3–7 August 2025
Challenge Timeline
Prizes
Join the conversation on Discord – Connect with fellow participants, stay updated, and introduce yourself!
Form a team
Have feedback or questions?
*NO PURCHASE NECESSARY TO ENTER/WIN. Open to individuals aged 18+ who meet all eligibility criteria. Competition open from 6 March 2025 to 1 June 2025. Void where prohibited. Subject to the official rules at AIcrowd | Meta CRAG-MM; see rules for prize details.