Team db3 consists of third-year PhD students from Peking University, mentored by Professor Gao Jun. Their research focuses on data mining for structured data, including community search, graph alignment, and table data integration. The team has applied data mining to extract insights in fields such as social networks and bioinformatics, and this structured-data background carries over directly to their work with large language models (LLMs) and Retrieval-Augmented Generation (RAG) systems.
Winning Strategy:
Team db3 took first place in all three tasks of the Meta KDD Cup 2024, with scores of 28.4%, 42.7%, and 47.8%, respectively. Their RAG system combined several techniques:
- Task 1 - Web Retrieval and Answering:
The team developed a framework utilizing a combination of retrievers and rerankers to process and rank text chunks extracted from web pages. They employed BeautifulSoup for HTML parsing and LangChain for text splitting, alongside the bge-base-en-v1.5 retriever and a complementary reranker model to refine the selection of relevant text chunks.
- Tasks 2 and 3 - Integration of Structured Data:
For Tasks 2 and 3, the team integrated data from both web sources and mock Knowledge Graphs. They implemented a regularized API set and an API generation method using a tuned LLM. A Parent-Child Chunk Retriever managed the retrieval process, and the reranker then refined the selected data to improve accuracy and relevance.
- Addressing Hallucination in LLMs:
A significant aspect of their strategy was tuning the models to reduce inaccuracies and improve groundedness in responses, thereby addressing the issue of hallucination commonly associated with LLMs.
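The retrieval steps described for the tasks above (parse HTML, split into chunks, retrieve, then rerank, with parent-child chunking) can be illustrated with a minimal sketch. This is not the team's code. To keep it runnable with only the standard library, Python's `html.parser` stands in for BeautifulSoup, a fixed-size word splitter stands in for LangChain's text splitting, token-count cosine similarity stands in for the bge-base-en-v1.5 retriever, and a query-term coverage score stands in for the reranker model. All function names, parameters, and chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Stand-in for BeautifulSoup-based parsing (assumption)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def split_words(text, size):
    """Stand-in for a text splitter: fixed-size, non-overlapping word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def tokens(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Stand-in for embedding similarity: cosine over token counts."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def coverage(query_tokens, text_tokens):
    """Stand-in for a reranker model: fraction of query terms present in the text."""
    if not query_tokens:
        return 0.0
    return sum(1 for t in query_tokens if t in text_tokens) / len(query_tokens)


def build_index(pages, parent_size=40, child_size=10):
    """Split each page into parent chunks, and each parent into child chunks."""
    parents, children = [], []
    for html in pages:
        for parent in split_words(html_to_text(html), parent_size):
            parents.append(parent)
            pid = len(parents) - 1
            children.extend((pid, child) for child in split_words(parent, child_size))
    return parents, children


def retrieve(query, parents, children, top_children=5, top_parents=2):
    q = tokens(query)
    # Stage 1: match the query against the small child chunks.
    best = sorted(children, key=lambda c: cosine(q, tokens(c[1])), reverse=True)[:top_children]
    # Map matched children back to their parents, dropping duplicates in order.
    parent_ids = list(dict.fromkeys(pid for pid, _ in best))
    # Stage 2: rerank the candidate parent chunks with a second scorer.
    parent_ids.sort(key=lambda pid: coverage(q, tokens(parents[pid])), reverse=True)
    return [parents[pid] for pid in parent_ids[:top_parents]]
```

For example, calling `build_index` on a list of raw HTML strings and then `retrieve("When was the Eiffel Tower completed?", parents, children)` returns the larger parent chunks whose child chunks best matched the query. Matching on small chunks while returning their larger parents is the general idea behind parent-child chunk retrieval: small chunks give precise matches, and parent chunks give the LLM more surrounding context.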
Impact and Research Alignment:
Team db3’s method demonstrates that RAG systems can provide accurate and reliable answers by effectively integrating and processing external information. Their strategy builds on their research interests in data mining and structured data analysis, particularly knowledge graphs, which represent information in a structured format essential for applications such as semantic search and intelligent personal assistants.