Data schema differs between example_data and the real data (task 1 & 2)

A search response item in example_data/web.json has the following schema:
[‘page_name’, ‘page_url’, ‘page_snippet’, ‘page_result’, ‘page_last_modified’, ‘search_result_type’]
where ‘search_result_type’ indicates whether the value of ‘page_result’ is Web Content (html) or a Snippet.

But in the crag_task_1_v1.jsonl dataset, a search response item only has:
[‘page_name’, ‘page_url’, ‘page_snippet’, ‘page_result’, ‘page_last_modified’]
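To check which schema a given item follows, a small key comparison is enough. This is only a sketch: the field names come from the two lists above, while `missing_fields` and the sample item are hypothetical and not part of the starter kit.

```python
# Field names as listed above for example_data/web.json.
EXAMPLE_KEYS = {
    "page_name", "page_url", "page_snippet",
    "page_result", "page_last_modified", "search_result_type",
}
# crag_task_1_v1.jsonl items lack 'search_result_type'.
RELEASED_KEYS = EXAMPLE_KEYS - {"search_result_type"}


def missing_fields(item, expected):
    """Return the expected field names that are absent from a search result item."""
    return sorted(expected - set(item))


# Hypothetical item shaped like the released dataset (values are placeholders).
sample_item = {key: "" for key in RELEASED_KEYS}

print(missing_fields(sample_item, EXAMPLE_KEYS))
print(missing_fields(sample_item, RELEASED_KEYS))
```

Running this against each item of a loaded line would show whether it matches the example schema or the released one.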

Differences:

  1. Search response items in the crag_task_1_v1.jsonl dataset lack the ‘search_result_type’ field.
  2. In crag_task_1_v1.jsonl items, does Web Content (html) appear only in ‘page_result’, and the Snippet only in ‘page_snippet’?
  3. In the crag_task_1_v1 dataset, some items have only Web Content (html) and a zero-length Snippet (line 5, item 5 in this dataset), while others have only a Snippet and zero-length html (line 4, item 3 in this dataset).
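Until the schema is confirmed, one defensive way to handle items where either field may be empty is to fall back from one to the other. The sketch below assumes, as observed above but not confirmed, that ‘page_result’ carries the html and ‘page_snippet’ carries the snippet; `pick_content` is a hypothetical helper, not part of the starter kit.

```python
def pick_content(item):
    """Return (kind, text) for a search result item.

    Assumption: 'page_result' holds html and 'page_snippet' holds the snippet.
    Prefers html when present, falls back to the snippet, and reports
    'empty' when both are missing or blank.
    """
    html = (item.get("page_result") or "").strip()
    snippet = (item.get("page_snippet") or "").strip()
    if html:
        return "html", html
    if snippet:
        return "snippet", snippet
    return "empty", ""


# Hypothetical items mirroring the two cases described above.
only_html = {"page_result": "<html><body>text</body></html>", "page_snippet": ""}
only_snippet = {"page_result": "", "page_snippet": "a short snippet"}

print(pick_content(only_html)[0])
print(pick_content(only_snippet)[0])
```

This avoids depending on ‘search_result_type’, which the released dataset does not provide.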

My questions:

  1. Should I use the ‘search_result_type’ field to determine which type of content is in ‘page_result’?
  2. Which format should I use, example_data or crag_task_1_v1?

@vk11: You are correct. The included example data does not follow the updated dataset schema. We will correct this soon.

In the meantime, you can find the updated dataset schema at: docs/dataset.md · master · AIcrowd / Challenges / Meta Comprehensive RAG Benchmark - KDD Cup 2024 / Meta Comphrehensive RAG Benchmark starter kit · GitLab

Regarding the information that will be available to your model, your submission will only have access to the query and the search_results, as described here: models/dummy_model.py · master · AIcrowd / Challenges / Meta Comprehensive RAG Benchmark - KDD Cup 2024 / Meta Comphrehensive RAG Benchmark starter kit · GitLab

Best of luck

Hello @aicrowd_team,

“your submission will only have access to the query and the search_results”

Won’t you provide the query_time field? Will it include the exact time and timezone?

Based on the 10 provided samples and the API results (endpoint finance/get_price_history), the answer always matches the “Close” value for 2024-02-14. However, I am a bit confused because in the subset of HTML files I looked at, the query date is always 2024-02-16.

Hi @simon_jegou,

We have updated the starter kit to include the example dev data in the same schema as the released data. The local evaluation scripts have been updated as well to work seamlessly with the new schema.

Best of luck