Is there any evaluation on data quality yet?

When reviewing the results, I noticed a few issues. I'm not sure if others have experienced the same problems, but if the accuracy of the Ground Truth isn't at a certain level, it's challenging to assess the performance of the RAG system.

Here are a few issues I noticed in the Task 1 data:

  1. In 2016, “Zootopia” won the award, while in 2017, it was “Coco,” as indicated by the information provided at Academy Award for Best Animated Feature Film — Full List.
    {
      "interaction_id": "0b367b2b-e638-4f60-8422-dedcd9210410",
      "query": "in 2017, which animated film was honored with the best animated feature film oscar award?",
      "answer": "zootopia",
      "question_type": "simple",
      "alternative_answers": [],
      "split": 0
    }

  2. There was no answer provided for the query regarding when IRRX last issued dividends to its shareholders.
    {
      "interaction_id": "8ddcf0d9-f3be-4bba-b4c3-dc1fd65a480b",
      "query": "when did irrx last issue dividends to its shareholders?",
      "answer": "",
      "question_type": "simple",
      "alternative_answers": [],
      "split": 1
    }
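
For anyone who wants to scan for the second kind of issue themselves, here is a minimal sketch that flags empty ground-truth answers. It assumes the Task 1 data is a JSONL file with the fields shown above; the file name task1_data.jsonl is just a placeholder.

    import json

    # Minimal sketch: flag Task 1 entries whose ground-truth answer is empty.
    # "task1_data.jsonl" is a placeholder file name; the field names follow
    # the snippets above.
    def find_empty_answers(path):
        flagged = []
        with open(path) as f:
            for line in f:
                entry = json.loads(line)
                if not entry.get("answer", "").strip():
                    flagged.append((entry["interaction_id"], entry["query"]))
        return flagged

    for interaction_id, query in find_empty_answers("task1_data.jsonl"):
        print(interaction_id, query)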


see another one:
{
  "interaction_id": "b06f3622-6840-4595-b32b-c407d350758c",
  "query": "what was the total points tally for indiana pacers in 2022-12, encompassing all games played during that month?",
  "answer": "1841",
  "question_type": "aggregation",
  "alternative_answers": [],
  "split": 1
}
while the search_results provided are:
0 Must Watch Female Superhero Movies https://www.cbr.com/female-superhero-movies-dc-marvel/
1 10 Great Superhero Films with Female Leads https://movieweb.com/great-superhero-films-with-female-leads/
2 Sheena and Sonja, Diana and Danvers: A Box Office History of the Female-Led Superhero Movie - Boxoffice
3 Evolution of the Female Action Hero: Photos
4 Sheena and Sonja, Diana and Danvers: A Box Office History of the Female-Led Superhero Movie - Boxoffice
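
None of these results mention the Pacers at all. As a rough bulk check, a token-overlap score between the query and the result titles can flag such mismatches; this is only a sketch, and the stopword list is an arbitrary illustration:

    STOPWORDS = {"the", "a", "an", "in", "for", "of", "to", "was",
                 "what", "which", "all", "that", "during"}

    def tokens(text):
        # Lowercase, keep alphanumeric words, drop common stopwords.
        words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
        return {w for w in words if w not in STOPWORDS}

    def overlap(query, title):
        # Fraction of query tokens that also appear in the result title.
        q = tokens(query)
        return len(q & tokens(title)) / len(q) if q else 0.0

    query = ("what was the total points tally for indiana pacers in 2022-12, "
             "encompassing all games played during that month?")
    for title in ["Must Watch Female Superhero Movies",
                  "10 Great Superhero Films with Female Leads"]:
        print(f"{overlap(query, title):.2f}  {title}")  # both near 0.0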


Hello xiwei_zhou,

For 0b367b2b-e638-4f60-8422-dedcd9210410, the 89th Academy Award for Best Animated Feature was actually won by Zootopia in 2017. Please see https://www.oscars.org/oscars/ceremonies/2017 for reference.

For 8ddcf0d9-f3be-4bba-b4c3-dc1fd65a480b, it's indeed a ground-truth error. We have updated the data to V2, which replaces some low-quality questions and fixes ground-truth errors like this one. Please update your copy of the data when you can. If you notice any other issues, please feel free to let us know. :slight_smile:

For b06f3622-6840-4595-b32b-c407d350758c, it happens that sometimes the web search results are not quite relevant to the question, for various reasons. We consider such cases practical scenarios for real RAG systems as well.

Best Regards,
The CRAG Team

Thanks, Graceyx.yale.

I saw the V2 data and will test it out.

I think 0b367b2b-e638-4f60-8422-dedcd9210410 is an ambiguous item. The question "in 2017, which animated film was honored with the best animated feature film oscar award?" could refer to a film released in 2017, or to the Oscar ceremony held in 2017 (honoring films released in 2016).
The linked page, Academy Award for Best Animated Feature Film — Full List, shows:

THE 89TH ACADEMY AWARDS | 2017

Dolby Theatre at the Hollywood & Highland Center

Sunday, February 26, 2017

Honoring movies released in 2016

With an ambiguous question, even with all context available, the answer produced by any model will be either random or biased toward its training data.

For b06f3622-6840-4595-b32b-c407d350758c, if the context doesn't contain the right data, how is the model supposed to generate the right answer? Are we expecting the LLM to have the knowledge to answer, i.e. that the answer happens to be in the LLM's training data?
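
One pragmatic option (just a sketch of an idea, not anything the organizers prescribe) is to abstain when retrieval looks unrelated, reusing the overlap() helper sketched above; the page_name field and the 0.2 threshold are assumptions:

    def answer_or_abstain(query, search_results, llm_answer_fn, min_overlap=0.2):
        # If no retrieved page title shares enough vocabulary with the query,
        # answer "i don't know" instead of letting the model guess from its
        # training data. overlap() is the helper sketched earlier; "page_name"
        # and the 0.2 threshold are assumptions, not part of any official API.
        best = max((overlap(query, r.get("page_name", ""))
                    for r in search_results), default=0.0)
        if best < min_overlap:
            return "i don't know"
        return llm_answer_fn(query, search_results)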

Two more questions on evaluation:

  1. I noticed some entries don't have an answer, like
    {
        "interaction_id": "d03b2ded-1395-4593-9aa3-92b6c4bd1c3b",
        "query": "would you happen to know the price-to-earnings ratio for psf?",
        "answer": "",
        "question_type": "simple",
        "alternative_answers": [],
        "split": 0
    }

does the prediction need to be "" to be correct?

  2. In local_evaluate, the "query_time" is not sent to the evaluation; the program only receives the query and search_results. Will this be changed?
    Lots of questions are time-related, and without this information the answer is a random guess.

Hi xiwei_zhou,

0b367b2b-e638-4f60-8422-dedcd9210410 should not be ambiguous, since it asks "in 2017 … was honored".

The empty-answer issue should be addressed in V2. The evaluation doesn't need the query_time since the ground-truth answer is provided, but you can decide how to use the query_time during inference.

Hope this helps.

Best Regards,
The CRAG Team

The "in 2017 … was honored" reading makes sense.

The V2 data looks much better; a quick scan found only 3 empty-answer entries, listed below:

088e4793-67cf-4b77-86ab-b591969ce954 who was president chester a. arthur’s vice president?
491777d3-ad18-462b-9194-e361b4a1b040 on which day did the gdev inc. warrant distribute dividends in the last year?
4633c075-9828-4fad-9251-ba5302fea523 how much was the last dividend from investcorp india acquisition corp. warrant?


I have also noticed some discrepancies in the data. For example:

{'interaction_id': 'd5e6eeb0-103d-4917-8be8-5b8e82d30ae4',
 'query_time': Timestamp('2024-03-10 21:34:45'),
 'domain': 'open',
 'question_type': 'post-processing',
 'static_or_dynamic': 'slow-changing',
 'query': 'how many daughters do ryan reynolds and blake have?',
 'answer': 'four',
 'alternative_answers': []}

They have 3 daughters and not 4. Will add more to the list :stuck_out_tongue:

query_time is indeed very important information (especially, as you said, for time-related questions and for real-world scenarios). I don't know why we can't feed it into the generate_answer() method during evaluation.


Hello Graceyx.yale,

Can you check the post Data Quality Collection V2, Task-1?

It shows that data in certain categories is not in good shape. I randomly picked 20 entries from music & aggregation, and the first 3 all had quality issues, which I hope is a coincidence.

Thanks.

Hi nutansahoo,

Thank you for calling this out. The query_time has been added to the generate_answer interface for inference. Please refer here for more details.
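
For illustration, here is a minimal sketch of how query_time might be used on the inference side; the class shape and field names are assumptions based on this thread, not the exact starter-kit interface:

    class MyRAGModel:
        def __init__(self, llm):
            # llm: any callable that maps a prompt string to an answer string.
            self.llm = llm

        def generate_answer(self, query, search_results, query_time):
            # query_time anchors time-relative phrases such as "last year"
            # or "when did X last issue dividends" to a concrete date.
            context = "\n".join(r.get("page_name", "") for r in search_results)
            prompt = (f"Current time: {query_time}\n"
                      f"Context:\n{context}\n"
                      f"Question: {query}\nAnswer:")
            return self.llm(prompt)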

Thanks,
The CRAG Team