Examples from Task 1

How can we identify which images and ground truth data in the single-turn dataset are exclusively related to Task 1?

Thanks

Hi Seb,

Task 1 and Task 2 use the same dataset (i.e. the single-turn dataset). The only difference is the allowed search source.

Yes, but how can we distinguish the samples (or the ground truth) between Task 1 and Task 2? For which samples is the search source actually necessary?

I don’t quite understand your question. Could you please clarify?

Sure, sorry.

Out of the 1.55k images, which ones require a web search to arrive at the ground-truth answer, and which ones do not? The highlighted example with session_id: 7b23bff8-7f17-41ee-8832-80ba361060ed contains the following question:

“Can I put batteries into the left bin?”

And the corresponding ans_full is:

“No, no states allow batteries to be put into recycling bins.”

This example does not require a web search, because:

  • The answer is based on general knowledge (common regulations across the U.S. or EU).
  • It doesn’t depend on specific visual details of the bin (though the image helps with context).
  • The ground truth provides a clear, factual statement that a general-purpose LLM could know.

So even though an image is provided, the question and answer are not visually grounded and do not require up-to-date or external information.
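For reference, this is roughly how I am pulling that sample up to inspect it, as a minimal sketch: I'm assuming here that the single-turn dataset is distributed as a JSONL file (the file name "single_turn.jsonl" and the "question" key are my assumptions; only the session_id and ans_full fields come from the example above):

```python
import json

DATASET_PATH = "single_turn.jsonl"  # hypothetical file name for the single-turn dataset
TARGET_ID = "7b23bff8-7f17-41ee-8832-80ba361060ed"

# Load one JSON record per line.
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

# Find the highlighted sample by its session_id and print the fields discussed above.
sample = next(r for r in records if r.get("session_id") == TARGET_ID)
print(sample.get("question"))  # "question" key is an assumption
print(sample.get("ans_full"))
```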

So how can we evaluate only the images that belong to Task 1? :)