We have added a small utility that makes it painless and safe to fetch the complete web-page contents for any result returned by the `cragmm-search-pipeline`.

Pull the latest version of the starter-kit to pick up the changes described below.
1. What changed?
| File | Purpose |
|---|---|
| `agents/rag_agent.py` | Each search hit is now wrapped in `WebSearchResult`, so you can access `result["page_content"]` directly. |
| `crag_web_result_fetcher.py` | New helper class. Handles on-demand download, local caching under `~/.cache/crag/web_search_results`, and transparent access to all original fields (`url`, `page_snippet`, etc.). |
| `docs/search_api.md` | New “Fetching the full page content” section with a copy-paste example. |
No existing APIs are removed or renamed.
2. Quick-start

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()
results = search("What to know about Andrew Cuomo?", k=2)

for hit in results:
    hit = WebSearchResult(hit)        # ← wrap once
    print(hit["page_content"][:500])  # full HTML, first 500 chars
```

That’s all; you do not need to call `requests` yourself.
3. How the helper works

- First run (local development): the helper downloads the page at `hit["url"]` and stores it in the cache directory `~/.cache/crag/web_search_results` (override with `CRAG_WEBSEARCH_CACHE_DIR`); see the sketch after this list.
- Subsequent runs: it reads straight from the cache, with zero network overhead.
- Evaluation phase: the same `page_content` field is pre-populated in the evaluation container, so no download attempt is made and your code remains identical.
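Conceptually, the first-run logic looks like the following. This is a minimal sketch, not the actual implementation in `crag_web_result_fetcher.py`; in particular, keying cache files on a SHA-256 hash of the URL is an assumption made here for illustration.

```python
import hashlib
import os
from pathlib import Path

import requests

# Cache location used by this sketch; the env-var override mirrors the
# behaviour described above.
CACHE_DIR = Path(
    os.environ.get(
        "CRAG_WEBSEARCH_CACHE_DIR",
        str(Path.home() / ".cache" / "crag" / "web_search_results"),
    )
)


def fetch_page_content(url: str) -> str:
    """Download `url` once, then serve it from the local cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Assumption: cache files are keyed on a hash of the URL.
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        # Subsequent runs: zero network overhead.
        return cache_file.read_text(encoding="utf-8", errors="replace")
    # First run: fetch over the network and persist for next time.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text
```

In the evaluation container the cache is effectively pre-populated, so the network branch above is never taken and your code behaves identically.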
4. Guidelines & caveats

- Use only the URLs returned by `cragmm-search-pipeline`. Trying to fetch other sites will fail during evaluation (outbound internet is disabled).
- The local cache size is entirely up to you; clean or relocate it if needed.
- The helper exposes every original key plus the new `page_content` attribute: `hit["page_url"]`, `hit["page_name"]`, `hit["page_snippet"]`, `hit["score"]`, … (see the example below).
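For instance, continuing the quick-start example above, every documented key remains readable through the wrapper (the exact set of keys per hit may vary by result):

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()

for hit in search("What to know about Andrew Cuomo?", k=2):
    hit = WebSearchResult(hit)
    # Original metadata keys pass through the wrapper unchanged...
    print(hit["page_name"], "| score:", hit["score"])
    print(hit["page_snippet"][:200])
    # ...and "page_content" is the only new field, populated on demand.
    print(len(hit["page_content"]), "characters of page content")
```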
5. Action required

- `git pull` (or re-clone) the starter-kit.
- Ensure your code wraps each search result in `WebSearchResult` before accessing `page_content`.
- Run your usual tests to confirm everything works as expected; a quick sanity check is sketched below.
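If you want a one-off sanity check after pulling, a snippet along these lines (the query is arbitrary) should run without errors:

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()

for hit in search("What to know about Andrew Cuomo?", k=1):
    hit = WebSearchResult(hit)
    # Non-empty page_content confirms the fetch/cache path works locally.
    assert hit["page_content"], "expected non-empty page content"

print("starter-kit update looks good")
```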
If you hit any issues, open a thread in the forum and tag the organisers—happy hacking!