Starter‑Kit Update – **`WebSearchResult` helper for full‑page retrieval**

We have added a small utility that makes it painless (and safe) to fetch complete web-page contents for any result returned by the `cragmm-search-pipeline`.
Pull the latest version of the starter‑kit to pick up the changes described below.


1. What changed?

| File | Purpose |
| --- | --- |
| `agents/rag_agent.py` | Each search hit is now wrapped in `WebSearchResult`, so you can access `result["page_content"]` directly. |
| `crag_web_result_fetcher.py` | New helper class. Handles on-demand download, local caching under `~/.cache/crag/web_search_results`, and transparent access to all original fields (`url`, `page_snippet`, etc.). |
| `docs/search_api.md` | New "Fetching the full page content" section with a copy-paste example. |

No existing APIs are removed or renamed.


2. Quick‑start

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()
results = search("What to know about Andrew Cuomo?", k=2)

for hit in results:
    hit = WebSearchResult(hit)          # ← wrap once
    print(hit["page_content"][:500])    # full HTML, first 500 chars
```

That's all; you do not need to call `requests` yourself.


3. How the helper works

  • First run (local development): downloads the page at `hit["url"]` and stores it in the cache directory `~/.cache/crag/web_search_results` (override with `CRAG_WEBSEARCH_CACHE_DIR`).
  • Subsequent runs: reads straight from the cache, with zero network overhead.
  • Evaluation phase: the same `page_content` field is pre-populated in the evaluation container, so no download attempt is made and your code remains identical.
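The caching behavior above can be sketched roughly as follows. This is a simplified illustration, not the actual helper code: the function name `fetch_page_content` and the URL-hash cache-key scheme are assumptions; only the cache directory and the `CRAG_WEBSEARCH_CACHE_DIR` override come from the announcement.

```python
import hashlib
import os
from pathlib import Path

# Default cache dir from the announcement; override via CRAG_WEBSEARCH_CACHE_DIR.
DEFAULT_CACHE_DIR = Path(
    os.environ.get(
        "CRAG_WEBSEARCH_CACHE_DIR",
        Path.home() / ".cache" / "crag" / "web_search_results",
    )
)

def fetch_page_content(url: str, cache_dir: Path = DEFAULT_CACHE_DIR) -> str:
    """Return the full HTML for ``url``, downloading it at most once."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, keyed by a stable hash of the URL (an assumed scheme).
    cache_file = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if cache_file.exists():
        # Cache hit: zero network overhead on subsequent runs.
        return cache_file.read_text(encoding="utf-8")
    import requests  # only needed on a cache miss during local development
    html = requests.get(url, timeout=30).text
    cache_file.write_text(html, encoding="utf-8")
    return html
```

In the evaluation container the cache is effectively pre-populated, so the download branch is never reached and the same code works with outbound internet disabled.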

4. Guidelines & caveats

  1. Use only the URLs returned by the `cragmm-search-pipeline`.
    Trying to fetch other sites will fail during evaluation (outbound internet is disabled).

  2. The local cache size is entirely up to you; clean or relocate it if needed.

  3. The helper exposes every original key plus the new `page_content` attribute:

    • `hit["page_url"]`, `hit["page_name"]`, `hit["page_snippet"]`, `hit["score"]`, …
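To make the pass-through behavior concrete, here is a tiny dict-style sketch. It is illustrative only: the real `WebSearchResult` performs the download and caching described above, whereas this stand-in just substitutes the snippet as a placeholder for `page_content`.

```python
class WebSearchResultSketch:
    """Illustrative stand-in for WebSearchResult: original keys pass
    through unchanged, and page_content is filled in lazily on access."""

    def __init__(self, hit: dict):
        self._hit = dict(hit)

    def __getitem__(self, key: str):
        if key == "page_content" and key not in self._hit:
            # The real helper would download the page (or read it from cache);
            # this sketch substitutes the snippet as a placeholder.
            self._hit[key] = self._hit.get("page_snippet", "")
        return self._hit[key]

raw_hit = {"page_url": "https://example.com", "page_snippet": "Example Domain"}
hit = WebSearchResultSketch(raw_hit)
print(hit["page_url"])       # original field, unchanged
print(hit["page_content"])   # lazily populated
```

The point of the wrapper is exactly this: your code can treat every hit as a plain dict and simply gains one extra key.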

5. Action required

  • `git pull` (or re-clone) the starter-kit.
  • Ensure your code wraps each search result in `WebSearchResult` before accessing `page_content`.
  • Run your usual tests to confirm everything works as expected.

If you hit any issues, open a thread in the forum and tag the organisers. Happy hacking!


What about the images in the image API output? Are similar features available for image results?