We have added a small utility that makes it painless and safe to fetch the complete web-page contents for any result returned by the `cragmm-search-pipeline`.

Pull the latest version of the starter-kit to pick up the changes described below.
1. What changed?
| File | Purpose |
|---|---|
| `agents/rag_agent.py` | Each search hit is now wrapped in `WebSearchResult`, so you can access `result["page_content"]` directly. |
| `crag_web_result_fetcher.py` | New helper class. Handles on-demand download, local caching under `~/.cache/crag/web_search_results`, and transparent access to all original fields (`url`, `page_snippet`, etc.). |
| `docs/search_api.md` | New “Fetching the full page content” section with a copy-paste example. |
No existing APIs are removed or renamed.
2. Quick-start

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()
results = search("What to know about Andrew Cuomo?", k=2)

for hit in results:
    hit = WebSearchResult(hit)        # ← wrap once
    print(hit["page_content"][:500])  # full HTML, first 500 chars
```

That’s all; you do not need to call `requests` yourself.
3. How the helper works

- First run (local development): the helper downloads the page at `hit["url"]` and stores it in the cache directory `~/.cache/crag/web_search_results` (override with `CRAG_WEBSEARCH_CACHE_DIR`); see the sketch after this list.
- Subsequent runs: it reads straight from the cache, with zero network overhead.
- Evaluation phase: the same `page_content` field is pre-populated in the evaluation container, so no download attempt is made and your code remains identical.
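Conceptually, the first-run logic looks like the following. This is a minimal sketch, not the actual implementation in `crag_web_result_fetcher.py`; in particular, keying cache files on a SHA-256 hash of the URL is an assumption made here for illustration.

```python
import hashlib
import os
from pathlib import Path

import requests

# Cache location used by this sketch; the env-var override mirrors the
# behaviour described above.
CACHE_DIR = Path(
    os.environ.get(
        "CRAG_WEBSEARCH_CACHE_DIR",
        str(Path.home() / ".cache" / "crag" / "web_search_results"),
    )
)


def fetch_page_content(url: str) -> str:
    """Download `url` once, then serve it from the local cache."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    # Assumption: cache files are keyed on a hash of the URL.
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        # Subsequent runs: zero network overhead.
        return cache_file.read_text(encoding="utf-8", errors="replace")
    # First run: fetch over the network and persist for next time.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    cache_file.write_text(response.text, encoding="utf-8")
    return response.text
```

In the evaluation container the cache is effectively pre-populated, so the network branch above is never taken and your code behaves identically.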
4. Guidelines & caveats

- Use only the URLs returned by `cragmm-search-pipeline`. Trying to fetch other sites will fail during evaluation (outbound internet is disabled).
- The local cache size is entirely up to you; clean or relocate it if needed.
- The helper exposes every original key plus the new `page_content` attribute: `hit["page_url"]`, `hit["page_name"]`, `hit["page_snippet"]`, `hit["score"]`, … (see the example below).
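For instance, continuing the quick-start example above, every documented key remains readable through the wrapper (the exact set of keys per hit may vary by result):

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()

for hit in search("What to know about Andrew Cuomo?", k=2):
    hit = WebSearchResult(hit)
    # Original metadata keys pass through the wrapper unchanged...
    print(hit["page_name"], "| score:", hit["score"])
    print(hit["page_snippet"][:200])
    # ...and "page_content" is the only new field, populated on demand.
    print(len(hit["page_content"]), "characters of page content")
```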
5. Action required

- `git pull` (or re-clone) the starter-kit.
- Ensure your code wraps each search result in `WebSearchResult` before accessing `page_content`.
- Run your usual tests to confirm everything works as expected; a quick sanity check is sketched below.
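If you want a one-off sanity check after pulling, a snippet along these lines (the query is arbitrary) should run without errors:

```python
from cragmm_search.search import UnifiedSearchPipeline
from crag_web_result_fetcher import WebSearchResult

search = UnifiedSearchPipeline()

for hit in search("What to know about Andrew Cuomo?", k=1):
    hit = WebSearchResult(hit)
    # Non-empty page_content confirms the fetch/cache path works locally.
    assert hit["page_content"], "expected non-empty page content"

print("starter-kit update looks good")
```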
If you hit any issues, open a thread in the forum and tag the organisers—happy hacking!