I'm curious whether a good existing solution already exists for this tool:
`Tool name: WebFetch
Tool description:
- Fetches content from a specified URL and processes it using an AI model
- Takes a URL and a prompt as input
- Fetches the URL content, converts HTML to markdown
- Processes the content with the prompt using a small, fast model
- Returns the model's response about the content
- Use this tool when you need to retrieve and analyze web content`
I came up with this one:
`import asyncio

from markdownify import markdownify as md
from playwright.async_api import async_playwright
from readability import Document  # provided by the readability-lxml package


async def web_fetch_robust(url: str, prompt: str) -> str:
    """
    Fetches content from a URL using a headless browser to handle JS-heavy sites,
    processes it, and returns a summary.
    """
    try:
        async with async_playwright() as p:
            # Launch a headless browser (Chromium is a good default)
            browser = await p.chromium.launch()
            try:
                # --- Avoiding Blocks ---
                # Set a realistic User-Agent on the browser context so it applies to
                # every request and to navigator.userAgent, not only to one header
                context = await browser.new_context(
                    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
                )
                page = await context.new_page()
                # Navigate to the URL; wait_until='networkidle' gives JS-rendered
                # content time to load (timeout is in milliseconds)
                await page.goto(url, wait_until='networkidle', timeout=15000)
                # --- Extracting Content ---
                # Get the fully rendered HTML content
                html_content = await page.content()
            finally:
                # Close the browser even if navigation fails or times out
                await browser.close()
        # --- Processing for Token Minimization ---
        # 1. Extract main content with readability-lxml (a Python port of Readability, not Readability.js)
        doc = Document(html_content)
        main_content_html = doc.summary()
        # 2. Convert to clean Markdown; stripping 'a' and 'img' drops link/image markup to save tokens
        markdown_content = md(main_content_html, strip=['a', 'img'])
        # 3. Use the small, fast model to process the clean content
        # summary = small_model.process(prompt, markdown_content)  # Placeholder for your model call
        # For demonstration, we'll just return a message
        summary = f"A summary of the JS-rendered content from {url} would be generated here."
        return summary
    except Exception as e:
        return f"Error fetching or processing URL with headless browser: {e}"


# To run this async function
# result = asyncio.run(web_fetch_robust("https://example.com", "Summarize this."))
# print(result)
`
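One step the function above leaves open is how much of the Markdown to hand to the small model. Below is a minimal sketch of that step, assuming a plain character budget as a rough stand-in for a token limit. The names `build_model_input` and `max_chars`, the default budget, and the prompt layout are hypothetical choices of mine, not part of the WebFetch tool.

```python
import re


def build_model_input(prompt: str, markdown: str, max_chars: int = 20_000) -> str:
    """Tidy whitespace, cap the content length, and prepend the prompt.

    max_chars is an assumed character budget, a crude proxy for a token limit.
    """
    # Remove trailing spaces/tabs before newlines, then squeeze 3+ newlines into one blank line
    text = re.sub(r"[ \t]+\n", "\n", markdown)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()

    if len(text) > max_chars:
        # Prefer cutting at the last paragraph break that fits inside the budget;
        # fall back to a hard cut when there is no such break
        cut = text.rfind("\n\n", 0, max_chars)
        text = text[: cut if cut > 0 else max_chars] + "\n\n[content truncated]"

    # Hypothetical layout: prompt first, a separator, then the page content
    return f"{prompt}\n\n---\n\n{text}"
```

The returned string could then be passed to whatever model call replaces the placeholder in `web_fetch_robust`. A real token counter would be more accurate than counting characters, but which one to use depends on the model.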