jasongill 5 hours ago
I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy, something like https://www.example.com/cdn-cgi/cached-contents.json. They already have the website content in their cache, so why not just cut out the middleman of scraping services and APIs like this and publish it? Obviously there are good reasons NOT to, but I am surprised they haven't started offering it (as an "on-by-default" option, naturally) yet.
michaelmior 4 hours ago
> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy
It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.
selcuka 3 hours ago
Not the same thing, but they have something close (it's not on-by-default, yet) [1]:
> Cloudflare's network now supports real-time content conversion at the source, for enabled zones using content negotiation headers. Now when AI systems request pages from any website that uses Cloudflare and has Markdown for Agents enabled, they can express the preference for text/markdown in the request. Our network will automatically and efficiently convert the HTML to markdown, when possible, on the fly.
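For anyone who wants to try the content negotiation described in that quote, here is a minimal sketch using only the Python standard library. The quote only says the client expresses a preference for text/markdown; the exact `Accept` value with an HTML fallback, the helper names, and the assumption that a converted response is labeled `text/markdown` in its `Content-Type` are all mine, not taken from Cloudflare's documentation.

```python
import urllib.request

# Assumed Accept value: prefer Markdown, fall back to HTML at lower priority.
MARKDOWN_ACCEPT = "text/markdown, text/html;q=0.8"


def build_markdown_request(url: str) -> urllib.request.Request:
    """Build a GET request that states a preference for text/markdown."""
    return urllib.request.Request(url, headers={"Accept": MARKDOWN_ACCEPT})


def got_markdown(content_type: str) -> bool:
    """Return True if a Content-Type header value names text/markdown.

    Assumes a converted response is labeled text/markdown; parameters
    such as charset are ignored.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/markdown"


def fetch_preferring_markdown(url: str) -> tuple[bool, str]:
    """Fetch url and report whether the server answered with Markdown.

    Performs a real network request, so it is defined but not called here.
    """
    request = build_markdown_request(url)
    with urllib.request.urlopen(request, timeout=10) as response:
        content_type = response.headers.get("Content-Type", "")
        body = response.read().decode("utf-8", errors="replace")
    return got_markdown(content_type), body
```

Calling `fetch_preferring_markdown("https://www.example.com/")` against a zone without the feature enabled should simply come back as HTML, which is why the sketch checks the response type instead of assuming the conversion happened.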
cmsparks 4 hours ago
That would probably work for simple sites, but you still need a dedicated scraping service with a browser to render more complex sites (e.g. SPAs).
csomar 4 hours ago
It's a bit more complicated than that. This is their Browser Rendering product, which runs a real browser that loads the page and executes JavaScript, so it's more involved than scraping with a simple curl.