michaelmior 4 hours ago

> I'm surprised that Cloudflare hasn't started hosting a pre-scraped version of websites that use Cloudflare's proxy

It's entirely possible that they're doing this under the hood for cases where they can clearly identify the content they have cached is public.

janalsncm 3 hours ago | parent | next [-]

How would they know the content hasn’t changed without hitting the website?

coreq an hour ago | parent | next [-]

They wouldn't. There's ETag and the like, but that's still a layer-7 round trip to the origin. The usual pattern, though, is for the origin to declare in the response headers how long the content stays fresh, and for the cache to serve it for that duration. For example, a bitcoin pricing aggregator might say a page is good for 60 seconds (with a disclaimer on the page that this isn't market data), while My Little Town News might say an article is good for an hour (to allow updates) and the homepage is good for 5 minutes so a breaking story doesn't appear too far behind.
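The freshness pattern above can be sketched as a tiny policy check (hypothetical helper names; a real CDN also honours `s-maxage`, `no-cache`, `stale-while-revalidate`, and so on):

```python
import re
import time

def parse_max_age(cache_control: str):
    """Pull max-age (seconds) out of a Cache-Control header, if present."""
    m = re.search(r"max-age=(\d+)", cache_control)
    return int(m.group(1)) if m else None

def is_fresh(stored_at: float, cache_control: str, now=None) -> bool:
    """A cached copy is fresh while its age is under max-age."""
    max_age = parse_max_age(cache_control)
    if max_age is None:
        return False  # no freshness info: must revalidate at the origin
    age = (now if now is not None else time.time()) - stored_at
    return age < max_age

# The price aggregator's 60-second policy vs. the news article's hour:
print(is_fresh(stored_at=0, cache_control="public, max-age=60", now=30))    # True
print(is_fresh(stored_at=0, cache_control="public, max-age=3600", now=90))  # True
print(is_fresh(stored_at=0, cache_control="public, max-age=60", now=90))    # False
```

When `is_fresh` returns False, the cache either serves stale (if policy allows) or does that layer-7 round trip, ideally as a conditional request with `If-None-Match` so an unchanged page costs only a 304.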

OptionOfT 33 minutes ago | parent | prev [-]

Caching headers?

(Which, on Akamai, are by default ignored!)

binarymax 4 hours ago | parent | prev [-]

Based on the post, it seems likely that they'd just delay per the robots.txt policy no matter what, and do a full browser render of the cached page to get the content. That's probably overkill for lots and lots of sites; a plain HTML fetch plus readability extraction is really cheap.
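To illustrate how cheap the non-render path is, here's a minimal text-extraction sketch using only the stdlib parser. This is not the actual Readability algorithm (which also scores blocks by link density, paragraph length, etc.), just the crude tag-stripping version of the idea:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Crude readability stand-in: collect visible text, skipping script/style."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we're outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><head><script>var x = 1;</script></head>"
        "<body><h1>Breaking</h1><p>Story text.</p></body></html>")
print(extract_text(page))  # Breaking Story text.
```

No JavaScript execution, no layout, no headless browser: one fetch and a linear parse. The browser render only earns its cost on pages whose content is injected client-side.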