▲ radium3d 3 hours ago
Instead of "should have been an email," this is "should have been a prompt," and it can be run locally. There are a number of ways to do this from a Linux terminal:

```
Write a custom crawler that will crawl every page on a site (internal links to the original domain only), scroll down to mimic a human, and save the output as a WebP screenshot, HTML, Markdown, and structured JSON. Design it to run locally in a terminal on a Linux machine using headless Google Chrome, and take advantage of multiple cores to run multiple pages simultaneously, keeping in mind that it might have to throttle if the server gets hit too fast from the same IP.
```

It might use available open-source software such as Python, Playwright, BeautifulSoup4, Pillow, aiofiles, and Trafilatura.
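For what it's worth, the core loop such a prompt would produce is not that exotic. Here's a minimal sketch of the same-domain BFS with capped concurrency and a politeness delay, using plain asyncio; the `fetch` coroutine is a hypothetical placeholder that a real version would back with Playwright driving headless Chrome (plus the screenshot/Markdown/JSON export the prompt asks for):

```python
import asyncio
from urllib.parse import urljoin, urlparse

def internal_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep only same-domain http(s) links."""
    base_host = urlparse(base_url).netloc
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            out.append(absolute.split("#")[0])  # drop fragments
    return out

async def crawl(start_url, fetch, max_concurrency=4, delay=0.5):
    """Breadth-first crawl of same-domain pages.

    `fetch(url)` is a caller-supplied coroutine returning (html, hrefs);
    in a real crawler it would drive headless Chrome via Playwright.
    A semaphore caps parallel requests across cores/tasks, and `delay`
    is a crude per-request throttle so the origin isn't hammered.
    """
    seen = {start_url}
    queue = [start_url]
    results = {}
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with sem:
            html, hrefs = await fetch(url)
            await asyncio.sleep(delay)  # politeness throttle
        results[url] = html
        return internal_links(url, hrefs)

    while queue:
        batch, queue = queue, []
        for links in await asyncio.gather(*(worker(u) for u in batch)):
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return results
```

The semaphore-plus-delay combination is the simplest answer to the prompt's "throttle if the server gets hit too fast" clause; a production crawler would also want per-domain rate limits, retries with backoff, and robots.txt handling.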
▲ Normal_gaussian 3 hours ago | parent | next
This is presumably cheap and effective. It's much easier to wrap a prompt around this and know it works than to mess around with crawling it all yourself. You'll still be hand-rolling it if you want to ignore a site's crawling restrictions, though.
▲ supermdguy 3 hours ago | parent | prev
I’ve actually written a crawler like that before, and I still ended up going with Firecrawl for a more recent project. There are just so many headaches at scale: OOMs from heavy pages, proxies for sites that block cloud IPs, handling nested iframes, etc.