| ▲ | luizfelberti 3 days ago |
| I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching, though; it is (like others here have pointed out) building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs. I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go... |
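| For anyone starting from a single home box, the unglamorous baseline is just honoring robots.txt and rate-limiting yourself per host. A minimal sketch in Python (my own illustration, assuming Python 3.10+ and the third-party requests library; the user agent, delay, and seed URL are placeholders): |

    # Polite single-host fetch: check robots.txt, identify yourself, back off between requests.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse

    import requests

    USER_AGENT = "my-hobby-crawler/0.1 (contact: you@example.com)"  # placeholder identity
    CRAWL_DELAY = 5.0  # seconds between requests; tune per host

    def allowed(url: str) -> bool:
        """Return True if robots.txt permits fetching this URL."""
        parts = urlparse(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            rp.read()
        except OSError:
            return False  # robots.txt unreachable: skip the host rather than guess
        return rp.can_fetch(USER_AGENT, url)

    def fetch(url: str) -> str | None:
        """Fetch one page politely, or return None if disallowed or not OK."""
        if not allowed(url):
            return None
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        time.sleep(CRAWL_DELAY)
        return resp.text if resp.ok else None

    if __name__ == "__main__":
        print((fetch("https://example.com/") or "")[:200])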
|
| ▲ | mhitza 3 days ago | parent | next [-] |
| You might want to bookmark https://openwebsearch.eu/open-webindex/. While the index is currently not open source, it should be at some point, maybe when they get out of the beta stage (?). Details are still unclear. |
| |
| ▲ | 3RTB297 2 days ago | parent [-] |
| You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and Anubis and Cloudflare tests every time I try to look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free? I'll add it to the mile-long list of things that should exist and be online public goods. |
|
|
| ▲ | moduspol 2 days ago | parent | prev | next [-] |
| Is Common Crawl usable for something like this? https://commoncrawl.org |
| |
| ▲ | chiefsearchaco 2 days ago | parent | next [-] |
| I'm the creator of searcha.page and seek.ninja, and Common Crawl is the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k, but as long as my heuristics for choosing the right targets are good, those will be very high-value pulls. |
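| If anyone wants to see what pulling from Common Crawl looks like in practice, here's a rough sketch (my own illustration, not how searcha.page actually does it): query the public URL index for a pattern, then do a ranged read of the gzipped WARC record it points to. The crawl id below is a placeholder; pick a current one from https://index.commoncrawl.org/ (assumes Python 3 and the requests library). |

    # Look up captures for a URL pattern in the Common Crawl index,
    # then fetch one WARC record with an HTTP Range request.
    import gzip
    import json

    import requests

    CRAWL = "CC-MAIN-2024-33"  # placeholder crawl id; current ones are listed at index.commoncrawl.org
    INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"

    def lookup(url_pattern: str) -> list[dict]:
        """One JSON object per capture: url, status, filename, offset, length, ..."""
        resp = requests.get(INDEX_URL,
                            params={"url": url_pattern, "output": "json"},
                            timeout=60)
        resp.raise_for_status()
        return [json.loads(line) for line in resp.text.splitlines()]

    def fetch_record(rec: dict) -> bytes:
        """Ranged GET of a single gzipped WARC record from the public data bucket."""
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        resp = requests.get(f"https://data.commoncrawl.org/{rec['filename']}",
                            headers={"Range": f"bytes={start}-{end}"},
                            timeout=60)
        resp.raise_for_status()
        return gzip.decompress(resp.content)  # WARC headers + HTTP headers + page body

    if __name__ == "__main__":
        hits = lookup("example.com/*")
        if hits:
            print(fetch_record(hits[0])[:400])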
| ▲ | giancarlostoro 2 days ago | parent | prev [-] |
| Most likely it is; the issue then becomes being able to afford the storage for all the files. |
| ▲ | moduspol 2 days ago | parent [-] |
| Sure, and that's not easy, but it's a lot easier than having to crawl the entire public Internet yourself. |
|
|
|
| ▲ | wordpad 2 days ago | parent | prev | next [-] |
| Why can't crawling be crowdsourced? It would solve IP rotation and spread the load. |
| |
|
| ▲ | 6510 2 days ago | parent | prev | next [-] |
| The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them? |
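| Agreed that ordering is the open-ended part. For reference, the usual starting point before any link or quality signals is plain BM25 over the text; here's a toy version (textbook formula with standard-ish constants and a naive whitespace tokenizer, not something any engine mentioned here necessarily uses): |

    # Toy BM25 scorer over an in-memory corpus: a common lexical ranking baseline.
    import math
    from collections import Counter

    K1, B = 1.5, 0.75  # typical BM25 parameters

    def tokenize(text: str) -> list[str]:
        return text.lower().split()  # naive placeholder tokenizer

    def bm25_scores(query: str, docs: list[str]) -> list[float]:
        doc_toks = [tokenize(d) for d in docs]
        avg_len = sum(len(t) for t in doc_toks) / len(doc_toks)
        df = Counter()  # document frequency per term
        for toks in doc_toks:
            df.update(set(toks))
        scores = []
        for toks in doc_toks:
            tf = Counter(toks)
            score = 0.0
            for term in tokenize(query):
                if term not in tf:
                    continue
                idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
                norm = tf[term] + K1 * (1 - B + B * len(toks) / avg_len)
                score += idf * tf[term] * (K1 + 1) / norm
            scores.append(score)
        return scores

    if __name__ == "__main__":
        docs = ["a simple bread recipe", "bread crawling index notes", "search engine ranking"]
        print(bm25_scores("bread recipe", docs))  # the first doc should score highest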
|
| ▲ | ge96 3 days ago | parent | prev [-] |
| The IP thing is interesting. I was trying to make this CSGO bot one time to scrape Steam's prices, and there are proxy services out there you can rent; I tried at least one and it was blocked by Steam. So I wonder if people buy real IPs. |
| |
| ▲ | kccqzy 3 days ago | parent [-] |
| Yeah, people buy residential IPs on the black market. They are essentially infected home PCs and botnets. |
| ▲ | Bratmon 2 days ago | parent | next [-] |
| Not just the black market anymore! https://www.proxyrack.com/residential-proxies/ |
| ▲ | immibis 2 days ago | parent | prev [-] |
| You can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical: it's just two groups of equally bad businesspeople trying to spend money to block the other one. |
| ▲ | typpilol 2 days ago | parent [-] |
| I've heard a few horror stories... since the people using residential proxies aren't necessarily always good people. |
|
|
|