KellyCriterion 4 hours ago

Scraping is hard. Very good scraping is even harder. And today, being a scraping business is veeery difficult; there are some "open"/public indices, but none of these other indices ever took off

ghm2199 4 hours ago | parent | next [-]

Well, sure, I don't dispute that it's hard, but if the top tech companies put their heads together, I am sure that, for example, Meta, Apple, and MS have enough talent between them to make an open source index, if only to reap the gains from the de-monopolization of it all.

Imustaskforhelp 4 hours ago | parent [-]

I mean, doesn't Microsoft have Bing?

ghm2199 4 hours ago | parent [-]

Yeah, but no one uses it. I am not even sure that the people who are forced to use it like using it, because it was productized pretty poorly. After all, who wants another Google? They invested 100 billion dollars, which is a lot of wasted money TBH.

Search indexes are hard, surely, but if you were to strip it down to just a good index in the browser, make it free, and keep it fresh, it cannot cost 100 billion dollars to build. Then, if you use this DoJ decision to fight Google so that it cannot deny a free index equal rights on Chrome, you have a massive shot at a win for a LOT less money.

Imustaskforhelp 3 hours ago | parent [-]

> Yeah but no one uses it. I am not even sure people like using it because it was productized pretty poorly. They invested 100 Billion dollars, which is a lot of wasted money TBH.

I mean... DuckDuckGo uses the Bing API iirc, and I use DuckDuckGo, as do many people.

I also used Bing once because Bing used to cache websites that weren't available in the Wayback archive. I don't know how, but it was a pretty cool solution to a problem.

I hate bing too and I am kind of interested in ecosia/qwant's future as well (yes there's kagi too and good luck to kagi as well! but I am currently still staying on duckduckgo)

ghm2199 3 hours ago | parent | next [-]

DuckDuckGo is really cool. I am almost fully rooting for them, and they are my default on mobile and in my web browser.

The small distributed team grinding it out against the goliath. They are awesome and perhaps the right example of what a path like this would look like. Maybe someone from their team can chime in on the difficulties of building a search engine that works in the face of tremendous odds.

dylan604 2 hours ago | parent | prev [-]

I would imagine the users of DDG to be closer to a rounding error than an actual percentage of users. I'd imagine theGoog would love and hate to have 100%. They'd love it because all the data, and hate it as it would prove the monopoly. At the end of the day, the % that is not going to them probably doesn't cause theGoog to lose much sleep

Imustaskforhelp 2 hours ago | parent [-]

It's just so wild how great Duckduckgo is & how under-rated it is.

It's available in all major browsers. (Here in Zen browser there isn't even a default search engine; instead, the start page asks you to choose between three options: Google, DuckDuckGo and Bing. If you just press next it goes with Google, but Zen can start with DDG too, so it's not such a big deal.)

DuckDuckGo is super amazing. I mean, they are so amazing, and their duck.ai (or AI) actually provides concise data, unlike Google's AI.

DDG is leaps ahead of Google in terms of everything. I found Kagi pleasant too, but given PPP its price might make sense in Europe and America, and privacy isn't/shouldn't be only for those who pay. So DDG is great for me personally and I can't recommend it enough for most cases.

Brave/Startpage is a second but DDG is so good :)

It just works. (For most cases, that is. The only thing I use Google for is uploading an image to find more images like it, i.e. using an image as a search query. I just do !gi and open images.google.com, but I only use this very rarely. Bangs are an amazing DDG feature.)

dylan604 an hour ago | parent [-]

I use DDG myself. Based on how other people here say it's not very good, I just assumed that I'm not a very sophisticated user, as I've never had it fail to serve my needs.

renegat0x0 3 hours ago | parent | prev [-]

Scraping is hard, and at the same time not that hard. There are many scraping projects, so with a few lines you can implement a scraper using curl_cffi or Playwright.

People complain that the user-agent needs to be filled in. Boo-hoo, are we on Hacker News or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
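
To make the "just provide cookies and a user-agent" point concrete, here is a minimal sketch. It uses Python's standard-library urllib rather than curl_cffi or Playwright, and the helper name, the user-agent string and the cookie values are all made up for illustration.

```python
import urllib.request


def build_request(url, user_agent, cookies):
    """Build a GET request with an explicit User-Agent and Cookie header.

    Hypothetical helper for illustration; not taken from any project
    mentioned in this thread.
    """
    headers = {"User-Agent": user_agent}
    # Serialize the cookie dict into a single "name=value; name=value" header.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if cookie_header:
        headers["Cookie"] = cookie_header
    return urllib.request.Request(url, headers=headers)


# Placeholder values: a made-up user-agent and session cookie.
req = build_request(
    "https://example.com/",
    "my-personal-crawler/0.1",
    {"session": "abc123", "lang": "en"},
)

# Sending is left commented out so the sketch runs offline:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```

Sites with real bot protection usually check much more than these two headers, which is where the arms race comes in.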

I myself have implemented a simple solution that can jump through many hoops and returns a JSON response. Simple and easy [0].

On the other hand, it was always an arms race. It always will be. Eventually all content will be protected behind walled gardens; there is no getting around it.

Search engines affect me less and less every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, and YouTube channels [1].

Since the database is so big, the places I use most are extracted into a simple, fast web page backed by an SQLite table [2]. Scraping done right is not a problem.

[0] https://github.com/rumca-js/crawler-buddy

[1] https://github.com/rumca-js/Internet-Places-Database

[2] https://rumca-js.github.io/search
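
For anyone curious what a small SQLite table of most-used places could look like, here is a rough sketch. The schema, column names and visit counts are assumptions made up for illustration and are not taken from the linked repositories.

```python
import sqlite3

# In-memory database so the sketch is self-contained; a real setup
# would presumably use a file on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE places (
        url    TEXT PRIMARY KEY,
        title  TEXT NOT NULL,
        tags   TEXT NOT NULL DEFAULT '',
        visits INTEGER NOT NULL DEFAULT 0
    )
    """
)

# Illustrative rows only; the visit counts are invented.
conn.executemany(
    "INSERT INTO places (url, title, tags, visits) VALUES (?, ?, ?, ?)",
    [
        ("https://www.marginalia.nu/", "Marginalia Search", "search,indie", 12),
        ("https://github.com/rumca-js/crawler-buddy", "crawler-buddy", "scraping,github", 30),
        ("https://opensource.foursquare.com/os-places/", "Foursquare OS Places", "poi,dataset", 3),
    ],
)


def search_places(conn, term, limit=10):
    """Return (url, title) rows whose title or tags contain the term,
    most-visited first."""
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT url, title FROM places "
        "WHERE title LIKE ? OR tags LIKE ? "
        "ORDER BY visits DESC LIMIT ?",
        (pattern, pattern, limit),
    ).fetchall()


results = search_places(conn, "search")
```

A plain LIKE scan is fine for a table this small; for anything larger, SQLite's FTS5 extension would be the usual next step.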

SyneRyder an hour ago | parent | next [-]

+1 so much for this. I have been doing the same, an SQLite database of my "own personal internet" of the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself - which I actually did to replace Kagi.

Building a metasearch engine is not hard to do (especially with AI now). It's so liberating when you control the ranking algorithm, and can supplement what the big engines provide as results with your own index of sites and pages that are important to you. I admit, my results & speed aren't as good as Kagi, but still good enough that my personal search engine has been my sole search engine for a year now.
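
To give an idea of what "controlling the ranking algorithm" can mean in practice, here is a minimal sketch that merges result lists from several engines with reciprocal rank fusion and boosts anything found in a personal index. This is not necessarily how my engine or anyone else's works; the function name, the `k` and `boost` values, and the URLs are all assumptions for illustration.

```python
from collections import defaultdict


def fuse_results(engine_results, personal_index, k=60, boost=2.0):
    """Merge ranked URL lists with reciprocal rank fusion (RRF).

    engine_results: dict mapping engine name -> list of URLs, best first.
    personal_index: set of URLs that should be favoured.
    k and boost are arbitrary tuning knobs chosen for this sketch.
    """
    scores = defaultdict(float)
    for ranked_urls in engine_results.values():
        for rank, url in enumerate(ranked_urls, start=1):
            # Each engine contributes 1 / (k + rank) for every URL it returns.
            scores[url] += 1.0 / (k + rank)
    for url in scores:
        if url in personal_index:
            # Multiply the score of pages that are in the personal index.
            scores[url] *= boost
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Made-up results from two hypothetical upstream engines.
engine_results = {
    "engine_a": ["https://a.example/", "https://b.example/", "https://c.example/"],
    "engine_b": ["https://b.example/", "https://d.example/"],
}
personal_index = {"https://d.example/"}

merged = fuse_results(engine_results, personal_index)
```

With these inputs, the page in the personal index moves ahead of pages that only one engine returned, while a page returned by both engines still ranks first.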

If a site doesn't want me to crawl them, that's fine. I probably don't need them. In practice it hasn't gotten in the way as much as I might have thought it would. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.

I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/

ghm2199 42 minutes ago | parent | prev [-]

When I saw the Internet-Places-Database I thought it was an index on some sort of PoI and I got curious. But the personal internet spiel is pretty cool. One good addition to this could be the Foursquare PoI dataset for places search: https://opensource.foursquare.com/os-places/