ghm2199 6 hours ago

> Building a comparable one from scratch is like building a parallel national railroad..

Not to be pedantic, but I do have a noob question or two:

1. One is building the index, which is a lot harder without Google offering its own API to boot. If other tech companies really wanted to break this monopoly, why can't they just do it, like they did with LLM training for base models with the infamous "Pile" dataset? The upshot of offering this index as a public good would be to break not just Google's own monopoly but also other monopolies like Android, which would bring a breath of fresh air to a myriad of UX (mobile devices, browsers, maps, security). So, why don't they just do this already?

2. The other question is about "control", which the DoJ has provided guidance for but not yet enforced. IANAL, but why can't a state's attorney general enforce this?

oh_fiddlesticks 4 hours ago | parent | next [-]

> 1. One is building the index, which is a lot harder without Google offering its own API to boot. If other tech companies really wanted to break this monopoly, why can't they just do it?

FTA:

> Context matters: Google built its index by crawling the open web before robots.txt was a widespread norm, often over publishers’ objections. Today, publishers “consent” to Google’s crawling because the alternative - being invisible on a platform with 90% market share - is economically unacceptable. Google now enforces ToS and robots.txt against others from a position of monopoly power it accumulated without those constraints. The rules Google enforces today are not the rules it played by when building its dominance.

creato 3 hours ago | parent | next [-]

robots.txt was being enforced in court before google even existed, let alone before google got so huge:

> The robots.txt played a role in the 1999 legal case of eBay v. Bidder's Edge,[12] where eBay attempted to block a bot that did not comply with robots.txt, and in May 2000 a court ordered the company operating the bot to stop crawling eBay's servers using any automatic means, by legal injunction on the basis of trespassing.[13][14][12] Bidder's Edge appealed the ruling, but agreed in March 2001 to drop the appeal, pay an undisclosed amount to eBay, and stop accessing eBay's auction information.[15][16]

https://en.wikipedia.org/wiki/Robots.txt

dragonwriter 2 hours ago | parent | next [-]

Not only was eBay v. Bidder's Edge technically after Google existed, not before; more critically, the slippery-slope interpretation of California trespass to chattels law the District Court relied on in it was considered and rejected by the California Supreme Court in Intel v. Hamidi (2003), and similar logic applied to other states' trespass to chattels laws has been rejected by other courts since. eBay v. Bidder's Edge was an early aberration in the application of the law, not something that established or reflected a lasting norm.

throw-the-towel 2 hours ago | parent | prev | next [-]

Nitpick: Google incorporated in 1998, so, before the Bidder's Edge case.

yuuxheu 2 hours ago | parent | prev [-]

It’s an article from kagi.com, on one of the most kagi-astroturfed forums on the planet. I’m sure they did not expect a single critical reader, as such they do not need facts.

baggachipz 4 hours ago | parent | prev | next [-]

A classic case of climbing the wall, and pulling the ladder up afterward. Others try to build their own ladder, and Google uses their deep pockets and political influence to knock the ladder over before it reaches the top.

dylan604 2 hours ago | parent [-]

Why does Google even need to know about your ladder? Build the bot, scale it up, save all the data, then release. You can now remove the ladder and obey robots.txt just like G. Just like G, once you have the data, you have the data.

Why would you tell G that you are doing something? Why tell a competitor your plans at all? Just launch your product when the product is ready. I know that's anathema to SV startup logic, but in this case it's good business

ghm2199 4 hours ago | parent | prev [-]

True. But the thing is, if one says "We will make sure your site is in a worldwide, freely available index" which is kept fresh, Google's monopoly ship already begins to take on water. Here is an appropriate line from a completely different domain, rare earth metals, from The Economist on the Chinese govt's weaponization of rare earths[1]:

> Reducing its share from 90% to 80% may not sound like much, but it would imply a doubling in size of alternative sources of supply, giving China’s customers far more room for manoeuvre.

[1] https://archive.ph/POkHZ#selection-1233.117-1233.302
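
To make the arithmetic in that quote concrete, here is a tiny sketch. It assumes the overall market stays the same size, which is my simplification and not something the article states; the function name is made up for illustration.

```python
def alternative_growth(share_before_pct: int, share_after_pct: int) -> float:
    """Factor by which everyone else's combined share grows.

    Assumes the total market size is unchanged (illustrative assumption).
    """
    others_before = 100 - share_before_pct  # e.g. 100 - 90 = 10
    others_after = 100 - share_after_pct    # e.g. 100 - 80 = 20
    return others_after / others_before


# The incumbent drops from 90% to 80%: everyone else goes from 10% to 20%.
print(alternative_growth(90, 80))  # prints 2.0
```

So a ten-point drop for the incumbent is a doubling for the alternatives, which is the point about search share as well.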

jeromechoo 3 hours ago | parent | prev | next [-]

Building an index is easy. Building a fresh index is extremely hard.

Ranking an index is hard. It's not just BM25 or cosine similarity. How do you prioritize certain domains over others? How do you rank homepages that typically have no real content in them for navigational queries?
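
For anyone who hasn't met BM25, here is a minimal sketch of the scoring function, assuming whitespace tokenization, a common variant of the IDF term, and the often-used defaults k1=1.5 and b=0.75 (all of these are assumptions for illustration). It only sees term statistics, so it has no notion of domain priority or navigational intent.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "building a fresh index of the web is hard",
    "welcome to our homepage",
    "ranking an index needs more than term matching",
]
print(bm25_scores("fresh index", docs))
```

The homepage document scores zero for this query no matter how important the site is, which is exactly the gap a real ranker has to fill with other signals.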

Changing the behavior of 90% of the non-Chinese internet means unraveling 25 years and billions of dollars spent on ensuring Google is the default, and sometimes the only, option.

Historically, it takes a significant technological counter position or anti-trust breakup for a behemoth like Google to lose its footing. Unfortunately for us, Google is currently competing well in the only true technological threat to their existence to appear in decades.

AlienRobot 27 minutes ago | parent [-]

Good news! Google doesn't know how to rank pages either!

KellyCriterion 4 hours ago | parent | prev | next [-]

Scraping is hard. Very good scraping is even harder. And today, being a scraping business is veeery difficult; there are some "open"/public indices, but none of these other indices ever took off

ghm2199 4 hours ago | parent | next [-]

Well sure, yes, I don't dispute that it's hard, but if the top tech companies put their heads together, I am sure that, for example, Meta, Apple, and MS have enough talent between them to make an open source index, if only to reap gains from the de-monopolization of it all.

Imustaskforhelp 4 hours ago | parent [-]

I mean, doesn't microsoft have bing?

ghm2199 4 hours ago | parent [-]

Yeah but no one uses it. I am not even sure the people who are forced to use it like using it, because it was productized pretty poorly. After all, who wants another Google? They invested 100 billion dollars, which is a lot of wasted money TBH.

Search indexes are hard, surely, but if you were to strip it down to just a good index in the browser, made it free, and kept it fresh, it cannot cost 100 billion dollars to build. Then, if you use this DoJ decision to fight Google so that a free index is not denied equal rights on Chrome, you have a massive shot at a win for a LOT less money.

Imustaskforhelp 3 hours ago | parent [-]

> Yeah but no one uses it. I am not even sure people like using it because it was productized pretty poorly. They invested 100 billion dollars, which is a lot of wasted money TBH.

I mean... Duckduckgo uses bing api iirc and I use duckduckgo and many people use duckduckgo.

I also used Bing once because Bing used to cache websites which weren't available in the Wayback archive. I don't know how, but it was a pretty cool solution to a problem.

I hate bing too and I am kind of interested in ecosia/qwant's future as well (yes there's kagi too and good luck to kagi as well! but I am currently still staying on duckduckgo)

ghm2199 3 hours ago | parent | next [-]

Duck duck go is really cool. I am almost fully rooting for them and they are my default mobile and web browser.

The small distributed team grinding it out against the goliath. They are awesome and perhaps the right example of what a path like this would look like. Maybe someone from their team can chime in on the difficulties of building a search engine that works in the face of tremendous odds.

dylan604 2 hours ago | parent | prev [-]

I would imagine the users of DDG to be closer to a rounding error than an actual percentage of users. I'd imagine theGoog would love and hate to have 100%. They'd love it because all the data, and hate it as it would prove the monopoly. At the end of the day, the % that is not going to them probably doesn't cause theGoog to lose much sleep

Imustaskforhelp 2 hours ago | parent [-]

It's just so wild how great Duckduckgo is & how under-rated it is.

It's available in all major browsers. (Here in Zen browser, there isn't even a default search engine; rather, the start page asks you to choose between three options: Google, DuckDuckGo and Bing. Yes, if you press next it starts from Google, but Zen can just as well start from DDG, so it's not such a big deal.)

Duckduckgo is super amazing. I mean they are so amazing and their duck.ai or ai actually provides concise data instead of Google's AI

DDG is leaps ahead of Google in terms of everything. I found Kagi to be pleasant too, but with PPP its price might make sense in Europe and America, and privacy isn't/shouldn't be only for those who pay. So DDG is great for me personally and I can't recommend it enough for most cases.

Brave/Startpage is a second but DDG is so good :)

It just works. (For most cases, that is. The only use case where I use Google is uploading an image to get more images like it, or using an image as a search query. I just do !gi and open images.google.com, but I only use this function very rarely. Bangs are an amazing feature of DDG.)

dylan604 an hour ago | parent [-]

I use DDG myself. I just assumed that I'm not a very sophisticated user as I've never had it not serve my needs based on how other people here say it's not very good.

renegat0x0 3 hours ago | parent | prev [-]

Scraping is hard, and not that hard at the same time. There are many scraping projects, so with a few lines you can implement a scraper using curl_cffi or Playwright.

People complain that the user-agent needs to be filled in. Boo-hoo, are we on Hacker News, or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
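
A minimal sketch of what "just provide cookies and a user-agent" looks like, using only Python's standard urllib rather than curl_cffi or Playwright (that swap is mine, to keep the example dependency-free). The helper names, the bot name and the cookie values are hypothetical.

```python
import urllib.request


def build_request(url, user_agent, cookies=None):
    """Build a request that carries an explicit User-Agent and optional cookies."""
    headers = {"User-Agent": user_agent}
    if cookies:
        # Cookies travel in a single "Cookie" header as name=value pairs.
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return urllib.request.Request(url, headers=headers)


def fetch(url, user_agent, cookies=None, timeout=10):
    """Fetch a page as text; raises urllib.error.URLError on failure."""
    request = build_request(url, user_agent, cookies)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Building the request does not touch the network; fetch() does.
req = build_request(
    "https://example.com/",
    "my-crawler/0.1 (+https://example.com/bot)",  # hypothetical bot identity
    {"session": "abc", "lang": "en"},
)
print(req.header_items())
```

This covers the polite, identified case only. Sites that fingerprint TLS or require JavaScript are where tools like curl_cffi and Playwright come in.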

I myself have implemented a simple solution that is able to jump through many hoops and provide a JSON response. Simple and easy [0].

On the other hand, it was always an arms race. It will be. Eventually all content will be protected behind walled gardens; there is no getting around it.

Search engines affect me less and less every day. I have my own small "index" / "bookmarks" with many domains, GitHub projects, YouTube channels [1].

Since the database is so big, the places I use most are extracted into a simple and fast web page using an SQLite table [2]. Scraping done right is not a problem.

[0] https://github.com/rumca-js/crawler-buddy

[1] https://github.com/rumca-js/Internet-Places-Database

[2] https://rumca-js.github.io/search
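
A minimal sketch of the personal-index idea, assuming a single table with url, title and tags columns. That schema is hypothetical and is not the one the linked projects use; the rows are made-up examples, and matching is plain LIKE rather than full-text search.

```python
import sqlite3


def build_index(entries):
    """Create an in-memory SQLite index from (url, title, tags) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE places (url TEXT PRIMARY KEY, title TEXT, tags TEXT)"
    )
    conn.executemany("INSERT INTO places VALUES (?, ?, ?)", entries)
    conn.commit()
    return conn


def search(conn, term):
    """Return (url, title) rows whose title or tags contain `term`."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT url, title FROM places "
        "WHERE title LIKE ? OR tags LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return rows.fetchall()


# Made-up example rows.
conn = build_index([
    ("https://example.org/engine", "Indie Search Engine", "search,indie"),
    ("https://example.org/news", "Tech News", "news,tech"),
])
print(search(conn, "search"))
```

Swapping `":memory:"` for a file path gives a database you can ship alongside a static page.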

SyneRyder an hour ago | parent | next [-]

+1 so much for this. I have been doing the same, an SQLite database of my "own personal internet" of the sites I actually need. I use it as a tiny supplementary index for a metasearch engine I built for myself - which I actually did to replace Kagi.

Building a metasearch engine is not hard to do (especially with AI now). It's so liberating when you control the ranking algorithm, and can supplement what the big engines provide as results with your own index of sites and pages that are important to you. I admit, my results & speed aren't as good as Kagi, but still good enough that my personal search engine has been my sole search engine for a year now.
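
For a sense of how small the core of a metasearch engine can be, here is a sketch that merges ranked URL lists from several engines with reciprocal rank fusion and then boosts anything found in a personal index. RRF, the constant k=60 and the boost value are my choices for illustration, not necessarily what the setup described above uses.

```python
from collections import defaultdict


def merge_results(ranked_lists, personal_urls=frozenset(), k=60, boost=0.05):
    """Merge several ranked URL lists into one list, best first.

    Each engine contributes 1 / (k + rank) per URL (reciprocal rank fusion).
    URLs that also appear in the personal index get a fixed score boost.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    for url in personal_urls:
        if url in scores:
            scores[url] += boost
    # Sort by score (descending), breaking ties by URL for stable output.
    return sorted(scores, key=lambda u: (-scores[u], u))


engine_a = ["a.example", "b.example", "c.example"]  # placeholder result lists
engine_b = ["b.example", "d.example"]
print(merge_results([engine_a, engine_b]))
print(merge_results([engine_a, engine_b], personal_urls={"c.example"}))
```

With no personal index, the URL that both engines return wins; with `c.example` in the personal index, it jumps to the top. That is the "you control the ranking" part in a dozen lines.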

If a site doesn't want me to crawl them, that's fine. I probably don't need them. In practice it hasn't gotten in the way as much as I might have thought it would. But I do still rely on Brave / Mojeek / Marginalia to do much of the heavy lifting for me.

I especially appreciate Marginalia for publicly documenting as much about building a search engine as they have: https://www.marginalia.nu/log/

ghm2199 42 minutes ago | parent | prev [-]

When I saw the Internet-Places-Database I thought it was an index on some sort of PoI and I got curious. But the personal internet spiel is pretty cool. One good addition to this could be the Foursquare PoI dataset for places search: https://opensource.foursquare.com/os-places/

hamdingers 5 hours ago | parent | prev | next [-]

> If other tech companies really wanted to break this monopoly, why can't they just do it

Google is a verb, nobody can compete with that level of mindshare.

observationist 5 hours ago | parent | next [-]

A big part of it is about the legal minefield if you presented any sort of real threat to Google. Nobody wants to wager billions in infrastructure and IP against Google or Apple or Microsoft, even if you could whip up a viable competing product in a weekend (for any given product.)

Part of it is also the ecosystem - don't threaten adtech, because the wrong lawsuits, the wrong consumer trend, the wrong innovation that undercuts the entire adtech ecosystem means they lose their goose with the golden eggs.

Even if Kagi or some other company achieves legitimate mindshare in search, they still don't have the infrastructure and ancillary products and cash reserves of Google, etc. The second they become a real "threat" in Google's eyes, they'd start seeing lawsuits over IP and hostile and aggressive resource acquisitions to freeze out their expansion, arbitrary deranking in search results, possible heightened government audits and regulatory interactions, and so on. They have access to a shit ton of legal levers, not to mention the whole endless flood of dirty tricks money can buy (not that Google would ever do that.)

They're institutional at this point; they're only going away if/when government decides to break it up and make things sane again.

wongarsu 5 hours ago | parent | prev | next [-]

Xerox is a verb, but most copy machines I see are made by their competition

hamdingers 5 hours ago | parent [-]

Wonder why that could be?

https://www.nytimes.com/1975/07/31/archives/xerox-settlement...

eikenberry 5 hours ago | parent | prev | next [-]

Kleenex isn't the only brand of tissues sold in stores.

iamacyborg an hour ago | parent | prev | next [-]

How’s that working out for Hoover in the UK?

cowsandmilk 3 hours ago | parent | prev | next [-]

Licensing their index doesn’t change that.

Zyst 5 hours ago | parent | prev [-]

So were AOL, and Skype

dylan604 2 hours ago | parent [-]

I don't ever recall anyone using AOL as a verb. How would you do that?

walls 5 hours ago | parent | prev | next [-]

A huge amount of the web is only crawlable with a googlebot user-agent and specific source IPs.
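
The user-agent half of this is visible in robots.txt itself. A minimal sketch with Python's standard urllib.robotparser, using a made-up robots.txt that admits Googlebot and turns everyone else away (the file contents and the crawler name MyNewCrawler are hypothetical). Server-side filtering by source IP is a separate mechanism and isn't shown here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, nobody else may.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Check whether `user_agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("Googlebot", "https://example.com/page"))
print(allowed("MyNewCrawler", "https://example.com/page"))
```

The first check comes back allowed and the second does not, so a new crawler that honors robots.txt never even gets to see the page.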

Imustaskforhelp 4 hours ago | parent | next [-]

> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.

I have always wondered: how does the Wayback Machine work? Is there no way that we can use the Wayback archive and then run an index on top of every Wayback archive somehow?

ghm2199 3 hours ago | parent [-]

You can read https://hackernoon.com/the-long-now-of-the-web-inside-the-in... It was a nice look into their infrastructure. One could theoretically build it. A few things stand out:

1. IIUC it depends a lot on "Save Page Now" democratization, which could work, but it's not like a crawler.

2. In the absence of Alexa they depend quite heavily on Common Crawl, which is quite crazy because there literally is no other place to go. I don't think they can use Google's syndicated API, because they would then start showing ads in their database, which is garbage that would strain their tiny storage budget.

3. Minor from a software engineering perspective but important for the survival of the company: since they are a store of record, converting that into an index would need a good legal team to battle Google. They do have the DoJ's recent ruling in their favor.

deepsquirrelnet 4 hours ago | parent | prev | next [-]

I do not know a lot about this subject, but couldn’t you make a pretty decent index off of common crawl? It seems to me the bar is so low you wouldn’t have to have everything. Especially if your goal was not monetization with ads.

ghm2199 3 hours ago | parent [-]

I think someone had commented on another thread about SerpAPI the other day that Common Crawl is quite small. It would be a start. I think the key to a good index people will use is freshness of the results. You need good recall for a search engine; precision tuning/re-ranking is not going to help otherwise.

charcircuit 3 hours ago | parent | prev | next [-]

If a crawler offered enough money they could be allowed too. It's not like Google has exclusive crawling rights.

hsuduebc2 5 hours ago | parent | prev | next [-]

I don’t think it’s comparable to today’s AI race.

Google has a monopoly, an entrenched customer base, and stable revenue from a proven business model. Anyone trying to compete would have to pour massive money into infrastructure and then fight Google for users. In that game, Google already won.

The current AI landscape is different. Multiple players are competing in an emerging field with an uncertain business model. We’re still in the phase of building better products, where companies started from more similar footing and aren’t primarily battling for customers yet. In that context, investing heavily in the core technology can still make financial sense. A better comparison might be the early days of car makers, or the web browser wars before the market settled.

ghm2199 4 hours ago | parent [-]

> ... stable revenue from a proven business mode... In that game, Google already won.

But if they were to pour that money strategically to capture market share one of two things would happen if google was replaced/lost share:

1. it would be the start of the commoditization of search. i.e. search engine/index would become a commodity and more specialized and people could buy what they want and compete.

2. A new large tech company takes the reins, in which case it would be as bad as it is now.

Like, what I don't get is that if other big tech companies actually broke apart the monopoly on search, several Google dominoes in mobile devices, browser tech, and location capabilities would fall. It would be a massive injection of new competition into the economy: lots of people would spend more dollars across the space (and ad-driven buying too), and money would not accrue in an offshore tax haven in Ireland.

To play devil's advocate, I think the only reason it's not happening is that Meta, Apple, and Microsoft have very different moats/business models to profit off. They have all been stung at one time or another, in small or big ways, for trying to build something that could compete but failed. MS with Bing, Meta with Facebook search, Foursquare (not big tech, but still) with Marauder's Map.

xnx 5 hours ago | parent | prev | next [-]

> If other tech companies really wanted to break this monopoly, why can't they just do it

Companies would rather sue than try and compete by investing their own money.

paxys 5 hours ago | parent | prev [-]

Apple had a chance to break Google's search monopoly, but they chose to take billions from them instead.

Microsoft had a chance (well another chance, after they gave up IE's lead) to break up Google's browser monopoly, but they decided to use Chromium for free instead.

Ultimately all these decisions come down to what's more profitable, not what's in the best interests of the public. We have learned this lesson x1000000. Stop relying on corporations to uphold freedoms (software or otherwise), because that simply isn't going to happen.

charcircuit 3 hours ago | parent [-]

>but they chose to take billions from them instead.

They chose to use Google with a revenue sharing agreement. Google is very well monetized. It would be very difficult for Apple to monetize their own search as well as Google can.

>they decided to use Chromium

Windows ships with Microsoft Edge as the browser which Microsoft has full control over.