Waiting for dawn in search: Search index, Google rulings and impact on Kagi(blog.kagi.com)
177 points by josephwegner 5 hours ago | 121 comments
ghm2199 4 hours ago | parent | next [-]

> Building a comparable one from scratch is like building a parallel national railroad..

Not to be pedantic here, but I do have a noob question or two:

1. One is building the index, which is a lot harder without a google offering its own API to boot. If other tech companies really wanted to break this monopoly, why can't they just do it, like they did with LLM training for base models with the infamous "pile" dataset? The upshot of offering this index as a public good is that it would break not just Google's search monopoly but also other monopolies like Android, which would bring a breath of fresh air to a myriad of UX (mobile devices, browsers, maps, security). So, why don't they just do this already?

2. The other question is about "control", which the DoJ has provided guidance for but not yet enforced. IANAL, but why can't a state's attorney general enforce this?

oh_fiddlesticks 3 hours ago | parent | next [-]

> 1. One is building the index, which is a lot harder without a google offering its own API to boot. If other tech companies really wanted to break this monopoly, why can't they just do it?

FTA:

> Context matters: Google built its index by crawling the open web before robots.txt was a widespread norm, often over publishers’ objections. Today, publishers “consent” to Google’s crawling because the alternative - being invisible on a platform with 90% market share - is economically unacceptable. Google now enforces ToS and robots.txt against others from a position of monopoly power it accumulated without those constraints. The rules Google enforces today are not the rules it played by when building its dominance.

creato 2 hours ago | parent | next [-]

robots.txt was being enforced in court before google even existed, let alone before google got so huge:

> The robots.txt played a role in the 1999 legal case of eBay v. Bidder's Edge,[12] where eBay attempted to block a bot that did not comply with robots.txt, and in May 2000 a court ordered the company operating the bot to stop crawling eBay's servers using any automatic means, by legal injunction on the basis of trespassing.[13][14][12] Bidder's Edge appealed the ruling, but agreed in March 2001 to drop the appeal, pay an undisclosed amount to eBay, and stop accessing eBay's auction information.[15][16]

https://en.wikipedia.org/wiki/Robots.txt
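
For anyone who hasn't looked at how a well-behaved crawler consults robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name `NewIndexBot`, and the example.com URL are all made up for illustration; they are not from eBay or any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one well-known crawler is allowed everywhere,
# every other bot is kept out of /listings/.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /listings/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/listings/42"
googlebot_ok = is_allowed(ROBOTS_TXT, "Googlebot", URL)
newbot_ok = is_allowed(ROBOTS_TXT, "NewIndexBot", URL)
```

Note that robots.txt is purely advisory at the protocol level: nothing in the code above stops a crawler from skipping the check, which is why the enforcement question ends up in court.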

dragonwriter 11 minutes ago | parent | next [-]

Not only was eBay v. Bidder's Edge technically after Google existed, not before; more critically, the slippery-slope interpretation of California trespass to chattels law that the District Court relied on was considered and rejected by the California Supreme Court in Intel v. Hamidi (2003), and similar logic applied to other states' trespass to chattels laws has been rejected by other courts since. eBay v. Bidder's Edge was an early aberration in the application of the law, not something that established or reflected a lasting norm.

throw-the-towel 44 minutes ago | parent | prev | next [-]

Nitpick: Google incorporated in 1998, so, before the Bidder's Edge case.

yuuxheu 42 minutes ago | parent | prev [-]

It’s an article from kagi.com, on one of the most kagi-astroturfed forums on the planet. I’m sure they did not expect a single critical reader, as such they do not need facts.

baggachipz 2 hours ago | parent | prev | next [-]

A classic case of climbing the wall, and pulling the ladder up afterward. Others try to build their own ladder, and Google uses their deep pockets and political influence to knock the ladder over before it reaches the top.

dylan604 36 minutes ago | parent [-]

Why does Google even need to know about your ladder? Build the bot, scale it up, save all the data, then release. You can now remove the ladder and obey robots.txt just like G. Just like G, once you have the data, you have the data.

Why would you tell G that you are doing something? Why tell a competitor your plans at all? Just launch your product when the product is ready. I know that's anathema to SV startup logic, but in this case it's good business

ghm2199 2 hours ago | parent | prev [-]

True. But the thing is, if one says "We will make sure your site is in a worldwide, freely available index" which is kept fresh, google's monopoly ship already begins to take on water. Here is an appropriate line from a completely different domain, from The Economist on the Chinese govt's weaponization of rare earths[1]:

> Reducing its share from 90% to 80% may not sound like much, but it would imply a doubling in size of alternative sources of supply, giving China’s customers far more room for manoeuvre.

[1] https://archive.ph/POkHZ#selection-1233.117-1233.302
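
The arithmetic behind that quote is easy to check. A minimal sketch, assuming the total market size stays constant (the function name is just for illustration):

```python
def alternative_growth(dominant_before_pct: int, dominant_after_pct: int) -> float:
    """Factor by which everyone else's combined share grows."""
    others_before = 100 - dominant_before_pct
    others_after = 100 - dominant_after_pct
    return others_after / others_before


# The dominant player drops from 90% to 80%,
# so everyone else goes from 10% to 20% of the market.
growth = alternative_growth(90, 80)
```

The same ten points that barely dent the incumbent are a 2x change for the alternatives.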

jeromechoo an hour ago | parent | prev | next [-]

Building an index is easy. Building a fresh index is extremely hard.

Ranking an index is hard. It's not just BM25 or cosine similarity. How do you prioritize certain domains over others? How do you rank homepages that typically have no real content in them for navigational queries?
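
To make the "it's not just BM25" point concrete, here is a rough sketch of plain BM25 scoring. The tokenizer (lowercase, split on whitespace), the parameter defaults `k1=1.5` and `b=0.75`, and the three toy documents are assumptions for illustration, not anything a production engine would ship.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: long documents need more matches to score well.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


DOCS = [
    "welcome to example bank",
    "example bank login page for online banking customers",
    "history of banking in europe",
]
scores = bm25_scores("example bank login", DOCS)
```

Everything here is purely lexical. Nothing in the score knows that one domain is more trustworthy than another, or that a near-empty homepage is the right answer to a navigational query, which is exactly the hard part.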

Changing the behavior of 90% of the non-Chinese internet is unraveling 25 years and billions of dollars spent on ensuring Google is the default and sometimes only option.

Historically, it takes a significant technological counter position or anti-trust breakup for a behemoth like Google to lose its footing. Unfortunately for us, Google is currently competing well in the only true technological threat to their existence to appear in decades.

KellyCriterion 3 hours ago | parent | prev | next [-]

Scraping is hard. Very good scraping is even harder. And today, being a scraping business is veeery difficult; there are some "open"/public indices, but none of these other indices ever took off

ghm2199 3 hours ago | parent | next [-]

Well sure, I don't dispute that it's hard, but if the top tech companies put their heads together, I am sure that Meta, Apple, and MS, for example, have enough talent between them to make an open source index, if only to reap gains from the de-monopolization of it all.

Imustaskforhelp 2 hours ago | parent [-]

I mean, doesn't microsoft have bing?

ghm2199 2 hours ago | parent [-]

Yeah but no one uses it. I am not even sure the people who are forced to use it like using it, because it was productized pretty poorly. After all, who wants another google? They invested 100 billion dollars, which is a lot of wasted money TBH.

Search indexes are hard, surely, but if you were to strip it down to just a good index in the browser, made it free, and kept it fresh, it cannot cost 100 billion dollars to build. Then you use this DoJ decision to fight Google so that a free index isn't denied equal rights on Chrome, and you have a massive shot at a win for a LOT less money.

Imustaskforhelp 2 hours ago | parent [-]

> Yeah but no one uses it. I am not even sure people like using it because it was productized it pretty poorly. They invested 100 Billion dollars, which is a lot of wasted money TBH.

I mean... Duckduckgo uses bing api iirc and I use duckduckgo and many people use duckduckgo.

I also used bing once because bing used to cache websites which weren't available in the wayback archive. I don't know how, but it was a pretty cool solution to a problem.

I hate bing too and I am kind of interested in ecosia/qwant's future as well (yes there's kagi too and good luck to kagi as well! but I am currently still staying on duckduckgo)

ghm2199 an hour ago | parent | next [-]

Duck duck go is really cool. I am almost fully rooting for them and they are my default mobile and web browser.

The small distributed team grinding it out against the goliath. They are awesome and perhaps the right example of what a path like this would look like. Maybe someone from their team can chime in on the difficulties of building a search engine that works in the face of tremendous odds.

dylan604 29 minutes ago | parent | prev [-]

I would imagine the users of DDG to be closer to a rounding error than an actual percentage of users. I'd imagine theGoog would love and hate to have 100%. They'd love it because all the data, and hate it as it would prove the monopoly. At the end of the day, the % that is not going to them probably doesn't cause theGoog to lose much sleep

Imustaskforhelp a few seconds ago | parent [-]

It's just so wild how great Duckduckgo is & how under-rated it is.

It's available in all major browsers. (Here in Zen browser there isn't even a default search engine; instead the start page asks you to pick between three options: Google, DuckDuckGo, and Bing. Yes, if you just press next it starts from Google, but Zen can also start from DDG, so it's not such a big deal.)

Duckduckgo is super amazing. I mean they are so amazing and their duck.ai or ai actually provides concise data instead of Google's AI

DDG is leaps ahead of Google in terms of everything. I found Kagi to be pleasant too, and with PPP its pricing might make sense in Europe and America, but privacy isn't/shouldn't be only for those who pay. So DDG is great for me personally and I can't recommend it enough for most cases.

Brave/Startpage is a second but DDG is so good :)

It just works

renegat0x0 an hour ago | parent | prev [-]

Scraping is hard, and at the same time not that hard. There are many scraping projects, so with a few lines you can implement a scraper using curl cffi or playwright.

People complain that the user-agent needs to be filled in. Boo-hoo, are we on hacker news, or what? Can't we just provide cookies and a user-agent? Not a big deal, right?
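
As a rough illustration of how little code "provide cookies and a user-agent" takes, here is a standard-library sketch. This is not how crawler-buddy does it; the bot name, cookie, and URL are placeholders, and `build_request` / `fetch` are names made up for this example.

```python
from urllib.request import Request, urlopen


def build_request(url: str, user_agent: str, cookies: dict[str, str]) -> Request:
    """Build a GET request carrying an explicit User-Agent and Cookie header."""
    headers = {"User-Agent": user_agent}
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if cookie_header:
        headers["Cookie"] = cookie_header
    return Request(url, headers=headers)


def fetch(url: str, user_agent: str, cookies: dict[str, str], timeout: float = 10.0) -> bytes:
    """Fetch the page body; raises urllib.error.URLError on network failures."""
    with urlopen(build_request(url, user_agent, cookies), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch() does.
request = build_request("https://example.com/", "ExampleBot/0.1", {"session": "abc123"})
```

Sites with serious bot protection look at far more than these two headers (TLS fingerprints, JavaScript challenges), which is presumably why tools like curl cffi and playwright exist.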

I myself have implemented a simple solution that is able to go through many hoops, and provide JSON response. Simple and easy [0].

On the other hand, it was always an arms race, and it always will be. Eventually all content will be protected behind walled gardens; there is no getting around it.

Search engines affect me less, and less every day. I have my own small "index" / "bookmarks" with many domains, github projects, youtube channels [1].

Since the database is so big, the places I use most are extracted into a simple and fast web page using an SQLite table [2]. Scraping done right is not a problem.

[0] https://github.com/rumca-js/crawler-buddy

[1] https://github.com/rumca-js/Internet-Places-Database

[2] https://rumca-js.github.io/search

hamdingers 4 hours ago | parent | prev | next [-]

> If other tech companies really wanted to break this monopoly, why can't they just do it

Google is a verb, nobody can compete with that level of mindshare.

observationist 3 hours ago | parent | next [-]

A big part of it is about the legal minefield if you presented any sort of real threat to Google. Nobody wants to wager billions in infrastructure and IP against Google or Apple or Microsoft, even if you could whip up a viable competing product in a weekend (for any given product.)

Part of it is also the ecosystem - don't threaten adtech, because the wrong lawsuits, the wrong consumer trend, the wrong innovation that undercuts the entire adtech ecosystem means they lose their goose with the golden eggs.

Even if Kagi or some other company achieves legitimate mindshare in search, they still don't have the infrastructure and ancillary products and cash reserves of Google, etc. The second they become a real "threat" in Google's eyes, they'd start seeing lawsuits over IP and hostile and aggressive resource acquisitions to freeze out their expansion, arbitrary deranking in search results, possible heightened government audits and regulatory interactions, and so on. They have access to a shit ton of legal levers, not to mention the whole endless flood of dirty tricks money can buy (not that Google would ever do that.)

They're institutional at this point; they're only going away if/when government decides to break it up and make things sane again.

wongarsu 3 hours ago | parent | prev | next [-]

Xerox is a verb, but most copy machines I see are made by their competition

hamdingers 3 hours ago | parent [-]

Wonder why that could be?

https://www.nytimes.com/1975/07/31/archives/xerox-settlement...

eikenberry 3 hours ago | parent | prev | next [-]

Kleenex isn't the only brand of tissues sold in stores.

cowsandmilk 2 hours ago | parent | prev | next [-]

Licensing their index doesn’t change that.

Zyst 3 hours ago | parent | prev [-]

So were AOL, and Skype

dylan604 34 minutes ago | parent [-]

I don't ever recall anyone using AOL as a verb. How would you do that?

walls 3 hours ago | parent | prev | next [-]

A huge amount of the web is only crawlable with a googlebot user-agent and specific source IPs.

Imustaskforhelp 2 hours ago | parent | next [-]

> And given you-know-what, the battle to establish a new search crawler will be harder than ever. Crawlers are now presumed guilty of scraping for AI services until proven innocent.

I have always wondered how the Wayback Machine works. Is there no way we can take the Wayback archive and run an index on top of it somehow?

ghm2199 2 hours ago | parent [-]

You can read https://hackernoon.com/the-long-now-of-the-web-inside-the-in... it was a nice look into their infrastructure. One could theoretically build it. A few things stand out:

1. IIUC it depends a lot on "Save Page Now" democratization, which could work, but it's not like a crawler.

2. In the absence of Alexa they depend quite heavily on Common Crawl, which is quite crazy because there literally is no other place to go. I don't think they can use Google's syndicated API, because they would then start storing ads in their database, which is garbage that would strain their tiny storage budget.

3. Minor from a software engineering perspective but important for the survival of the company: since they are an artifact of record storage, converting that to an index would need a good legal team to battle Google. They do have the DoJ's recent ruling in their favor.

deepsquirrelnet 2 hours ago | parent | prev | next [-]

I do not know a lot about this subject, but couldn’t you make a pretty decent index off of common crawl? It seems to me the bar is so low you wouldn’t have to have everything. Especially if your goal was not monetization with ads.

ghm2199 2 hours ago | parent [-]

I think someone commented on another thread about SerpAPI the other day that Common Crawl is quite small. It would be a start, but I think the key to an index people will actually use is freshness of the results. You need good recall for a search engine; precision tuning/re-ranking is not going to help otherwise.

charcircuit an hour ago | parent | prev [-]

If a crawler offered enough money they could be allowed too. It's not like Google has exclusive crawling rights.

hsuduebc2 4 hours ago | parent | prev | next [-]

I don’t think it’s comparable to today’s AI race.

Google has a monopoly, an entrenched customer base, and stable revenue from a proven business model. Anyone trying to compete would have to pour massive money into infrastructure and then fight Google for users. In that game, Google already won.

The current AI landscape is different. Multiple players are competing in an emerging field with an uncertain business model. We’re still in the phase of building better products, where companies started from more similar footing and aren’t primarily battling for customers yet. In that context, investing heavily in the core technology can still make financial sense. A better comparison might be the early days of car makers, or the web browser wars before the market settled.

ghm2199 2 hours ago | parent [-]

> ... stable revenue from a proven business model... In that game, Google already won.

But if they were to pour that money strategically to capture market share one of two things would happen if google was replaced/lost share:

1. it would be the start of the commoditization of search. i.e. search engine/index would become a commodity and more specialized and people could buy what they want and compete.

2. A new large tech company takes the reins. In which case it would be as bad as this time.

Like what I don't get is that if other big tech companies actually broke apart the monopoly on search, several Google dominoes in mobile devices, browser tech, and location capabilities would fall. It would be a massive injection of new competition into the economy, lots of people would spend more dollars across the space (and ad-driven buying too), and money would not accrue in an offshore tax haven in Ireland.

To play devil's advocate, I think the only reason it's not happening is that Meta, Apple, and Microsoft have very different moats/business models to profit from. They have all been stung at one time or another, in small or big ways, for trying to build something that could compete but failed: MS with Bing, Meta with Facebook search, Foursquare (not big tech, but still) with Marauder's Map.

xnx 3 hours ago | parent | prev | next [-]

> If other tech companies really wanted to break this monopoly, why can't they just do it

Companies would rather sue than try and compete by investing their own money.

paxys 3 hours ago | parent | prev [-]

Apple had a chance to break Google's search monopoly, but they chose to take billions from them instead.

Microsoft had a chance (well another chance, after they gave up IE's lead) to break up Google's browser monopoly, but they decided to use Chromium for free instead.

Ultimately all these decisions come down to what's more profitable, not what's in the best interests of the public. We have learned this lesson x1000000. Stop relying on corporations to uphold freedoms (software or otherwise), because that simply isn't going to happen.

charcircuit an hour ago | parent [-]

>but they chose to take billions from them instead.

They chose to use Google with a revenue sharing agreement. Google is very well monetized. It would be very difficult for Apple to monetize their own search as well as Google can.

>they decided to use Chromium

Windows ships with Microsoft Edge as the browser which Microsoft has full control over.

WhyNotHugo 4 hours ago | parent | prev | next [-]

The statistics in this article sound like garbage to me.

Google used by 90% of the world?

~20% of the human population lives in countries where Google is blocked.

OTOH, Baidu is the #1 search engine in China, which has over 15% of the world’s population… but doesn’t reach 1%?

These stats are made measuring US-based traffic, rather than “worldwide” as they claim.

weisnobody 3 hours ago | parent | next [-]

Yes, the stats don't make sense. It appears to be an issue with StatCounter.

The Search Engine wikipedia article [1] has a section on Russia and East Asia market share, which confirms that the roll up used for world wide counts is off, unless the number of people using the Internet is drastically different in some of the countries.

Russia:

  * Yandex: 70.7%
  * Google: 23.3%

China:

  * Baidu: 59.3%
  * Other domestic engines: "smaller shares"
  * Bing: 13.6%

South Korea:

  * Naver: 59.8%
  * Google: 35.4%

Japan:

  * Google: 76.2%
  * Yahoo! Japan: 15.8%

[1] https://en.wikipedia.org/wiki/Search_engine#Market_share

dylan604 26 minutes ago | parent [-]

Maybe it's the same logic that says you can lower the prices of things >100%

lolc 4 hours ago | parent | prev | next [-]

I guess they'd argue that the people in China don't count, because people in China don't get to choose Google. But yeah, the stats they use from "StatCounter" are clearly not representative for what the world uses.

elAhmo 3 hours ago | parent [-]

You can argue that people outside of China don't get to choose something other than Google. Sure, there are recent pushes with default search engine choices and similar initiatives, but there is a reason why Google is paying hundreds of millions of dollars to be the default search engine.

ivanjermakov 3 hours ago | parent | prev | next [-]

To be fair, Kagi won't be used in China either.

0x1ch 4 hours ago | parent | prev [-]

Google is only blocked in places where it would already be hard for a company with morals to work in, if not outright blocked as well. This probably represents traffic globally, excluding those places.

Instead of downvoting blindly, please state which countries currently blocking Google would willingly allow Kagi, an AI/privacy-focused search engine company, to exist in their domain. The results may surprise you!

direwolf20 2 hours ago | parent [-]

Google is not blocked in the USA.

0x1ch an hour ago | parent [-]

Interesting. I'm in the US and use Kagi everyday.

dylan604 25 minutes ago | parent [-]

I read it more as "company having morals". Not many US companies have "morals".

0x1ch 21 minutes ago | parent [-]

Google doesn't, Kagi seems to (hopefully). I meant this more as a jab at countries willing to block Google, as they're generally dictatorships / authoritarian in nature. Oh the irony, as an American saying this in 2026....

pfist 2 hours ago | parent | prev | next [-]

I am rooting for Kagi here, and I applaud their transparency on such matters. It is quite enlightening for someone like me who understands technology but knows little about the inner workings of search.

It remains to be seen how or if the remedies will be enforced, and, of course, how Google will choose to comply with them. I am not optimistic, but at least there is some hope.

As an aside: The 1998 white paper by Brin and Page is remarkable to read knowing what Google has become.

ApolloFortyNine 2 hours ago | parent | prev | next [-]

With Google's search engine making almost $200 billion a year in revenue, I'm not sure Kagi could afford what market rates would be here. They also spent billions developing the technology to crawl, index, and rank billions of pages. Factoring that in, again, I don't think a good price can be put on it.

What even is market rate? Kagi themselves admit there's no market; the one competitor quit providing the service.

Obviously Google doesn't want to become an index provider.

dangoor 2 hours ago | parent [-]

According to the article, the judge's memorandum said about index data access:

> Google must provide Web Search Index data (URLs, crawl metadata, spam scores) at marginal cost.

I'm guessing that the "marginal cost" of a search is small and that it's not connected to how much ad revenue that search is worth.

senko an hour ago | parent | prev | next [-]

A full up-to-date index of the searchable web should be a public commons good.

This would not only allow better competition in search, but fix the "AI scrapers" problem: No need to scrape if the data has already been scraped.

Crawling is technically a solved problem, as witnessed by everyone and their dog seemingly crawling everything. If pooled together, it would be cheaper and less resource intensive.

The secret sauce is in what happens afterwards, anyway.

Here's the idea in more detail: https://senkorasic.com/articles/ai-scraper-tragedy-commons

I'm under no illusion something like that will happen .. but it could.

azornathogron 3 minutes ago | parent | next [-]

Is crawling really solved?

Any naive crawler is going to run into the problem that servers can give different responses to different clients which means you can show the crawler something different to what you show real users. That turns crawling into an antagonistic problem where the crawler developers need to continually be on the lookout for new ways of servers doing malicious things that poison/mislead the index.

Otherwise you'll return junk spam results from spammers that lied to the crawler.

I've never done it so maybe it's easier than I imagine but I wouldn't be quick to assume that crawling is solved.
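
One naive way to picture the cloaking problem: fetch the same URL once presenting as a crawler and once as a browser, then compare what comes back. A rough sketch with the standard library's `difflib`; the 0.8 threshold and the function name are arbitrary assumptions, and real pages differ between any two fetches (ads, timestamps, A/B tests), so this would be far too crude on its own.

```python
from difflib import SequenceMatcher


def looks_cloaked(body_as_crawler: str, body_as_browser: str, threshold: float = 0.8) -> bool:
    """Flag a page when the crawler's view and the browser's view diverge too much."""
    similarity = SequenceMatcher(None, body_as_crawler, body_as_browser).ratio()
    return similarity < threshold


# Made-up page bodies standing in for two fetches of the same URL.
crawler_view = "Fresh recipes and cooking tips for home cooks"
browser_view = "Buy cheap pills now!!! Limited offer, click here"

same_page = looks_cloaked(crawler_view, crawler_view)
swapped_page = looks_cloaked(crawler_view, browser_view)
```

The antagonistic part is that the server gets to decide which client it thinks it is talking to, so the "browser" fetch has to be convincing too.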

moebrowne 13 minutes ago | parent | prev [-]

Isn't this what CommonCrawl are doing?

https://commoncrawl.org/

whs 5 hours ago | parent | prev | next [-]

>Google: Google does not offer a public search API. The only available path is an ad-syndication bundle with no changes to result presentation - the model Startpage uses. Ad syndication is a non-starter for Kagi’s ad-free subscription model.[^1]

>Because direct licensing isn’t available to us on compatible terms, we - like many others - use third-party API providers for SERP-style results (SERP meaning search engine results page). These providers serve major enterprises (according to their websites) including Nvidia, Adobe, Samsung, Stanford, DeepMind, Uber, and the United Nations.

The customer list matches what is listed on SerpAPI's page (interestingly, DeepMind is on Kagi's list while they're a Google company...). I suppose Kagi needs to pen this because if SerpAPI shuts down they may lose access to Google, but they may already utilize multiple providers. In the past, Kagi employees have said that they have access to a Google API, but it seems that was not the case?

As a customer, the major implication of this is that even if Kagi's privacy policy says they try to not log your queries, it is sent to Google and still subject to Google's consumer privacy policy. Even if it is anonymized, your queries can still end up contributing to Google Trends.

xnx 5 hours ago | parent | prev | next [-]

> Because direct licensing isn’t available to us on compatible terms, we - like many others - use third-party API providers for SERP-style results

Crazy for a company to admit: "Google won't let us whitelabel their core product so we steal it and resell it."

eli 4 hours ago | parent | next [-]

Seems like an open question as to whether that violates any laws.

Another way to look at it is that if you publish a service on the web, you have limited rights to restrict what people do with it.

Isn't that the logic Google search relies on in the first place? I didn't give permission for Google to crawl and index and deep link to my site (let alone summarize and train LLMs on it). They just did it anyway, because it's on a public website.

malfist an hour ago | parent [-]

Google's stance is "I can copy you and you can't stop me" as well as "You can't copy me, I'll sue you"

techjamie 4 hours ago | parent | prev | next [-]

What's the alternative? Building a competing search index as a relative nobody on the web is very difficult from the outset, and is made more difficult by sites taking extra measures to stop bots in general now.

Google's crawler is given special privileges in this regard and can bypass basically all bot checks. Anyone else has to just wade through the mud and accept they can't index much of the web.

direwolf20 5 hours ago | parent | prev | next [-]

Pretty standard business practice though. There's no ethics in making money.

roywiggins 2 hours ago | parent | prev | next [-]

Is it much different than what Google AI Summaries do?

timeon 2 hours ago | parent | prev | next [-]

Even the article posted (and search itself) has a Google IP address.

shadowgovt 4 hours ago | parent | prev | next [-]

But in this current climate, they can admit it and then dare Google to tell them to stop... After Google has just had an antitrust ruling against it for dominating the search market.

Google doesn't really have a leg to stand on and they know it.

Ar-Curunir 4 hours ago | parent | prev [-]

Strange to pick on Kagi when there's much bigger companies on that list.

xnx 3 hours ago | parent [-]

Those companies allegedly have used SerpAPI (probably to check visibility), but not to resell a Google Search knock-off.

jiehong 14 minutes ago | parent | prev | next [-]

I think one side problem is that part of the web is not even searchable with a search engine.

Here are some examples:

- Discord

- WeChat (is it the web?)

- Rednote

- TikTok (partially)

- X (partially)

- JSTOR (it finds daily, but you find more stuff on the website directly)

- any stuff with a login, obviously.

ajdude 5 hours ago | parent | prev | next [-]

Does anyone else use the phrase "I'm going to google XYZ" while referring to actually searching it up on Kagi, DDG, or another search engine?

eli 4 hours ago | parent | next [-]

Ironically this is a bad thing for Google from a legal standpoint. If a term becomes "genericized" then it can lose trademark protection.

"Aspirin" is a famous example. It used to be a brand name for acetylsalicylic acid medication, but became such a common way to refer to it that in the US any company can now use it.

1-more 4 hours ago | parent [-]

Apparently the "lost in the Treaty of Versailles" explanation is a bit of a just-so story: https://history.stackexchange.com/questions/55729/why-did-ba...

shervinafshar 4 hours ago | parent | prev | next [-]

I've been using Kagi for the past few years, but I try to use a brand-agnostic language talking about web search; e.g. "I'm gonna search [the web] for it"; "Use your favorite search engine to look it up".

dooglius 4 hours ago | parent | prev | next [-]

Yeah, I don't feel the need to have conversations go on a tangent about explaining what Kagi is

jeremyjh 4 hours ago | parent | prev | next [-]

Yes, it’s like Xerox or Kleenex except it’s actually still a monopoly. I'm a happy Kagi user but I know hardly anyone else is.

kqr 4 hours ago | parent | prev | next [-]

I used to. Even when I actually used DDG. Now that I use Kagi (and thus am on the second web search service after I stopped using Google) it started to feel silly so I say "search the web" these days.

pixl97 4 hours ago | parent | prev | next [-]

Yes, but more in the past than now, simply because almost everybody seems to use google itself.

For example I'd hear people say "I'll Google that", then use Yahoo when they were still a major search engine.

dijksterhuis 4 hours ago | parent | prev | next [-]

nope, i say “i’m going to search for XYZ” or similar

bronson 3 hours ago | parent | prev | next [-]

Now my family usually says "I'm going to ask AI."

matkoniecz 3 hours ago | parent | prev | next [-]

yes, me

chroma205 4 hours ago | parent | prev [-]

> Does anyone else use the phrase "I'm going to google XYZ" while referring to actually searching it up on Kagi, DDG, or another search engine?

Not me. I only use Google.

Never used Kagi or DDG. Don’t care enough.

keeda 26 minutes ago | parent | prev | next [-]

Google's advantage is not just in its index and algorithms, it is that it has built a self-reinforcing flywheel that data mines human attention at massive scale to improve their search results, which in turn brings in more attention.

This comment (https://news.ycombinator.com/item?id=46709957) points out that Google got its start via PageRank, which essentially ranked sites based on links created by humans. As such, its primary heuristic was what humans thought was good content. Turns out, this is still how they operate.
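
For readers who haven't seen it, the core PageRank idea fits in a few lines. This is a textbook-style power iteration on a toy link graph, not Google's actual implementation; the four-page graph, the 0.85 damping factor, and the iteration count are assumptions for illustration.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Rank pages by repeatedly passing each page's score along its outgoing links."""
    pages = sorted(set(links) | {target for targets in links.values() for target in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline from the "random jump" term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: three pages link to "c", and "c" links back to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
top_page = max(ranks, key=ranks.get)
```

Each link acts as a vote cast by whoever wrote the linking page, which is the "what humans thought was good content" heuristic in its simplest form.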

Basically, as people search and navigate the results, Google harvests their clicks, hovers, dwell-time and other browsing behavior -- i.e. tracking what they pay attention to -- to extract critical signals to "learn" which pages the users actually found useful for the given query. This helps it rank results better and improve search overall, which keeps people coming back, which in turns gives them more queries and data, which improves their results... a never-ending flywheel.
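
A toy version of that feedback loop, to show the shape of it: blend a base relevance score with a smoothed click-through rate. To be clear, this is an illustrative assumption, not Google's ranking formula; the weight, the smoothing prior, and the URLs are all invented.

```python
def rerank(
    results: list[tuple[str, float]],
    clicks: dict[str, int],
    impressions: dict[str, int],
    weight: float = 0.5,
    prior_clicks: float = 1.0,
    prior_impressions: float = 10.0,
) -> list[tuple[str, float]]:
    """Reorder (url, base_score) pairs, boosting URLs that users click more often."""

    def blended(item: tuple[str, float]) -> float:
        url, base_score = item
        # Smoothed click-through rate, so rarely shown URLs are not judged on tiny samples.
        ctr = (clicks.get(url, 0) + prior_clicks) / (impressions.get(url, 0) + prior_impressions)
        return base_score * (1.0 + weight * ctr)

    return sorted(results, key=blended, reverse=True)


results = [("a.example", 1.00), ("b.example", 0.95)]
clicks = {"a.example": 5, "b.example": 80}
impressions = {"a.example": 100, "b.example": 100}
reranked = rerank(results, clicks, impressions)
```

The catch for competitors is the input data: the formula is trivial, but the `clicks` and `impressions` tables only fill up if people are already using your engine.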

And competitors have no hope of matching this, because if you look at the infrastructure Google has built to harvest this data, it is so much bigger than the massive index! They harvest data through Chrome, ad tracking, Android, Google Analytics, cookies (for which they built Gmail!), YouTube, Maps, and so much more. So to compete with Google Search, you don't need just a massive index, you also need the extensive web infra footprint to harvest user interactions at massive scale, meaning the most popular and widely deployed browser, mobile OS, ad footprint, analytics, email provider, maps...

This also explains why Google spends so many billions in "traffic acquisition costs" (i.e. payments for being the Search default) every year, because that is a direct driver to both, 1) ad revenue, and 2) maintaining its search quality.

This wasn't really a secret, but it turned out to be a major point in the recent Antitrust trial, which is why the proposed remedies (as TFA mentions) include the sharing of search index and "interaction data."

We all knew "if you're not paying for it, you're the product" but the fascinating thing with Google is:

  * They charge advertisers to monetize our attention;
  * They harvest our attention to better rank results;
  * They provide better results, which keeps us coming back, and giving them even more of our attention!

Attention is all you need, indeed.

sabslikesobs 3 hours ago | parent | prev | next [-]

I like that there's a list of primary sources at the bottom.

Kagi's AI assistant has been satisfying compared to Claude and ChatGPT, both of which insisted on having a personality no matter what my instructions said. Trying to do well-sourced research always pissed me off. With Kagi it gives me a summary of sources it's found and that's it!

direwolf20 5 hours ago | parent | prev | next [-]

I hope they cache search results to further reduce the number of calls to Google.

And Marginalia Search was not mentioned? Marginalia Search says they are licensing their index to Kagi. Perhaps it's counted under "Our own small-web index" which is highly misleading if true.

z64 3 hours ago | parent | next [-]

There is a practical limit: we can't cache results for too long, because search engine users are particularly sensitive to stale data, especially around current events. Without a holistic and reliable way to know when the cache ought to be invalidated, our caching is mostly focused on mitigating "abuse", e.g., someone / a bunch of people spamming the same search in a short timespan; no sense in repeating all those upstream calls.
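A rough sketch of that kind of short-lived cache, to illustrate the idea rather than describe Kagi's actual implementation. The class name, the 30-second TTL, the query normalization, and the fake upstream function are all assumptions.

```python
import time


class ShortTTLCache:
    """Remembers results per query for a few seconds only.

    Expired entries are overwritten on the next fetch; there is no
    background eviction in this sketch.
    """

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # normalized query -> (expiry time, results)

    def get_or_fetch(self, query, fetch):
        # Collapse case and whitespace so trivially different spellings share a key.
        key = " ".join(query.lower().split())
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        results = fetch(query)
        self._entries[key] = (now + self.ttl, results)
        return results


# Usage with a fake clock and a stand-in for the upstream call.
fake_now = [0.0]
upstream_calls = []


def fake_upstream(query):
    upstream_calls.append(query)
    return ["result for " + query]


cache = ShortTTLCache(ttl_seconds=30.0, clock=lambda: fake_now[0])
cache.get_or_fetch("Breaking News", fake_upstream)    # miss: calls upstream
cache.get_or_fetch("breaking   news", fake_upstream)  # hit: same normalized key
fake_now[0] = 31.0
cache.get_or_fetch("breaking news", fake_upstream)    # expired: calls upstream again
```

The short TTL is the whole point: it absorbs bursts of identical queries without serving stale results for long.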

Most "cost saving engineering" goes into finding cases/heuristics where we only need to use a subset of sources, omitting calls in the first place without compromising quality. For example, we probably don't need to fire all of our sources to service a query like "youtube" or "facebook".
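As a toy illustration of that heuristic (not the real routing logic), a navigational query can be matched against a small list and sent to a single source. The source names and the navigational list below are hypothetical.

```python
# Hypothetical source names and navigational-query list, for illustration only.
ALL_SOURCES = ["source_a", "source_b", "source_c", "small_web"]
NAVIGATIONAL = {"youtube", "facebook"}


def pick_sources(query):
    """Return the upstream sources worth calling for this query."""
    normalized = query.strip().lower()
    if normalized in NAVIGATIONAL:
        # One source is enough to find a well-known homepage.
        return ALL_SOURCES[:1]
    return list(ALL_SOURCES)


navigational = pick_sources(" YouTube ")
general = pick_sources("waypipe ubuntu")
```

A real system would presumably use richer signals than an exact-match list, but the saving comes from the same place: calls that are never made.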

Marginalia data is physically consolidated into the same infra that we use for small web results in our SERP, but also among other small scale sources besides those two. That line is simply referring directly to https://kagi.com/smallweb (https://github.com/kagisearch/smallweb).

xnx 3 hours ago | parent | prev | next [-]

> "Our own small-web index"

Has Kagi ever said what this is? I wouldn't be at all surprised if it is just kagi.com pages or a download of Wikipedia.

z64 3 hours ago | parent [-]

https://github.com/kagisearch/smallweb

packetlost 5 hours ago | parent | prev [-]

The index is not necessarily the code, but the dataset. IMO it would be better to be more open about the technical stack, but this doesn't feel dishonest to me.

stephen_cagle 4 hours ago | parent | prev | next [-]

One interesting point was the original PageRank algorithm greatly benefited from the fact that we kinda only had "text matching" search before Google (my memory was AltaVista at the time).

Because text matching was so difficult to search with, whenever you went to a site, it would often have a "web of trust" at the bottom where an actual human being had curated a list of other sites that you might like if you liked this site.

So you would often search with keywords (often literals), then find the first site, then recursively explore the web of trust links to find the best site.

My suspicion has always been that Google (PageRank) benefited greatly from the human curated "web of trust" at the bottom of pages. But once Google came out, search was much better, and so human beings stopped creating "web of trust" type things on their site.

I am making the point that Google effectively benefited from the large amount of human labor put into connecting sites via WOT, while simultaneously (inadvertently) destroying the benefit of curating a WOT. This means that by succeeding at what they did, they made it much more difficult for a Google#2 to come around and run the exact same game plan with even the exact same algorithm.

tldr; Google harvested the links that were originally curated by human labor, the incentive to create those links is gone now, so the only remaining "links" between things are now in the Google Index.

Addendum: I asked Claude to help me think of a metaphor, and I really liked this one as it is so similar.

``` "The railroad and the wagon trails"

Before railroads, collective human use created and maintained wagon trails through difficult terrain. The railroad company could survey these trails to find optimal routes. Once the railroad exists, the wagon trails fall into disuse and the pathfinding knowledge atrophies. A second railroad can't follow trails that are now overgrown. ```

keeda an hour ago | parent [-]

> I am making the point that Google effectively benefited from the large amount of human labor...

This is exactly right, but the thing most people miss is that Google has been using human intelligence at massive scale even to this day to improve their search results.

Basically, as people search and navigate the results, Google harvests their clicks, hovers, dwell-time and other browsing behavior to extract critical signals that help it "learn" which pages the users actually found useful for the given query. (Overly simplified: click on a link but click back within a minute to go to the next link -> downrank, but spend more time on that link -> uprank.)
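The "overly simplified" rule above can be sketched in a few lines. This is only the toy version described in the comment, not how any real engine weighs interaction data; the 60-second threshold, the log format, and the example URLs are assumptions.

```python
from collections import defaultdict

SHORT_DWELL_SECONDS = 60  # assumed threshold, from "click back within a minute"


def dwell_scores(click_log):
    """click_log: iterable of (query, url, dwell_seconds) tuples."""
    scores = defaultdict(int)
    for query, url, dwell in click_log:
        # Quick bounce -> downrank; longer stay -> uprank.
        scores[(query, url)] += -1 if dwell < SHORT_DWELL_SECONDS else 1
    return scores


def rerank(query, urls, scores):
    # sorted() is stable, so URLs with equal scores keep their original order.
    return sorted(urls, key=lambda u: -scores.get((query, u), 0))


# Hypothetical click log: two quick bounces from a.example, one long visit to b.example.
log = [
    ("waypipe", "a.example", 5),
    ("waypipe", "a.example", 10),
    ("waypipe", "b.example", 300),
]
scores = dwell_scores(log)
reranked = rerank("waypipe", ["a.example", "b.example", "c.example"], scores)
```

The ranking code is trivial; the hard part is having enough users to fill the click log.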

This helps it rank results better and improve search overall, which keeps people coming back and excluding competitors. It's like the web of trust again, except it's clicks of trust, and it's only visible to Google and is a never-ending self-reinforcing flywheel!

And if you look at the infrastructure Google has built to harvest this data, it is so much bigger than the massive index! They harvest data through Chrome, ad tracking, Android, Google Analytics, cookies (for which they built Gmail!), YouTube, Maps and so much more.

So to compete with Google Search, you don't need just a massive index, you also need the extensive web infra footprint to harvest user interactions at massive scale, which means the most popular and widely deployed browser, mobile OS, ad tracking, analytics script, email provider, maps, etc, etc.

This also explains why Google spent so many billions in "traffic acquisition costs" (i.e. payments for being the Search default) every year, because that was a direct driver to both, 1) ad revenue, and 2) maintaining its search quality.

This wasn't really a secret, but it (rightfully) turned out to be a major point in the recent Antitrust trial, which is why the proposed remedies (as TFA mentions) include the sharing of search index and "interaction data."

weisnobody 3 hours ago | parent | prev | next [-]

I think the crawled data should have to be shared, but I'm not convinced that Google should have to share their index.

It may be impracticable to share the crawled data, but from the standpoint of content providers, having a single entity collecting the information (rather than a bunch of people doing so) would seem to be better for everyone. We would likely need some form of robots.txt that allows the content provider to indicate how their content could be used (e.g., research, web search, AI, etc.).

The people accessing the crawled data would end up paying (reasonable) fees to access the level of data they want, and some portion of that fee would go to the content provider (30% to the crawler and 70% to the content provider? :P maybe).

Maybe even go so far as to allow the paywalled content providers to set a price on accessing their data for the different purposes. Should they be allowed to pick and choose who within those types should be allowed (or have it be based on violations of the terms of access)?
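For what it's worth, robots.txt has no notion of "purpose" today, but it does allow separate rule groups per crawler user-agent, which gets part of the way there. A small sketch using Python's standard `urllib.robotparser`; the crawler names and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler names; real crawlers publish their own user-agent tokens.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """True if the parsed robots.txt lets this user-agent fetch the URL."""
    return parser.can_fetch(agent, url)


search_ok = allowed("ExampleSearchBot", "https://example.com/blog/post")
ai_ok = allowed("ExampleAIBot", "https://example.com/blog/post")
other_private_ok = allowed("OtherBot", "https://example.com/private/x")
```

This only distinguishes who is crawling, not what the data is later used for, so the purpose-based permissions and fees proposed above would need something beyond the current protocol.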

It seems in part the content providers have the following complaints:

  * Too many crawlers (see note below re crawlers)
  * Crawlers not being friendly
  * Improper use of the crawled data
  * Not getting compensated for their content

Why not the index? The index, to me, is where a bunch of the "magic" happens and where individual companies could differentiate themselves from everyone else.

Why can't Microsoft retain Bing traffic when it's the default on stock Windows installs?

  * Do they not have enough crawled data?
  * Their index isn't very good?
  * Their searching of their index isn't good?
  * The way they present the data is bad?
  * Google is too entrenched?
  * Combination of the above?

There are several entities intending to crawl all / large portions of the Internet: Baidu, Bing, Brave, Google, DuckDuckGo, Gigablast, Mojeek, Sogou and Yandex [1]. That does not include any of the smaller entities, research projects, etc.

[1] https://en.wikipedia.org/wiki/Search_engine#2000s–present:_P... (2019)

sharpshadow 3 hours ago | parent | prev | next [-]

If Google provides a Search Index it will be the censored version therefore still politically acceptable. The “Layer 1” idea will not happen.

direwolf20 2 hours ago | parent [-]

That's why Kagi combines results from multiple sources, just as it does with Yandex.

yomismoaqui 4 hours ago | parent | prev | next [-]

One thing I have discovered after using AI chats that include a websearch tool is that I don't want to delve into different blogs, Medium posts, Stack Overflow threads with passive-aggressive mod comments, dismissing cookie banners... Sorry, I just want the info I'm looking for, I don't care for your personal expression or need to monetize your content.

There are other times (usually not work related) when I want to explore the web and discover some nice little blog or special corner on the net. This is what my RSS feed reader is for.

kqr 4 hours ago | parent [-]

With Kagi you can opt in to an LLM summary of the search result by appending a question mark to the query. It's a neat mechanism when it works!

user3939382 4 hours ago | parent | prev | next [-]

For anyone not acquainted, Kagi is excellent and the people who work there strike me as nice and competent. I’m a harsh critic usually. Highly recommended.

flkiwi 3 hours ago | parent [-]

I've gotten more value out of it than just about any ongoing subscription I have. It's clean, fast, deeply customizable (i.e., excluding "answers" websites or any other domain you never want to see again), and, for what it is, inexpensive. Honestly if Google (or Bing) worked like Kagi does, I'd trade some of the privacy for the utility.

jeffbee 4 hours ago | parent | prev | next [-]

"We will simply access the index" has always struck me as wild hand-waving that would instantly crumble at first contact with technical reality. "At marginal cost" is doing a huge amount of work in this article.

nige123 4 hours ago | parent | prev | next [-]

The user data (anonymised) and analytics also need to be shared.

the_arun 4 hours ago | parent | prev | next [-]

If google is serving 90% traffic & others are unable to enter - Doesn't that mean google is doing something right for the customer and others are unable to outcompete it? Isn't this how life works?

CGMthrowaway 4 hours ago | parent | next [-]

Google is allowed to be big, be better and win users. But happy customers is not the full test of monopolization. The real question is, "Could a meaningfully better search engine realistically displace Google today?" If the answer is no, then competition is broken.

xnx 3 hours ago | parent [-]

> "Could a meaningfully better search engine realistically displace Google today?”

ChatGPT clearly demonstrated that displacing Google is possible. All previous monopoly arguments seemed even more flimsy after that.

b3kart 2 hours ago | parent [-]

I think you’re proving the monopoly argument yourself: if the only way to compete with Google is an innovation that generations of scientists have been working towards, it does paint a grim picture of competition in this space. Besides, are we ignoring Gemini?

charcircuit an hour ago | parent [-]

Google already used AI and language models before ChatGPT came out. If you wanted a state of the art search / recommendation engine, you already needed those innovations from scientists.

rafterydj 4 hours ago | parent | prev | next [-]

This is a woefully naive view on the nature of monopolies. You could have made the same argument for Standard Oil.

hamdingers 3 hours ago | parent | prev | next [-]

Is the user's choice to use google a meaningful one when they're effectively the only game in town?

giantrobot 3 hours ago | parent | prev | next [-]

Google must be right for the customer because Google pays billions of dollars to be the default search engine for all the major browsers. And end users are notorious for changing application defaults.

soiltype 4 hours ago | parent | prev [-]

...No. Not at all. Not in the case of Google and generally that's not "how life works". If it was true, why would Google spend so much money to be the default search engine in so many devices/browsers?

WhereIsTheTruth 4 hours ago | parent | prev | next [-]

Kagi's "waiting for dawn" is just waiting for Google to legitimize their reseller business

Meanwhile, users pay a premium to pretend they're not using Google

Fascinating delusion

b3kart 4 hours ago | parent [-]

> Meanwhile, users pay a premium to pretend they're not using Google

My searches can’t be tied to me by Google for their ad targeting: this is worth paying a premium for, and I am glad Kagi are providing this service.

You seem to have a very limited understanding of the value Kagi provides.

yuugha1838 35 minutes ago | parent [-]

I have a limited understanding of the value Christianity provides. That neither means that Christianity provides no value, nor does it mean that God exists.

OGEnthusiast 5 hours ago | parent | prev | next [-]

Sounds like we need a nationalized search engine company then?

browningstreet 4 hours ago | parent [-]

I wouldn't trust a nationalized search engine company.

That said, there are projects like Common Crawl and in Europe, Ecosia + Qwant.

I personally would like to see a search engine PaaS and a music streaming library PaaS that would let others hook up and pay direct usage fees.

NitpickLawyer 4 hours ago | parent | next [-]

> and in Europe, Ecosia

I tried. It's just not good enough. Quick example: yesterday I set up a workstation with Ubuntu, wanting to try out wayland. One of the things I wanted was to run an app (w/ gui) from another (unprivileged) user under my own user. Ecosia gave me bad old stuff. Tried for a few minutes, nothing useful. Switched to google, one of the first results was about waypipe. Searched waypipe on ecosia. 1 and a half pages of old content. Glaringly, not one of those results was the ubuntu.manpages entry on waypipe. shrug

shadowgovt 4 hours ago | parent | prev [-]

An interoperable search index access standard might work. We've done something similar for peering and the backbone of the IP-layer interconnects themselves.

direwolf20 2 hours ago | parent [-]

You have to make it economically preferable, and there's no known solution to this. Large networks are still using their positions to bully smaller ones off the IP-layer internet backbone.

hsuduebc2 4 hours ago | parent | prev | next [-]

It is even worse that Google Search has become shit in recent years. So they gatekeep the only relevant information for themselves and don't use it to improve search quality. As always, if you have no competition, your innovation goes only towards cost reduction, not product improvement.

warkdarrior 3 hours ago | parent [-]

If Google Search is shit, why does Kagi want access to it?

JaggedJax 3 hours ago | parent [-]

They want access to the index. They will perform their own sorting to determine the best results to show from that index.

b3kart 3 hours ago | parent [-]

…without having advertiser interests to cater to.

ares623 4 hours ago | parent | prev [-]

Kagi should start building an index of sites that are trying to escape the current slop internet. I know they have the Small Web thing. But I’d like to see an index of a “neo internet” that blocks Google et al.

z64 2 hours ago | parent [-]

I've been tossing around the very early idea of seeing what we can do to elevate alcoves of the web such as Gemini[1] through Kagi. I am slightly conscious that some people might not like us operating in that space; it's been on my TODO to poll people about it and take a quick pulse. I love the tech and think we could give it meaningful exposure.

Is this along the lines of what you have in mind - any other active efforts you're aware of that you think we should look into?

[1] https://en.wikipedia.org/wiki/Gemini_(protocol)

freediver 2 hours ago | parent [-]

Relevant https://github.com/kagisearch/smallweb/pull/425