saltysalt 4 hours ago

I built my own web search index on bare metal, index now up to 34m docs: https://greppr.org/

People rely too much on other people's infra and services, which can be decommissioned anytime. The Google Graveyard is real.

orf 4 hours ago | parent | next [-]

Number of docs isn’t the limiting factor.

I just searched for “stackoverflow” and the first result was this: https://www.perl.com/tags/stackoverflow/

The actual Stackoverflow site was ranked way down, below some weird twitter accounts.

saltysalt 4 hours ago | parent [-]

I don't weight home pages in any way yet to bump them up, it's just raw search on keyword relevance.
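A home-page bump of the kind described here could be sketched roughly as follows. This is only an illustration, not how greppr works: the `(url, score)` result shape, the `home_boost` factor, and the "empty path means home page" rule are all assumptions.

```python
from urllib.parse import urlparse


def is_home_page(url: str) -> bool:
    """Treat a URL with no path and no query string as a site's home page (assumed rule)."""
    parts = urlparse(url)
    return parts.path in ("", "/") and not parts.query


def rerank(results, home_boost=2.0):
    """Multiply the keyword score of home pages by home_boost, then sort descending.

    `results` is a list of (url, keyword_score) pairs; both the shape and the
    boost value are hypothetical.
    """
    boosted = [
        (url, score * home_boost if is_home_page(url) else score)
        for url, score in results
    ]
    return sorted(boosted, key=lambda item: item[1], reverse=True)


results = [
    ("https://www.perl.com/tags/stackoverflow/", 1.2),
    ("https://stackoverflow.com/", 1.0),
]
for url, score in rerank(results):
    print(f"{score:.2f}  {url}")
```

A fixed multiplier like this is easy to game on its own, so it would probably be one signal among several rather than a fix in itself.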

dredmorbius 2 hours ago | parent | next [-]

Google's entire (initial) claim to fame was "PageRank", a name referring both to the ranking of pages and to co-founder Larry Page. It strongly prioritised a relevance attribute over the raw keyword matches which then-popular alternatives such as Alta Vista, Yahoo, AskJeeves, Lycos, Infoseek, HotBot, etc., relied on, or over the rather more notorious paid-rankings schemes in which SERP order was effectively sold. When it was first introduced, Google Web Search was absolutely worlds ahead of any competition. I remember this well, having used the alternatives previously and adopted Google quite early (1998/99).

Even with PageRank, result prioritisation is highly subject to gaming. Raw keyword search is far more so (keyword stuffing and other shenanigans), especially once a search engine becomes popular and catches the attention of publishers.

Google now applies additional ordering factors as well, and of course has come to dominate its SERPs with paid, advertised listings, which are all but impossible to distinguish from "organic" search results.

(I've not used Google Web Search as my primary tool for well over a decade, and probably only run a few searches per month. DDG is my primary; I occasionally look at a few others, including Kagi and Marginalia.)

<https://en.wikipedia.org/wiki/PageRank>

"The anatomy of a large-scale hypertextual Web search engine" (1998) <http://infolab.stanford.edu/pub/papers/google.pdf> (PDF)

Early (1990s) search engines: <https://en.wikipedia.org/wiki/Search_engine#1990s:_Birth_of_...>.
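For readers who haven't seen it, the core idea of PageRank can be sketched as a power iteration over the link graph. This is a toy version under stated assumptions, not Google's implementation: the `links` dict, the iteration count, and the handling of pages without outlinks are illustrative choices, and the damping value of 0.85 is the one usually quoted from the 1998 paper.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank by power iteration.

    `links` maps each page to the list of pages it links to (hypothetical input).
    Pages with no outlinks spread their rank evenly over all pages.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: distribute its rank over the whole graph.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


links = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this tiny graph "c" ends up highest because two pages link to it, and "b" lowest because nothing links to it, regardless of what keywords any of the pages contain.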

saltysalt 2 hours ago | parent [-]

PageRank was an innovative idea in the early days of the Internet when trust was high, but yes it's absolutely gamed now and I would be surprised if Google still relies on it.

Fair play to them though, it enabled them to build a massive business.

snowwrestler 28 minutes ago | parent | next [-]

Google’s biggest search signal now is aggregate behavioral data reported from Chrome. That pervasive behavioral surveillance is the main reason Apple has never allowed a native Chrome app on iOS.

It’s also why it is so hard to compete with Google. You guys are talking about techniques for analyzing the corpus of the search index. Google does that and has a direct view into how millions of people interact with it.

marginalia_nu 2 hours ago | parent | prev [-]

Anchor text information is arguably a better source for relevance ranking in my experience.

I publish exports of the ones Marginalia is aware of[1] if you want to play with integrating them.

[1] https://downloads.marginalia.nu/exports/ grab 'atags-25-04-20.parquet'

dredmorbius 39 minutes ago | parent | next [-]

Though I'd think that you'd want to weight unaffiliated sites' anchor text to a given URL much higher than an affiliated site's.

"Affiliation" is a tricky term itself. Content farms were popular in the aughts (though they seem to have largely subsided), firms such as Claria and Gator. There are chumboxes (Outbrain, Taboola), and of course affiliate links (e.g., to Amazon or other shopping sites). SEO manipulation is its own whole universe.

(I'm sure you know far more about this than I do, I'm mostly talking at other readers, and maybe hoping to glean some more wisdom from you ;-)
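One way to picture the weighting suggested above is the sketch below. It is hypothetical throughout: the `(source_url, target_url, anchor_text)` tuples are not the schema of the Marginalia export, "affiliated" is crudely approximated as "same hostname", and `same_site_weight` is an arbitrary value.

```python
from collections import defaultdict
from urllib.parse import urlparse


def anchor_scores(anchors, query, same_site_weight=0.1):
    """Score target URLs by how often query terms appear in anchor text pointing at them.

    Anchors from the target's own hostname count for much less than anchors
    from other sites (a crude stand-in for "affiliation").
    """
    terms = set(query.lower().split())
    scores = defaultdict(float)
    for source_url, target_url, text in anchors:
        overlap = len(terms & set(text.lower().split()))
        if overlap == 0:
            continue
        same_site = urlparse(source_url).hostname == urlparse(target_url).hostname
        weight = same_site_weight if same_site else 1.0
        scores[target_url] += weight * overlap
    return dict(scores)


anchors = [
    ("https://blog.example/post", "https://stackoverflow.com/", "stackoverflow"),
    ("https://stackoverflow.com/about", "https://stackoverflow.com/", "stackoverflow home"),
    ("https://www.perl.com/a", "https://www.perl.com/tags/stackoverflow/", "tags"),
]
print(anchor_scores(anchors, "stackoverflow"))
```

Hostname equality will miss most real affiliation (content farms, link networks, affiliate schemes), which is exactly why the term is tricky.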

marginalia_nu 14 minutes ago | parent [-]

Oh yeah, there's definitely room for improvement in that general direction. Indexing anchor texts is much better than PageRank, but in isolation, it's not sufficient.

I've also seen some benefit from fingerprinting the network traffic the websites make using a headless browser, to identify which ad networks they load. Very few spam sites have no ads, since there wouldn't be any economy in that.

e.g. https://marginalia-search.com/site/www.salon.com?view=traffi...

The full data set of DOM samples + recorded network traffic is in an enormous sqlite file (400GB+), and I haven't yet worked out any way of distributing the data. Though it's in the back of my mind as something I'd like to solve.

saltysalt 2 hours ago | parent | prev [-]

Very interesting, and it is very kind of you to share your data like that. Will review!

pjc50 an hour ago | parent | prev | next [-]

Confluence search does this, for our intranet. As a result it's barely usable.

Indexing is a nice compact CS problem; not completely simple for huge datasets like the entire internet, but well-formed. Ranking is the thing that makes a search engine valuable. Especially when faced with people trying to game it with SEO.

orf 3 hours ago | parent | prev [-]

Sure, but the point is results are not relevant at all?

It’s cool though, and really fast

saltysalt 3 hours ago | parent | next [-]

I'll work on that adjustment, it's fair feedback thanks!

direwolf20 3 hours ago | parent [-]

Unfortunately this is the bulk of search engine work. Recursive scraping is easy in comparison, even with CAPTCHA bypassing. You either limit the index to only highly relevant sites (as Marginalia does) or you must work very hard to separate the spam from the ham. And spam in one search may be ham in another.

saltysalt 2 hours ago | parent [-]

I limit it to highly relevant curated seed sites, and don't allow public submissions. I'd rather have a small high-quality index.

You are absolutely right, it is the hardest part!

globular-toast 2 hours ago | parent | prev [-]

What do you mean they're not relevant? The top result you linked contained the word stackoverflow, didn't it? It's showing you exactly what you searched for. Why would you need a search engine at all if you already know the name of the thing? Just type stackoverflow.com into your address bar.

I feel like Google-style "search" has made people really dumb and unable to help themselves.

orf 2 hours ago | parent [-]

The query is just to highlight that relevance is a complex topic. Few people would consider "perl blog posts from 2016 that have the stack overflow tag" as the most relevant result for that query.

1718627440 an hour ago | parent | prev | next [-]

You should consider filtering by input language. Showing the same Wikipedia article in different languages is not helpful when I am searching in English. You could also unify entries by URL: it currently shows the same URL several times, just with different publish dates. That is interesting and might be useful, but should maybe be behind a toggle, as it is confusing at first.
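Both suggestions amount to a post-processing step on the result list, roughly like this sketch. The `url`, `lang`, and `published` fields are assumed for illustration and say nothing about how the index actually stores documents; dates are assumed to be ISO strings so they compare correctly as text.

```python
def filter_and_dedupe(results, lang="en"):
    """Keep only results in the query language, and one entry per URL.

    When a URL appears several times, the entry with the latest `published`
    date wins. Output order follows the first appearance of each URL.
    """
    best = {}
    for result in results:
        if result["lang"] != lang:
            continue
        seen = best.get(result["url"])
        if seen is None or result["published"] > seen["published"]:
            best[result["url"]] = result
    return list(best.values())


results = [
    {"url": "https://en.wikipedia.org/wiki/Perl", "lang": "en", "published": "2024-01-10"},
    {"url": "https://de.wikipedia.org/wiki/Perl", "lang": "de", "published": "2024-02-01"},
    {"url": "https://en.wikipedia.org/wiki/Perl", "lang": "en", "published": "2025-03-05"},
]
for result in filter_and_dedupe(results):
    print(result["url"], result["published"])
```

The older snapshots could still be kept around and shown behind the toggle rather than dropped.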

jfindley 2 hours ago | parent | prev | next [-]

Unfortunately the index is the easy part. Transforming user input into a series of tokens which get used to rank possible matches and return the top N, based on likely relevance, is the hard part and I'm afraid this doesn't appear to do an acceptable job with any of the queries I tested.

There's a reason Google became so popular as quickly as it did. It's even harder to compete in this space nowadays, as the volume of junk and SEO spam is many orders of magnitude worse as a percentage of the corpus than it was back then.
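To make the "tokens, rank, top N" pipeline concrete, here is a deliberately naive sketch using TF-IDF-style scoring. The tokenizer, the scoring formula, and the in-memory `docs` dict are all assumptions chosen for brevity; real engines use far more signals than this.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_n(query, docs, n=10):
    """Return the n best (doc_id, score) pairs for the query.

    Score is the sum over query terms of term frequency times a smoothed
    inverse document frequency. `docs` maps doc_id to raw text (hypothetical).
    """
    doc_tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    total = len(docs)
    scores = {}
    for term in set(tokenize(query)):
        doc_freq = sum(1 for counts in doc_tokens.values() if term in counts)
        if doc_freq == 0:
            continue
        idf = math.log(1 + total / doc_freq)
        for doc_id, counts in doc_tokens.items():
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return heapq.nlargest(n, scores.items(), key=lambda kv: kv[1])


docs = {
    "a": "perl tags stackoverflow",
    "b": "stackoverflow questions stackoverflow answers",
    "c": "cooking recipes",
}
print(top_n("stackoverflow", docs, n=2))
```

Note that document "b" wins simply by repeating the query term, which is the keyword-stuffing weakness of raw keyword ranking in miniature.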

1718627440 an hour ago | parent | prev | next [-]

The search input on the results page doesn't work; you always need to return to the start page, on which the browser history is disabled. That's just confusing behaviour.

tosti an hour ago | parent | prev | next [-]

This is pretty cool. Don't let the naysayers stop you. Taking a stab at beating Google at their core product is bravery in my book. The best of luck to you!

renegat0x0 4 hours ago | parent | prev | next [-]

I also made something for my own search needs. It's just an SQLite table of domains and places. I have your search engine there also ;-)

https://github.com/rumca-js/Internet-Places-Database

Demo for the most important ones: https://rumca-js.github.io/search

saltysalt 4 hours ago | parent [-]

Thank you, will check it out!

johnofthesea 3 hours ago | parent | prev | next [-]

I tested it using a local keyword, as I normally do, and it took me to a Wikipedia page I didn’t know existed. So thanks for that.

saltysalt 3 hours ago | parent [-]

It will throw up weird and interesting results sometimes ;-)

lolive 2 hours ago | parent | prev [-]

Lol, a GooglePlus URL was mentioned on a webpage I browsed this week. #blastFromThePast