dig1 5 days ago

> We are already underway on the arXiv CE ("Cloud Edition") project... replace the portion of our backends still written in perl and PHP... re-architect our article processing to be fully asynchronous... we can deploy via Kubernetes or services like Google Cloud Run... improve our monitoring and logging facilities

Why do I get the feeling someone from G was there and sold them a fancy cloud story (or they wanted VMs and a reseller sold them Cloud Run)? Anyway, goodbye simplicity and stability, hello exorbitant monthly costs for the same/less service quality. Would love to be wrong.

jstummbillig 5 days ago | parent | next [-]

I noticed that while everyone on hn is quite clever, we are regularly not clever enough to assume that other people in similar settings are just as clever, and to recognize that they have probably spent a lot more time thinking about an issue than we spent skimming the headline.

dogleash 4 days ago | parent | next [-]

> I noticed that while everyone on hn is quite clever

Are we though? I see pockets of world-expert-level knowledge, some reasonable shop talk, and quite a bit of really dumb nonsense that is contradicted by my professional experience. Or just pedestrian-level wrong. I mostly shitpost.

I don't have an opinion about arXiv's hosting, but it does read like one of those projects that includes cleaning up long-standing technical debt that they probably couldn't get funded if not for the flashy goal. The flashy goal will, regardless of its own merits, also be credited for improvements that they should have made all along.

Moto7451 4 days ago | parent | prev | next [-]

I’m not sure of the perspective of the OP but his comment hits home in that a common theme for the past ten years of my career has been “let’s move something to X, Y, and Z because that’s what Google says you should do.”

Note that Google doesn’t outright define an architecture for anyone, but people who worked at Google who come in as the hot hire bring that with them. Ditto for other large employers of the day. One of my mentors had to deal with this when Yahoo was the big deal.

In some cases, when the abstractions are otherwise correct, this hasn't been a big deal for the software projects and companies I've been involved with. It's simply "there's an off-the-shelf, well-supported industry standard we can use, so we can focus on our customer/end goal/value add/etc." Using an alternative Docker runtime "that Google recommends" (aka is suggested by Kubernetes) is just a choice.

Where people get bitten, and then look at this with a squint, is when you work at several places where, on the suggestion of the new big hire from Google/Amazon/Meta/etc., software that runs just fine on a couple of server instances and has a low and stable KTLO budget ends up being broken down into microservices on Kubernetes and suddenly needs a team of three, without providing any additional value.

The worst I’ve experienced is a company that ended up listing the cost of maintaining their fork of Kubernetes and development environment as a material reason for a layoff.

Google's marketing arm has also made deals to help move people to Google Cloud from AWS. Where I am working now, this didn't go to plan on either side, it seems, so we're staying dual-cloud, a situation no one is happy about. Before my time there was an executive on the finance side who believed Google was always the company to bet on and didn't see Amazon as more than a bookstore. Also money. Different type of hubris, different type of pressure, same technical outcome as a CTO that runs on a "well, Google says" game plan.

At the end of the day, Google is a big place and employs a lot of people, so you're going to run into hucksters trying to parlay Google experience into an executive or high-ranking IC role, and they're going to lean on what they know. That has nothing to do with Google itself, but Google's own attempts to pry people away from AWS have about the same flavor, in my personal experience.

kadushka 5 days ago | parent | prev | next [-]

The people on both ends of that conversation (Google vs. Cornell) are clever, but the result will probably be enshittification.

keepamovin 5 days ago | parent | next [-]

Impossible to argue with HN's regression to entropy cynicism hehehe :)

jeffbee 5 days ago | parent | prev [-]

I'm torn on flagging comments that throw out "enshittification". Do you feel that this stands in for actual thought?

kadushka 4 days ago | parent | prev [-]

[flagged]

Analemma_ 4 days ago | parent [-]

I'm going to fling that response right back at you. "Enshittification" is not a generic term for "update I don't like", it describes a specific dynamic that happens when a company inserts itself as a middleman into a two-sided market. arXiv could get worse in other ways, but enshittification in particular can't happen to it, that's a category error.

I think you should offer better thoughts instead of mad-libbing in buzzwords where they don't apply. Enshittification is actually a useful concept, and I don't want it to go the way of "FUD", which had a similar trajectory in the later years of Slashdot where people just reduced it to a useless catch-all phrase whenever Microsoft said anything about anything.

kadushka 4 days ago | parent [-]

I'm using "enshittification" as defined here: https://www.merriam-webster.com/slang/enshittification

I believe there's a real chance of it happening here as a result of this transition. I have personally experienced the results of several similar transitions over the course of my career. What I haven't experienced are problems with arXiv that would motivate such a change. There might be actual problems they are trying to solve, but I still believe things will probably get worse as a result.

specialp 5 days ago | parent | prev | next [-]

I don't think it is that. I work for an org with close ties to arXiv, and just like us they are getting a lot more demand due to AI crawling. As a primary source of information, they see a lot of traffic. They do have technical issues from time to time due to this demand, and I think their stability is down to the exceptional amount of effort they put into keeping it going. They are also getting more submissions and interest.

Kubernetes does add complexity, but it adds a lot of good things too: auto scaling, cycling of unhealthy pods, and failover of failed nodes, to name a few. I know there is a feeling here sometimes that cloud services and orchestrated containers are too much for many applications, but if you are running a very busy site like arXiv, I can't see how running on bare metal is going to be better for your staff and the overall experience. I don't think they are naive and got conned into GCP as the OP alludes to. They are smart people dealing with scaling and tech-debt issues, just like we all end up with at some point in our careers.
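
For example, the "cycling of unhealthy pods" part only works if each service exposes something a liveness/readiness probe can poll. A minimal sketch of such an endpoint, stdlib only; the /healthz path and port are arbitrary choices, not arXiv's actual setup:

    # Minimal health endpoint a Kubernetes liveness/readiness probe could poll.
    # The path and port are illustrative assumptions.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # A real check might also verify DB or queue connectivity.
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"ok")
            else:
                self.send_response(404)
                self.end_headers()

    if __name__ == "__main__":
        # If this stops answering, Kubernetes restarts ("cycles") the pod.
        HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()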

JackC 5 days ago | parent | next [-]

> I work for an org with close ties to arXiv, and just like us they are getting a lot more demand due to AI crawling

Funny, I also work on academic sites (much smaller than arXiv) and we're looking at moving from AWS to bare metal for the same reason. The $90/TB AWS bandwidth exit tariff can be a budget killer if people write custom scripts to download all your stuff; better to slow down than 10x the monthly budget.

(I never thought about it this way, but Amazon charges less to same-day deliver a 1TB SSD drive for you to keep than it does to download a TB from AWS.)
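
Back-of-the-envelope, at roughly that $90/TB (~$0.09/GB) first-tier egress rate; the crawl volume here is a made-up example, not anyone's real numbers:

    # Rough egress-cost sketch; the per-GB rate and crawl volume are assumptions.
    EGRESS_USD_PER_GB = 0.09   # roughly $90/TB at the first AWS egress tier
    crawled_tb = 50            # hypothetical month of scripted bulk downloads

    cost = crawled_tb * 1024 * EGRESS_USD_PER_GB
    print(f"{crawled_tb} TB of unexpected egress is about ${cost:,.0f}")  # ~$4,600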

Imustaskforhelp 5 days ago | parent | next [-]

I don't understand: why don't you use Cloudflare? Don't they have an unlimited egress policy with R2?

It's way more predictable, in my opinion: you just pay a fixed amount per month for your storage. It also helps that it's on the edge, so users would get files much faster than from, let's say, bare metal (unless you are provisioning a multi-server setup, and I think you might be using Kubernetes there, which might be a mess to handle, I guess?).
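
For what it's worth, R2 exposes an S3-compatible API, so reading from it with boto3 is mostly a change of endpoint; a sketch, with the account ID, bucket, key, and credentials as placeholders:

    # Sketch of reading from Cloudflare R2 via its S3-compatible API with boto3.
    # Account ID, bucket, key, and credentials are placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
        aws_access_key_id="<R2_ACCESS_KEY_ID>",
        aws_secret_access_key="<R2_SECRET_ACCESS_KEY>",
    )

    obj = s3.get_object(Bucket="papers", Key="example-paper.pdf")
    data = obj["Body"].read()  # R2 does not meter egress per GB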

sitkack 4 days ago | parent | next [-]

Regardless, if you are delivering PDFs, you should be using a CDN.

If crawling is a problem: (1) it is pretty easy to rate limit crawlers, (2) point them at a requester-pays bucket, and (3) offer a torrent with anti-leech.
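
A minimal token-bucket sketch of point (1); the per-client rate and burst size are arbitrary example values:

    # Per-client token-bucket rate limiter; RATE and BURST are example values.
    import time
    from collections import defaultdict

    RATE = 1.0    # tokens refilled per second
    BURST = 10.0  # bucket capacity

    _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow(client_id: str) -> bool:
        b = _buckets[client_id]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # caller responds 429 or otherwise backs the crawler off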

mcmcmc 5 days ago | parent | prev [-]

Could have something to do with Cloudflare’s abhorrent sales practices.

keepamovin 5 days ago | parent [-]

Can you tell me more? I think my business needs some abhorrent sales practices. That's how it's done, right?

mcmcmc 4 days ago | parent [-]

One example

https://robindev.substack.com/p/cloudflare-took-down-our-web...

ryao 4 days ago | parent | next [-]

I suspect that is the result of this:

https://www.reddit.com/r/sales/comments/134u0mq/cloudflare_c...

They got rid of all of the “underperforming” sales people and hired new ones. That nightmare is the result. I suspect the higher the sales performance, the more likely they were doing things like this.

keepamovin 4 days ago | parent | prev [-]

Wow, okay. That's a little too extreme. Why is Cloudflare acting so insecure when it's that large? Hmm, confused.

ryao 4 days ago | parent | prev [-]

The two are not comparable. The 1TB of transit at Amazon can be subdivided over many recipients, while the solid-state drive is empty and can only be sent to one.

That said, I agree that transit costs are too high.

fc417fc802 4 days ago | parent [-]

So order multiple drives, transfer the data to them, and drop them in the mail to the client. That should always be the higher-bandwidth option, but in a sane world it would also be less cost-effective, given the differences in the amount of energy and the sort of infrastructure involved.

The reason to switch away from fiber should be sustained aggregate throughput, not transfer cost.
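
To put a number on the bandwidth claim, a back-of-the-envelope sketch; the drive count, capacity, and shipping time are made-up example values:

    # "Station wagon" throughput sketch; all inputs are illustrative guesses.
    drives = 10
    tb_per_drive = 8
    shipping_days = 2

    total_bits = drives * tb_per_drive * 1e12 * 8
    seconds = shipping_days * 24 * 3600
    print(f"effective throughput: {total_bits / seconds / 1e9:.1f} Gbit/s")  # ~3.7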

ryao 4 days ago | parent [-]

The other guy was also comparing them based on transfer cost. Given that 1TB can be divided across billions of locations, shipping physical drives is not a feasible alternative to transit at Amazon in general.

fc417fc802 4 days ago | parent [-]

I'm not trying to claim that it's generally equivalent or a viable alternative or whatever to fiber. That would be a ridiculous claim to make.

The original example cited people writing custom scripts to download all your stuff blowing your budget. A reasonable equivalent to that is shipping the interested party a storage device.

More generally, despite the two things being different their comparison can nonetheless be informative. In this case we can consider the up front cost of the supporting infrastructure in addition to the energy required to use that infrastructure in a given instance. The result appears to illustrate just how absurd the current pricing model is. Bandwidth limits notwithstanding, there is no way that the OPEX of the postal service should be lower than the OPEX of a fiber network. It just doesn't make sense.

ryao 4 days ago | parent [-]

That is true. I was imagining the AWS egress costs at my own work, where things go to so many places, with latency requirements, that the idea of sending hard drives is simply not feasible, even with infinite money and pretending the hard drives had the messages prewritten on them at the factory. Delivery would never be fast enough. Infinite money is not feasible either, but it shows that this approach fails in general in more than just the cost dimension.

CharlieDigital 5 days ago | parent | prev | next [-]

Sounds like all they needed was a CDN, if the problem is AI crawlers. Adding auto-scaling compute just increases costs faster.

specialp 5 days ago | parent | next [-]

A CDN is one part of a strategy to deal with load, but it is not the whole solution unless your site is exclusively static content. Their search, APIs, submission pipelines, duplicate detectors, and a lot of other things are not going to be powered by CDNs.
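
As an illustration of the split: static article files can carry cache headers a CDN honors, while search stays dynamic and uncached. Flask and these routes are assumptions for the sketch, not arXiv's actual stack:

    # Static artifacts get edge-cacheable headers; search/API responses do not.
    # Flask and the routes are illustrative assumptions.
    from flask import Flask, jsonify, make_response

    app = Flask(__name__)

    @app.route("/pdf/<paper_id>")
    def pdf(paper_id):
        resp = make_response(b"%PDF- ...")  # stand-in for the real file bytes
        resp.headers["Cache-Control"] = "public, max-age=86400"  # CDN can serve this
        return resp

    @app.route("/search")
    def search():
        resp = make_response(jsonify(results=[]))  # computed per request
        resp.headers["Cache-Control"] = "no-store"  # has to hit the origin
        return resp

    if __name__ == "__main__":
        app.run()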

motorest 5 days ago | parent | next [-]

Thank you for the insight. It's very easy to prescribe simple solutions when we are oblivious to the actual problems being solved.

GTP 5 days ago | parent | prev | next [-]

But, under the assumption that the problem is indeed AI crawlers, of the things you listed only the search would be under increased load.

specialp 5 days ago | parent [-]

I am sure their pages are not entirely static either. The APIs are used by researchers and AI companies too. Also, with search you end up having people trying to use it for RAG with AI. I have dealt with all of this, and there is no one dead-simple solution. AI crawlers are one part of it, but they also have increasing submissions, have to deal with AI-generated spam papers, and all sorts of other stuff. There's always this feeling here on HN that it is dead simple, you just do "X", as if the people dealing with it don't know that.

coliveira 5 days ago | parent | prev | next [-]

All these services can be throttled to deal with AI. I don't see this as a justification. The idea that a service like arXiv should be run as a startup is, simply put, foolish.

Imustaskforhelp 5 days ago | parent | prev [-]

Pardon me, but Cloudflare Workers seems better for this approach.

If we can get past the fact that it requires JavaScript, Cloudflare Workers is literally the best single thing to happen, at least for me. With a single domain, I have done so many personal projects for problems I found interesting, and I built them for literally free, no credit card, no worries whatsoever.

I might ditch writing server-side code in other languages like golang, even though I like golang more, just because Cloudflare Workers exists.

anelson 4 days ago | parent [-]

I too am impressed by Cloudflare Workers’ potential.

However Workers supports WASM so you don’t necessarily have to switch to JavaScript to use it.

I wrote some Rust code that I run in Cloudflare Functions, which is a layer on top of Cloudflare Workers which also supports WASM. I wrote up the gory details if you’re interested:

https://127.io/2024/11/16/generating-opengraph-image-cards-f...

JavaScript is most definitely the path of least resistance but it’s not the only way.

miyuru 5 days ago | parent | prev | next [-]

They already use Fastly:

https://blog.arxiv.org/2023/12/18/faster-arxiv-with-fastly/

5 days ago | parent | prev [-]
[deleted]
evrythingisfin 4 days ago | parent | prev | next [-]

GC seems like a bad choice to me. The GC CLI and tools aren't terrible, but their services in general have historically not been friendly to use, and their support involves average documentation and humans who talk at users instead of serving them. Google as a company is not what it was, either. A lot of their funding is driven by advertising, and between social media, streaming, and LLMs, that seems like it may be starting to dry up.

Azure? Microsoft is still the choice of most IT departments due to its ubiquity and low cost barrier to entry. I personally wouldn't use Azure if I had the choice, because it's easy and cheap on the surface, with hell underneath (except for products based on other things, like AD, which was just a nice LDAP server, or C#, which was modelled after Java).

I'd have gone with AWS. EKS isn't bad to set up and is solid once it's up. As far as the health of Amazon itself, China entering their space hasn't significantly changed their retail business, though eventually they'll likely be in price wars.

The greatest risk to any cloud provider, I think, would be a war that forces national network isolation or a government takeover of infrastructure. And the grid would go down for a good while and the water would stop, so everyone would start migrating, drinking polluted water, then maybe stealing food. At that point, whether or not they chose GC doesn't matter anymore.

surajrmal 4 days ago | parent | next [-]

Everything has pros and cons. Just because their calculus came out different from yours doesn't mean they made the wrong decision for their situation. Hundreds of thousands of organizations have reached similar conclusions to arXiv's.

Google Cloud is profitable these days, and advertising or other income streams drying up would only entice Google to invest further in cloud to ensure they are more diversified. Google isn't going to go away overnight, and cloud is perhaps the least risky business they operate in.

Kwpolska 4 days ago | parent | prev [-]

Isn't Azure noticeably more expensive than AWS?

sitkack 4 days ago | parent | prev | next [-]

They don't need K8s. Containerization, yes, but not K8s.

Funes- 5 days ago | parent | prev [-]

>AI crawling

Can you not reliably block crawlers in this day and age?

masklinn 5 days ago | parent | next [-]

AI crawlers are a plague: they are intentionally badly behaved and designed to be hard to flag without nuking legit traffic. That's why projects like nepenthes exist.
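
The easy half of the problem is the crawlers that do identify themselves; a sketch of filtering on user agent (the bot list is illustrative, not exhaustive). The badly behaved ones spoof browser user agents and rotate IPs, which is exactly why this alone isn't enough:

    # Filters only crawlers that self-identify; spoofed user agents sail past this.
    # The bot list is illustrative, not exhaustive.
    DECLARED_AI_BOTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

    def is_declared_ai_bot(user_agent: str) -> bool:
        ua = user_agent.lower()
        return any(bot.lower() in ua for bot in DECLARED_AI_BOTS)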

specialp 5 days ago | parent | prev | next [-]

You can, to some degree, with Cloudflare and other solutions. But do you want to block them all? AI is a very useful tool for people to discover information and summarize results, especially in scholarly publishing, where one previously had to search on dumb keywords and read loads of abstracts to find the research relevant to their interests. So by blocking AI crawlers and bots completely, you are shutting off what will probably end up being the primary way people use your resource not too long from now. arXiv is a hub of research, and their mission is to make that research freely available to the world.

Imustaskforhelp 5 days ago | parent | prev [-]

Dude, I don't want to sound like a Cloudflare advocate, because my last two comments on this thread are just shilling Cloudflare...

but I think Cloudflare is the answer to this as well. (Sorry if I am being annoying. Cloudflare isn't sponsoring me, I just love their service that much.)

londons_explore 5 days ago | parent | prev | next [-]

This....

I bet arXiv was run on server hardware costing under $10k before...

And now it'll end up costing $10k per month (with free credits from Google, which will eventually go away, and then arXiv will shut down or be forced to go commercial).

perihelions 5 days ago | parent | next [-]

arXiv budgeted about $88,000/year for server costs as of 2019:

(pdf) https://info.arxiv.org/about/reports/arXiv_CY19_midyear.pdf

5 days ago | parent [-]
[deleted]
falcor84 5 days ago | parent | prev [-]

I assume they would still have the serving code they use now, so if they choose to go back to maintaining it on their own hardware, they'll always have that option. It seems they just don't want that anymore.

gapan 5 days ago | parent | next [-]

That will be several years down the line, when everything has bit-rotten to death and nobody in the team remembers how the old setup worked.

londons_explore 4 days ago | parent | prev [-]

That's why they're being encouraged to use Cloud Run and all the other cloud functionality, so that a migration back becomes very hard.

sightbroke 5 days ago | parent | prev | next [-]

> This is a project to re-home all arXiv services from VMs at Cornell to a cloud provider (Google Cloud).

They are already using VMs, but one of the things the project will do is:

> containerize all, or nearly all arXiv services so we can deploy via Kubernetes or services like Google Cloud Run

And further state:

> The modernization will enable:
> - arXiv to expand the subject areas that we cover
> - improve the metadata we collect and make available for articles, adding fields that the research community has requested such as funder identification
> - deal with the problem of ambiguous author identities
> - improve accessibility to support users with impairments, particularly visual impairments
> - improve usability for the entire arXiv community

notpushkin 5 days ago | parent [-]

The containers part I can understand. But why not spin up a tiny Docker Swarm (or k3s/k0s) cluster instead of going straight to Google?

kelnos 5 days ago | parent [-]

Because those other things require more maintenance effort to run.

Getting creative is often just a pain in the ass. Doing the standard things, walking the well-trod path, is generally easier to do, even if it may not be the cheapest or most hardware/software-efficient thing to do.

elif 5 days ago | parent | prev | next [-]

I think this is more likely a move to prevent Cornell funding from getting tied up with dictates about what gets published.

But in all likelihood someone was probably just like "we're tired of doing ops on two-decade-old stacks".

johann8384 5 days ago | parent [-]

It sounds like they are just using GCP. It doesn't really change that.

whatever1 5 days ago | parent | next [-]

Google Cloud can easily move the instance to a region that is not hostile to science/free speech.

elif 5 days ago | parent | prev [-]

Yea, but GCP was state of the art 20 years ago; PHP+Perl was already crufty.

ordersofmag 5 days ago | parent [-]

So state of the art it wasn't even available yet (preview launch was in 2008).

elif 4 days ago | parent [-]

Sounds about right. I remember first hearing about it in a talk Doug Crockford gave at my university around that time. It blew my mind. I thought it was like gcc for the Internet. It's kind of wild that in the interim we have experienced the complete rise and fall of MongoDB and Node.js, and even today the React paradigm: all expressions of this tiny little functional scripting language.

johann8384 5 days ago | parent | prev | next [-]

Moving to K8s and adding additional instrumentation just sounds like some new folks took over or joined the project and are doing some renovations. Seems like pretty standard stuff; it doesn't really seem as sinister as you make it out to be.

moralestapia 5 days ago | parent | prev [-]

>we can deploy via Kubernetes

Oh noes ... they got scammed