ArXiv Declares Independence from Cornell (science.org)
469 points by bookstore-romeo 9 hours ago | 157 comments
frankling_ 6 hours ago | parent | next [-]

The recent announcement to reject review articles and position papers already smelled like a shift towards a more "opinionated" stance, and this move smells worse.

The vacuum that arXiv originally filled was one of a glorified PDF hosting service with just enough of a reputation to allow some preprints to be cited in a formally published paper, and with just enough moderation to not devolve into spam and chaos. It has also been instrumental in pushing publishers towards open access (i.e., to finally give up).

Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.

In my view, arXiv fulfills its function better the less power it has as an institution, and I thus have exactly zero trust that the split from Cornell is driven by that function. We've seen the kind of appeasement prose from their statement and FAQ [1] countless times before, and it's now time for the usual routine of snapshotting the site to watch the inevitable amendments to the mission statement.

"What positive changes should users expect to see?" - I guess the negative ones we'll have to see for ourselves.

[1] https://tech.cornell.edu/arxiv/

aimarketintel a minute ago | parent | next [-]

This is great news for anyone building tools on top of arXiv data. The API (export.arxiv.org/api/) is one of the best free academic data sources — structured Atom feed with full abstracts, authors, categories, and publication dates.

I've been using it as one of 9 data sources in a market research tool — arXiv papers are a strong leading indicator of where an industry is heading. Academic research today often becomes commercial products in 2-3 years.
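For readers who haven't used it, the Atom feed mentioned above can be parsed with nothing but the Python standard library. This is only a minimal sketch: the `query` endpoint and its `search_query`, `start`, and `max_results` parameters are the documented arXiv API, but the helper names `build_query_url` and `parse_feed`, and the sample feed with its placeholder paper, are made up for illustration.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace
API_URL = "http://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=10):
    """Build an arXiv API query URL (hypothetical helper)."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_feed(xml_text):
    """Extract title, authors, categories, date, and abstract from each entry."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.findall(ATOM + "entry"):
        papers.append({
            # Titles and abstracts can wrap across lines; collapse the whitespace.
            "title": " ".join(entry.findtext(ATOM + "title", "").split()),
            "authors": [a.findtext(ATOM + "name", "") for a in entry.findall(ATOM + "author")],
            "categories": [c.get("term") for c in entry.findall(ATOM + "category")],
            "published": entry.findtext(ATOM + "published", ""),
            "abstract": " ".join(entry.findtext(ATOM + "summary", "").split()),
        })
    return papers


# Made-up sample in the shape of an API response. A real feed would be fetched
# with urllib.request.urlopen(build_query_url("cat:cs.LG")).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Example
      Paper Title</title>
    <summary>A placeholder abstract.</summary>
    <published>2024-01-01T00:00:00Z</published>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <category term="cs.LG"/>
  </entry>
</feed>"""

papers = parse_feed(SAMPLE_FEED)
```

Parsing from a string keeps the sketch runnable offline; anything hitting the real endpoint should be rate-limited.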

queuebert an hour ago | parent | prev | next [-]

> Unfortunately, over the years, arXiv has become something like a "venue" in its own right, ...

In my experience as a publishing scientist, this is partly because publishing with "reputable" journals is an increasingly onerous process, with exorbitant fees, enshittified UIs, and useless reviews. The alternative is to upload to arXiv and move on with your life.

groundzeros2015 an hour ago | parent [-]

That’s true. But that’s separate from its use in ML and blockchain circles as a form of marketing that trades on academic appearances.

jjk166 34 minutes ago | parent [-]

That sounds more like an issue of certain fields having crappy standards because the people in those fields benefit from crappy standards than an issue with the site they happen to host papers on.

groundzeros2015 19 minutes ago | parent [-]

I don’t buy “some fields are just more honorable”. Everyone uses publishing for personal gain.

But yes it’s a people problem, not an arxiv problem.

Aurornis 22 minutes ago | parent | prev | next [-]

> and with just enough moderation to not devolve into spam and chaos

arXiv has become a target for grifters in other domains like health and supplements. I’ve seen several small-scale health influencers who ChatGPT some “papers” and then upload them to arXiv, then cite arXiv as proof of their “published research”. It’s not fooling anyone who knows how research works, but it’s very convincing to an average person who thinks that they’re doing the right thing when they follow sources that have done academic research.

I’ve been surprised at how bad and obviously grifty some of the documents I’ve seen on arXiv have become lately. Is there any moderation, or is it a free-for-all as long as you can get an invite?

hijodelsol 5 hours ago | parent | prev | next [-]

I came here to say something similar. As someone who works in a field that applies machine learning but is not purely focused on it, I interact with people who think that arXiv is the only relevant platform and that they don't need to submit their work to any journal, as well as people who still think that preprints don't count at all and that data isn't published until it's printed in an academic journal. It can feel like a clash of worlds.

I think both sides could learn from the other. In the case of ML, I understand the desire to move fast and that average time to publication of 250-300 days in some of the top-tier journals can feel like an unnecessary burden. But having been on both sides of peer review, there is value to the system and it has made for better work.

Not doing any of it follows the same spirit as not benchmarking your approach against more than maybe one alternative, and even that only as an afterthought. Or benchmaxxing but not exploring the actual real-world consequences, time and cost trade-offs, etc.

Now, is academic publishing perfect? Of course not, very very far from it. It desperately needs to be reformed to keep it economically accessible and time-efficient for authors, editors, and peer reviewers alike, to prevent the "hot topic of the day" from dominating journals, and to make sure that peer review aligns with the needs of the community and actually improves the quality of the work, rather than having "malicious peer review" to get some citations or pet peeves in.

Given the power that the ML field holds and the interesting experiments with open review, I would wish for the field to engage more with the scientific system at large and perhaps try to drive reforms and improve it, rather than completely abandoning it and treating a PDF hosting service as a journal (ofc, preprints would still be desirable and are important, but they can not carry the entire field alone).

bonoboTP 4 hours ago | parent [-]

Simply anticipating basic pushback from reviewers makes sure that you do a somewhat thorough job. Not 100% thorough, and the reviews are sometimes frivolous and lazy and stupid. But just knowing that what you put out there has to pass the admittedly noisily gatekept gate of peer review improves papers overall, in my estimation. There is also a negative side, because people try to hide limitations and honest assessments, and cherry-pick and curate their tables more, in anticipation of knee-jerk reviewers. But overall I think that without any peer review, author culture would become much more lax and bombastic and generally trend toward engagement bait and social-media-attention-optimized stuff.

The current arrangement, where people write a paper with reviewers in mind, upload it to arXiv before the review concludes, and keep it on arXiv even if rejected, is a nice balance. People get to form their own opinion on it, but there is also enough self-imposed quality control, just from wanting it to pass peer review, that even if it doesn't pass, it is still better than if people had written it without caring about or anticipating peer review. And this works because people are somewhat incentivized to get peer-reviewed official publications too. But being rejected is not the end of the world either, because people can already read it and build on it via arXiv.

bjourne 2 hours ago | parent [-]

I really am not sure about that: https://biologue.plos.org/wp-content/uploads/sites/7/2020/05...

The problem is that "optimizing for peer-review" is not the same thing as optimizing for quality. E.g., I like to add a few tongue-in-cheeks to entertain the reader. But then I have to worry endlessly about anal-retentive reviewers who refuse to see the big picture.

stared 4 hours ago | parent | prev | next [-]

> arXiv fulfills its function better the less power it has as an institution

It is an interesting instance of the rule of least power, https://en.wikipedia.org/wiki/Rule_of_least_power.

fidotron 2 hours ago | parent [-]

The irony of the TBL quotes there is that the entire problem with the semantic web is the ontological tarpit that results from the excessive expressive power of a general triple store.

PaulHoule 2 hours ago | parent [-]

Well, I’d argue that many things in the semweb are not expressive enough and lead to the misunderstandings we have.

People think, for instance, that RDFS and OWL are meant to SHACL people into bad and over-engineered ontologies. The problem is these standards add facts and don’t subtract facts. At risk of sounding like ChatGPT: it’s a data transformation system, not a validation system.

That is, you’re supposed to use RDFS to say something like

  ?s :myTermForLength ?o -> ?s :yourTermForLength ?o .
The point of the namespace system is not to harass you, it is to be able to suck in data from unlimited sources and transform it. Trouble is it can’t do the simple math required to do that for real, like

  ?s :lengthInFeet ?o -> ?s :lengthInInches 12*?o .
Because if you were trying OWL-style reasoning over arithmetic you would run into Kurt Gödel kinds of problems. Meanwhile you can’t subtract facts that fail validation, and you can’t subtract facts that you just don’t need in the next round of processing. It would have made sense to promote SHACL first instead of OWL because of garbage in, garbage out: you are not going to reason successfully unless you have clean data… but what the hell do I know, I’m just an applications programmer who models business processes enough to automate them.
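To make the "adds facts, never subtracts" point concrete, here is a toy forward-chaining sketch in plain Python. It is not an RDFS or OWL implementation and uses no RDF library; the tuple representation of triples and the names `apply_property_rules` and `feet_to_inches` are assumptions for illustration only. The first function mimics the property-mapping rule above. The second does the feet-to-inches arithmetic as ordinary code, which is the step the rule language cannot express.

```python
def apply_property_rules(triples, rules):
    """Apply rules of the form `?s p ?o -> ?s q ?o` until nothing new is inferred.

    Facts are only ever added; no existing triple is removed.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(facts):
            q = rules.get(p)
            if q is not None and (s, q, o) not in facts:
                facts.add((s, q, o))
                changed = True
    return facts


def feet_to_inches(triples):
    """Derive :lengthInInches from :lengthInFeet outside the rule language."""
    derived = {(s, ":lengthInInches", 12 * o)
               for s, p, o in triples if p == ":lengthInFeet"}
    return set(triples) | derived


# Hypothetical data: two resources, described with different vocabularies.
data = {(":plank", ":myTermForLength", 8), (":rope", ":lengthInFeet", 3)}
rules = {":myTermForLength": ":yourTermForLength"}

inferred = apply_property_rules(data, rules)
converted = feet_to_inches(inferred)
```

Note that `inferred` is always a superset of `data`: dropping triples that fail validation would have to be a separate pass, which is the gap SHACL-style validation is meant to fill.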

Similarly the problem of ordered collections has never been dealt with properly in that world. PostgreSQL, N1QL and other post-relational and document DB languages can write queries involving ordered collections easily. I can write rather unobvious queries by hand to handle a lot of cases (wrote a paper about it) but I can’t cover all the cases, and I know back in the day I could write SPARQL queries much better than the average RDF postdoc or professor.

As for underengineering, Dublin Core came out when I worked at a research library and it just doesn’t come close in capability to MARC from 1970. Larry Masinter over at Adobe had to hack the standard to handle ordered collections because… the authors of a paper sure as hell care what order you write their names in. And it is all like that: RDF standards neglect basic requirements that they need to be useful and then all the complex/complicated stuff really stands out. If you could get the basics done maybe people would use them but they don’t.

light_hue_1 3 hours ago | parent | prev | next [-]

> Unfortunately, over the years, arXiv has become something like a "venue" in its own right, particularly in ML, with some decently cited papers never formally published and "preprints" being cited left and right. Consider the impression you get when seeing a reference to an arXiv preprint vs. a link to an author's institutional website.

This just isn't true. arXiv is not a venue. There's no place that gives you credit for arXiv papers. No one cares if you cite an arXiv paper or some random website. The vast vast majority of papers that have any kind of attention or citations are published in another venue.

contubernio 2 hours ago | parent [-]

A Fields medal was awarded based mainly on this paper never published elsewhere: https://arxiv.org/abs/math/0211159

auggierose 15 minutes ago | parent [-]

I think there is a misunderstanding here. Does arXiv count as a publication? Yes, pretty much anything that gives you a DOI does, for example Zenodo. Does it function as a reputable anything? No.

The paper you link to counts as a publication, but its reputation stands on its own; it has nothing to do with arXiv as a venue. Ideally, that's how it would be for all papers, but it isn't: just by publishing in certain venues, your paper automatically gets a certain amount of reputation depending on the venue.

ph4rsikal 5 hours ago | parent | prev [-]

My observation is that research, especially in AI has left universities, which are now focusing their research to a lesser degree on STEM. It appears research is now done by companies like Meta, OpenAI, Anthropic, Tencent, Alibaba, among many others.

PaulHoule 26 minutes ago | parent | next [-]

That's a specific field at a very specific time. In general there is a difference between research and development, you're going to expect the early work to be done in academia but the work to turn that into a product is done by commercial organizations.

You get ahead as an academic computer scientist, for instance, by writing papers, not by writing software. Now there really are brilliant software developers in academic CS, but most researchers write something that kinda works and give a conference talk about it -- and that's OK because the work to make something you can give a talk about is probably 20% of the work it would take to make something you can put in front of customers.

Because of that there are certain things academic researchers really can't do.

As I see it, my experience in getting a PhD and my experience in startups are essentially the same: "how do you make doing things nobody has ever done before routine?" Talk to people in either culture and you see the PhD students are thinking about either working in academia or at a very short list of big prestigious companies, and people at startups are sure the PhDs are too pedantic about everything.

It took me a long time of looking at other people's side projects, which are usually "I want to learn programming language X" or "I want to rewrite something from Software Tools in Rust", to realize just how foreign that kind of creative thinking is to people -- I've seen it for a long time that a side project is not worth doing unless: (1) I really need the product or (2) I can show people something they've never seen before, or better yet both. These sound different, but if something doesn't satisfy (2) you can usually satisfy (1) off the shelf. It just amazes me how many type (2) things stay novel even after 20 years of waiting.

bonoboTP 4 hours ago | parent | prev [-]

Universities (outside a few) just have much weaker PR machines so you never hear what they do. Also their work is not user facing products so regular people, even tech power users won't see them.

0x3f 2 hours ago | parent [-]

Not sure about that. How would a university test scaling hypotheses in AI, for example? The level of funding required is just not there, as far as I know.

oscaracso 2 hours ago | parent | next [-]

Universities are also not suited to test which race car is the fastest, but that does not obviate the need for academic research in mechanical engineering.

0x3f 2 hours ago | parent [-]

Perhaps but the fastest race car is not possibly marshalling in the end of human involvement in science, so you might consider these of considerably different levels of meriting the funding.

oscaracso an hour ago | parent [-]

>marshalling in the end of human involvement in science

Good riddance! But not relevant in the least.

0x3f an hour ago | parent [-]

Impact size is not relevant to funding allocation?

bonoboTP an hour ago | parent | prev | next [-]

There are a million other research things to do besides running huge pretraining runs and hyperparam grid search on giant clusters. To see what, you can start with checking out the best paper and similar awards at neurips, cvpr, iccv, iclr, icml etc.

rsfern 2 hours ago | parent | prev [-]

This issue of accessibility is widely acknowledged in the academic literature, but it doesn’t mean that only large companies are doing good research.

Personally I think this resource mismatch can help drive creative choice of research problems that don’t require massive resources. To misquote Feynman, there’s plenty of room at the bottom

swiftcoder 4 hours ago | parent | prev | next [-]

> raised concerns about the proposed $300,000 salary for arXiv’s new CEO, saying it seemed high

Is a mid-to-high engineering salary outlandish for a CEO of what is likely to be a fairly major non-profit? Even non-profits have to be somewhat competitive when it comes to salary, and the ideal candidate is likely someone who would be balancing this against a tenured position at a major university

mort96 3 hours ago | parent | next [-]

Salaries in the US are so bonkers. Everywhere else outside of the US, $300,000 is an outlandishly high salary. To call it "mid to high" is insane.

swiftcoder 3 hours ago | parent | next [-]

Even in the states, it’s more a distortion caused by the big tech centres. A software engineer in Ohio doesn’t command that kind of salary, but in San Francisco or Seattle that’ll buy you a moderately-senior engineer.

And while academic salaries are generally not great, tenured professors at big universities tend to make a fair bit (plus a lot more vacation time and perks than is normal in the US)

philipallstar an hour ago | parent [-]

It's also caused by progressive tax rates. People take harder jobs based on net wage, not gross wage, so gross wage has to compensate.

segmondy 10 minutes ago | parent | prev | next [-]

Everyone outside the US doesn't deal with USD. Your comment is bonkers. Read up on purchasing power. All locations are not equal.

ZpJuUuNaQ5 35 minutes ago | parent | prev | next [-]

>Salaries in the US are so bonkers.

Sure, but the cost of living there is significantly higher as well. Anyway, I can hardly even comprehend these kinds of sums, though I am a bit of an outlier, as I earn around $27,700 as an SWE in Europe, which is low even by the standards of companies in my own country.

groundzeros2015 44 minutes ago | parent | prev | next [-]

Note that you are seeing an explicit tradeoff of different economic systems.

0x3f 3 hours ago | parent | prev | next [-]

Not everywhere. Switzerland exists. Also cost of living is a thing so if anything US/CH just ramp up to match that. The rest of Europe has high CoL but terrible salaries. Asia has bad salaries but low CoL (on average).

mort96 2 hours ago | parent [-]

According to swissdevjobs.ch[1], the top 10% salary for a senior software developer in Switzerland is 135,000 Swiss francs; that's roughly $170,000 per year.

So if this is correct, then even in Switzerland, it seems like $300,000 per year would be an obscenely high salary for a senior developer.

[1]: https://swissdevjobs.ch/salaries/all/all/Senior

0x3f 2 hours ago | parent [-]

Well first of all it's a CEO position, not an SWE :)

Even if we scope it to SWE, I don't think that's far off the US percentiles.

In London I imagine the top 10% SWE is not even 100k GBP. In Germany even worse.

mort96 2 hours ago | parent [-]

I responded to the idea that $300,000/year is a "mid-to-high engineering salary". CEO salaries are absurdly high everywhere.

0x3f 2 hours ago | parent [-]

Oh right, well it depends on CoL doesn't it? You can reframe European salaries as 'obscene' by world standards too. Both the US and Europe have totally broken and unaffordable housing markets, for example, but at least the Bay Area compensates with salary. I would say that relative to costs it's more that other salaries are obscenely low, if anything. People in Europe should be rioting, but unfortunately only the home owners are politically active.

mort96 2 hours ago | parent [-]

Does cities like San Francisco not have janitors? Waiters? Food delivery drivers? Or do those jobs command a six-figure salary too? If they can live comfortably in the city on a five-figure salary, maybe the argument that "cost of living is so high in SF that you can't live without a $300,000/year salary" is just a little bit overblown?

I can not imagine what one could possibly need $300,000 per year for unless an apartment costs like $200,000 per year.

swiftcoder 21 minutes ago | parent | next [-]

> Does cities like San Francisco not have janitors? Waiters?

When I used to visit the Meta campus in Menlo Park, the QA folk I worked with were commuting 2 hours each way just to be able to afford housing. I've no idea how far away the janitorial staff must have lived to do the same

throw-the-towel an hour ago | parent | prev | next [-]

> I can not imagine what one could possibly need $300,000 per year for unless an apartment costs like $200,000 per year.

Being able to afford unpredictable expenses and not have it bankrupt you. In the US, that would include healthcare. Everywhere in the world, that would be useful if you were laid off.

mort96 39 minutes ago | parent [-]

To build an emergency fund, you just need an income that's a bit higher than your expenses. If you earn $60,000 after tax per year, and spend $50,000 per year, you have a decent $10,000 emergency fund after one year and a massive $100,000 emergency fund after a decade. You don't need $300,000 per year to save.
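The arithmetic above, as a quick sketch. It assumes no interest, no inflation, and no change in income or spending, and the function name is made up for illustration.

```python
def emergency_fund(net_income, expenses, years):
    """Savings accumulated after `years`, ignoring interest and inflation."""
    return (net_income - expenses) * years


after_one_year = emergency_fund(60_000, 50_000, 1)
after_a_decade = emergency_fund(60_000, 50_000, 10)
```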

0x3f 2 hours ago | parent | prev [-]

You get by on a low salary by living with multiple people in the same apartment. Or you live far away and commute. Or both.

Not really a tenable long-term situation for a senior employee with plans to start a family. Family homes of decent size and area are literally millions of dollars.

mort96 2 hours ago | parent [-]

I guess I don't understand why programmers somehow deserve a better life than other people. Janitors deserve to start families too, don't they?

throw-the-towel an hour ago | parent | next [-]

Usually this kind of argument leads to punishing the programmers, not lifting up the janitors.

mort96 27 minutes ago | parent [-]

That's kind of two sides of the same coin, isn't it? The cost of living is so high in part because so many have ridiculously high salaries, isn't it?

swiftcoder 19 minutes ago | parent [-]

> The cost of living is so high in part because so many have ridiculously high salaries

The bigger problem in the SF area is that a bunch of folks who owned property before the gold rush have ended up real-estate-rich, and formed a voting bloc that actively prevents the construction of new housing (on the basis that it might devalue their accidental real estate investment)

0x3f 2 hours ago | parent | prev [-]

It's not about deserving, programmers just have enough market power to be able to choose to go elsewhere. Janitors and other more fungible employees do not.

Besides, I did already say that everyone else was underpaid relative to costs. But that's not unique to the Bay Area. Cost of housing relative to income is terrible in almost all of the major European cities too.

Once cities become wealthy enough to develop a home owning class, they seem to cease being able to provision adequate housing supply in general.

dev_l1x_be 3 hours ago | parent | prev | next [-]

So is the living cost. Insurance, housing, etc. A better comparison is PPP.

carlosjobim an hour ago | parent [-]

Living costs are similarly high in many places that have nowhere near the salaries of the US.

It's still the land of opportunities. It's easier to find ways to reduce your living costs than ways to increase your salary.

HappyPanacea 3 hours ago | parent | prev [-]

Yes the obvious play is to move human labor to cheaper countries like France (including CEO of course).

0x3f 3 hours ago | parent | next [-]

The net salary in France might be low but the overall cost of hiring is quite high. Besides, why go to the middle when you can just find even cheaper places, if that's your prime metric?

renewiltord 3 hours ago | parent | prev [-]

The reason the French can’t build these things is the same reason they shouldn’t be allowed to be in charge. It’s a preprint PDF host. Just make your own if you can run this one.

magnio 3 hours ago | parent [-]

They do have their own: https://hal.science/

It is actually quite common to come across HAL in subfields of mathematics in my experience.

bjourne 2 hours ago | parent [-]

HAL is decidedly second-tier. Given the option, everyone would pick arXiv over HAL. Hence, HAL hosts lots of stuff that didn't (even) make it to arXiv => lots of subpar dredge.

Miraltar an hour ago | parent [-]

> HAL is decidedly second-tier. Given the option, everyone would pick arXiv over HAL.

Can you elaborate on that?

Hendrikto 3 hours ago | parent | prev | next [-]

For anybody outside the SV, and especially outside the US, this seems high, yes.

arXiv does not need to and should not optimize for “shareholder value”, which is at least nominally the justification for outlandish CEO pay packages.

jjk166 13 minutes ago | parent | next [-]

$300k for a top executive position isn't especially high for anywhere in the US. That's around what the administrative director of a hospital would be making, which seems like a much smaller scope than leading arXiv. For comparison, my roommate works for a non-profit that serves Philadelphia whose CEO's salary is $1.1 million. The CEO of the Wikimedia Foundation, which is similar in terms of role, has a salary of $450k. The general average for US CEOs, including for-profits, is around $800k, and for large organizations tens of millions is not atypical.

Non-profits aren't maximizing stock value, but they do need to optimize for stakeholder value - you want to maximize the amount of money being donated in and you want to make the most of the donations you receive, both to advance the primary mission of the non-profit and to instill confidence in donors. This demands competent leadership. The idea that just because something is not being done for profit means the value of the person's contributions is worth less is absurd. So long as the CEO provides more than $300k of value by leading the organization, which might include access to their personal connections, then the salary is sensible.

kingstnap an hour ago | parent | prev [-]

arXiv doesn't need much. All they do is host static pdfs uploaded by someone else with free CDN services from Fastly [0]. I'm sure they could get academics to volunteer moderation services as well.

In reality you could host the entire thing for well under $50k/year in hardware and storage if someone else is providing a free CDN. Their costs could be incredibly low.

But just like Wikipedia, I see them very likely very quickly becoming a money hole that pretends to barely be kept afloat by donations. All when in reality what's actually happening is that a ridiculous number of rent seekers have managed to ride the coattails of being the de facto preprint server for AI papers to land themselves cushy jobs at a place that spends 90+% of its money on flights and hotels and wages for its staff.

I'm already expecting their financial reports to look ridiculously headcount-heavy, with Personnel Expenses, Meetings and Travel blowing up. As well as the classic Wikipedia-style "we spend a ton of money on unclear costs" [1].

What's already sad is that they stopped publishing a real broken-down report that actually showed things. Look at this beautiful screenshot of an Excel sheet. Imagine if Wikipedia produced anything this clear. [2]

[0] https://blog.arxiv.org/2023/12/18/faster-arxiv-with-fastly/

[1] https://info.arxiv.org/about/reports/FY26_Budget_Public.pdf

[2] https://info.arxiv.org/about/reports/2020_arXiv_Budget.pdf

OneDeuxTriSeiGo 4 minutes ago | parent [-]

> arXiv doesn't need much. All they do is host static pdfs uploaded by someone else with free CDN services from Fastly [0]. I'm sure they could get academics to volunteer moderation services as well.

This just isn't true. arXiv nowadays has to deal with major moderation demands due to the influx of absolute drivel, spam, and slop that non-academics and less-than-quality academics have been uploading to the site.

Moderation for arXiv isn't perfect or comprehensive but they put so much work into trying to keep the worst of the content off their site. At this point while they aren't doing full blown peer review, they are putting a lot of work into providing first pass moderation that ensures the content in their academic categories is of at least some level of respectable academic quality.

HappyPanacea 3 hours ago | parent | prev [-]

arXiv's CEO doesn't need to be a tenured professor equivalent; it is a preprint repository ffs.

0x3f 3 hours ago | parent [-]

It's a bit more complex than an S3 bucket though because the value comes from the reputation network, which can't really be replicated easily.

Though, saying that, I suppose all the reputation data is kind of public. Apart from emails/accounts.

groundzeros2015 41 minutes ago | parent [-]

> It's a bit more complex than an S3 bucket

It’s even less. I would bet if it’s not now, for the vast majority of its life it was a machine at someone’s desk at Cornell.

losvedir 3 minutes ago | parent | prev | next [-]

arXiv is great. It's just a problem that there's so much slop. What if arXiv offered a subscription service that people in different fields could use to just see a curated selection of the top papers in their field each month? Established researchers in each field could then review some of the preprints for putting into the curated monthly list.

Oh, wait.

halperter 8 hours ago | parent | prev | next [-]

Statement by arXiv: https://tech.cornell.edu/arxiv/

reed1234 8 hours ago | parent [-]

Should be the main link. The original article is based on the CEO job posting.

taormina 32 minutes ago | parent | prev | next [-]

Given that Cornell charges what, $50k a year as an Ivy League, $300k feels like almost nothing.

PaulHoule 25 minutes ago | parent [-]

This is going to be in NYC where $300k does not go as far as it does in Ithaca.

psalminen 8 hours ago | parent | prev | next [-]

I might be missing something, but I still don't get the why. I don't see any "problem" that needs to be solved.

kolinko 7 hours ago | parent | next [-]

The article lists the reasons quite clearly.

binsquare 6 hours ago | parent [-]

For everyone else,

The reason is that arXiv is growing significantly, leading to a deficit of 297,000 in operating costs for 2025 alone. Cornell has helped with donations, along with other organizations that pay membership fees.

As a result, donors + leaders of arxiv think it's best to spin off to increase funding.

sanex 44 minutes ago | parent | next [-]

Now they're going to have a deficit of 600,000 in operating costs.

vl 5 hours ago | parent | prev [-]

What is unclear is why they need a staff of 27 and 6.7 million to operate an essentially static hosting website in 2026.

swiftcoder 4 hours ago | parent | next [-]

The "essentially static hosting" isn't the cost centre (although with 5 million MAU, it's nothing to sneeze at). The real costs are on the input side - they have an ingestion pipeline that ensures standardised paper formatting and so on, plus at least some degree of human review.

bonoboTP 4 hours ago | parent | next [-]

Do you mean that the CPU compute cost of turning latex into pdf/HTML is the main cost?

swiftcoder 4 hours ago | parent [-]

No, I mean that the pipeline requires software engineers to build/maintain, and salaries are (as in basically every tech organisation) the dominant cost

bonoboTP 4 hours ago | parent [-]

Then drop it and make people upload a pdf and a zip of the latex sources.

Most people I talk to hate that pipeline and spend a lot of debug hours on it when Arxiv can't compile what overleaf and your local latex install can.

domoritz 2 hours ago | parent [-]

Arxiv can recompile latex to support accessibility and html. Going to pdf submissions would be a major step backward.

bonoboTP 2 hours ago | parent [-]

Make it an external service then, and leave the thing that's already working great to just be.

The reason authors like and use arxiv is that it gives 1) a timestamp, 2) a standardized citable ID, and 3) stable hosting of the pdf. And readers like the no-nonsense single click download of the pdf and a barebones consistent website look.

All else is a side show.

lou1306 4 hours ago | parent | prev [-]

The PDF formatting is anything but standardised. They ingest LaTeX sources, which are formatted according to the authors' whims (most likely, according to whatever journal or conference they just submitted the manuscript to). I'll concede that the (relatively novel) HTML formatter gives papers a more uniform appearance. They also integrate a bunch of external services for, e.g., citation metrics and cross-references. Still hard to justify such a high cost to operate, but eh.

Also, the "human review" is a simple moderation process [1]. It usually does not dig into the submission's scientific merits.

[1] https://info.arxiv.org/help/moderation/index.html

OtherShrezzing 3 hours ago | parent | prev [-]

I don't see it as an especially exorbitant structure or budget. I've seen larger teams with bigger budgets struggle to maintain smaller applications.

I've contracted into some consultancy teams which you could uncharitably describe as "15 people and $4mn/yr to create one PDF per month".

u1hcw9nx 7 hours ago | parent | prev [-]

I think the problem described in the 6th paragraph needs to be solved.

contubernio 2 hours ago | parent | prev | next [-]

What is worrisome about this development, and corollary actions like the hiring of a CEO with a $300,000/year salary, is that the essentially independent and community based platform will disappear. The ArXiv exists because mathematicians and physicists, and later computer scientists and engineers, posted there, freely, their work, with minimal attention to licensing and other commercial aspects. It has thrived because it required no peer review and made interesting things accessible quickly to whomever cared to read them.

A setup as a US-based "non-profit" is worrisome, if only because 300K is an obscene salary even in a for-profit setting. That the US-based posters can't see this is evidence of the basic problem which is that the US, both left and right, has been taken over by a neoliberal feudal antidemocratic nativist mindset that is anathema to the sort of free interchange of ideas that underlay the ArXiv's development in the hands of mathematicians and physicists now swept aside and ignored by machine learning grifters and technicians who program computers.

doctorwho42 17 minutes ago | parent [-]

As a US-based academic, I have to say that when I saw the salary I immediately gawked. I think it's not Americans but Silicon Valley-ites and tech bros on here, who have lived with inflated salaries and net worth, that think it's just a middle-of-the-road salary. I regularly interact with friends in engineering who make like $200k + benefits ($), and I wonder why I don't jump ship to that weird land.

bonoboTP 4 hours ago | parent | prev | next [-]

I fear their Mozilla-ification and Wikipedia-ification. Scope creep, various outreach feel-good programs, ballooning costs, lost focus etc. And other types of enshittification.

Any change to the basic premise will be a negative step.

They should just be boring quiet unopinionated neutral background infrastructure.

Hendrikto 3 hours ago | parent | next [-]

> Mozilla-ification

All the Mozilla executives have done for the last 15+ years is

* lay off developers

* spend lots of money on stupid side projects nobody asked for or wants

* increase their own salaries

and all that with the backdrop of falling quality, market share, and relevance.

I would happily donate to Firefox, but this fucked up organization will never see a single cent from me. They will spend it on anything but Firefox, which is the only thing anybody wants them to spend it on.

It might already be too late, and we will be left with a browser monopoly.

swed420 36 minutes ago | parent | next [-]

> It might already be too late, and we will be left with a browser monopoly.

Ladybird continues to have the appearance of making progress, fwiw:

https://ladybird.org/newsletter/2026-02-28/

cge 31 minutes ago | parent | prev | next [-]

>They will spend it on anything but Firefox, which is the only thing anybody wants them to spend it on.

Mozilla certainly won’t spend it on Firefox, because the structure of the organization legally prohibits them from spending any of their donation money on Firefox. The ‘side projects’ are, at least officially, the real purpose of Mozilla.

bonoboTP 12 minutes ago | parent [-]

They built the brand on Firefox then did a bait and switch. How many people who donate to Mozilla know that it's not helping Firefox?

But yeah, this is just how it works. Things can't stay good for too long. One must always be on the lookout for the new small thing that's not yet corrupted. Stay with it for a while until it rots, then jump to the next replacement.

bonoboTP 2 hours ago | parent | prev [-]

And it is a risk for Arxiv too: once they start to drink the koolaid and start going to the same cocktail parties that these kinds of nonprofit board members and execs go to, they will feel the need to prance around with some fancy stuff.

"oh no, you see we are not a preprint server host anymore, our mission is a values driven blablabla to make a meaningful change in the blablabla, we have spent X dollars to promote the blablabla, take me seriously please I'm also fancy like you! "

kergonath 3 hours ago | parent | prev [-]

> They should just be quiet unopinionated neutral background infrastructure.

Exactly. It should be a utility. Not quite dumb pipe, but not too far either.

doctorwho42 15 minutes ago | parent [-]

We don't do 'utility' in America. Everything has S.V. brain rot - it's mixed with wall street brain rot, and now if you aren't extracting wealth out of what you have access to - you are failing.

asimpleusecase 6 hours ago | parent | prev | next [-]

I wonder if there are plans to licence the content for AI training

mkl 4 hours ago | parent | next [-]

It's been available all along: https://info.arxiv.org/help/bulk_data.html

KellyCriterion 6 hours ago | parent | prev [-]

I'd guess OAI & co have already copied without asking?

mkl 4 hours ago | parent [-]

No need to ask - the whole point is open access. https://info.arxiv.org/help/bulk_data.html

dataflow 8 hours ago | parent | prev | next [-]

This sounds terrible. Of course there's a huge risk of it being made for-profit. It almost makes you wonder if the academic publishers are behind this push somehow.

Could they not have made it into some legal structure that puts universities at the top? Say, with a bunch of universities owning shares that comprise the entirety of the ownership of arXiv, but that would allow arXiv to independently raise funds?

gucci-on-fleek 8 hours ago | parent [-]

> Of course there's a huge risk of it being made for-profit.

The article says that "it will become an independent nonprofit corporation", and as OpenAI's failed attempt showed, converting a non-profit to a for-profit organization is either really hard or impossible.

> Could they not have made it into some legal structure that puts universities at the top?

As a corporation (even a non-profit one), it will have a board of directors. I have no idea what their charter will look like, but I would be surprised if at least one seat wasn't reserved for a university representative, and more than that seems quite likely as well.

MostlyStable 8 hours ago | parent | next [-]

OpenAI didn't get everything that they wanted, but I very much disagree with calling it a "failed attempt". The non-profit went from owning the entirety of OpenAI to having a ~25% stake.

ronsor 8 hours ago | parent | next [-]

Sam Altman is a special kind of person; not many could pull off the schemes he does.

gentleman11 7 hours ago | parent [-]

I doubt it was him who architected it. A team of lawful evil lawyers more likely

cbolton 5 hours ago | parent | prev | next [-]

The non-profit still controls the board, doesn't it?

weedhopper 5 hours ago | parent [-]

As shown by Altman, not really.

gucci-on-fleek 8 hours ago | parent | prev [-]

Ah, thanks for the correction.

mort96 3 hours ago | parent | prev [-]

Is your argument really that "OpenAI was an independent nonprofit corporation and it worked out great, Arxiv will remain just as non-profit as OpenAI"?

gucci-on-fleek 3 hours ago | parent [-]

No, my argument is that OpenAI could make billions of dollars if they converted from a non-profit to a for-profit, and they only succeeded after years of effort and because they had already structured the company into separate for-profit and non-profit entities. And even after all this, the non-profit still controls the majority of the for-profit entity.

So if OpenAI with billions of dollars only partially succeeded at converting to a for-profit business, then that suggests that organizations with fewer resources (like arXiv) have much worse odds.

juped 2 hours ago | parent | prev | next [-]

>Cornell, for example, had a limited capacity to pay software developers to maintain and upgrade the site, which still has a very no-frills look and feel.

arXiv is doomed. It was nice while it lasted.

oscaracso an hour ago | parent [-]

I am not a software engineer, although I do write programs. What is it about digital infrastructure that requires maintenance? In the natural world, there is corrosion, thermal fluctuation, radiation, seismic activity, vandalism, whathaveyou. What are the issues facing the arxiv demanding the attention of multiple people 'round the clock?

bonoboTP an hour ago | parent [-]

They have to update the software stack, replace usage of deprecated APIs, support new latex packages etc. They could probably minimize these by limiting the scope but just keeping a small, tightly scoped software functional is always boring, people want to work on fun new features, they enjoy the brand recognition and feel like they should do more stuff.

I wonder when they will introduce the algorithmic feed and the social network features.

tornikeo 8 hours ago | parent | prev | next [-]

Now the question is, will arxiv wage a decade long bloody war with Cornell, using heavy infantry (PhD students), archers (reviewers) and field artillery (AI slop papers), or will the independence be mostly peaceful? Only time can tell.

alansaber 8 hours ago | parent [-]

PhD students are levy infantry at best with Postdocs being the armoured levies.

dmos62 6 hours ago | parent [-]

Is this Gondor or Mordor?

Aerolfos 5 hours ago | parent | prev | next [-]

And they hired a LinkedIn business idiot to run the new organization - so the aim is for an infinite growth tech startup in terms of governance, despite the technical legal status of non-profit. It shows in the language they use in the announcement, too ("improved financial viability in the long run")

OpenAI shows exactly how well that works and what that kind of governance does to a company and to its support of science and the commons.

TL;DR, it's fucked.

vedantxn 3 hours ago | parent | prev | next [-]

we got this before gta 6

Garlef 6 hours ago | parent | prev | next [-]

Maybe they should implement a graph based trust system:

You need your favourite academic gatekeeper (= thesis advisor) to vouch for you in order to be allowed to upload.

Then AI slop gets flagged and the shame spreads through the graph. And flaggings need to have evidence attached that can again be flagged.
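
A rough sketch of what that could look like, purely as an illustration. Nothing here reflects how arXiv actually works; the class names, the `decay` factor, and the idea of a set of trusted "roots" are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    paper_author: str
    evidence: str            # a flag without evidence is rejected
    disputed: bool = False   # a flag can itself be flagged


@dataclass
class TrustGraph:
    decay: float = 0.5                              # assumed: shame halves per hop
    voucher_of: dict = field(default_factory=dict)  # author -> who vouched for them
    flags: list = field(default_factory=list)

    def vouch(self, voucher: str, newcomer: str) -> None:
        self.voucher_of[newcomer] = voucher

    def can_upload(self, author: str, roots: set) -> bool:
        # Allowed only if the vouching chain reaches a trusted root.
        seen = set()
        while author not in roots:
            if author in seen or author not in self.voucher_of:
                return False
            seen.add(author)
            author = self.voucher_of[author]
        return True

    def flag(self, author: str, evidence: str) -> Flag:
        if not evidence.strip():
            raise ValueError("flags need evidence attached")
        f = Flag(author, evidence)
        self.flags.append(f)
        return f

    def shame(self) -> dict:
        # Walk each undisputed flag up the vouching chain with geometric decay.
        scores = {}
        for f in self.flags:
            if f.disputed:
                continue
            weight, node, seen = 1.0, f.paper_author, set()
            while node is not None and node not in seen:
                scores[node] = scores.get(node, 0.0) + weight
                seen.add(node)
                node = self.voucher_of.get(node)
                weight *= self.decay
        return scores
```

So if an advisor vouches for a student who vouches for a friend, one flag on the friend's paper costs the friend 1.0, the student 0.5 and the advisor 0.25. Marking a flag as disputed removes it from the tally, which is the "flaggings can again be flagged" part.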

justinnk 6 hours ago | parent | next [-]

They already had a basic form of this for a while [1]

> arXiv requires that users be endorsed before submitting their first paper to arXiv or a new category.

[1] https://info.arxiv.org/help/endorsement.html

pred_ 6 hours ago | parent | prev | next [-]

The endorsement system already works along that line: https://info.arxiv.org/help/endorsement.html

It's probably not perfect but in practice, it seems to have been enough to get rid of the worst crackpotty spam.

ryangibb 6 hours ago | parent | prev | next [-]

You mean like endorsement? https://info.arxiv.org/help/endorsement.html

dmos62 6 hours ago | parent | prev | next [-]

I've often thought that similar trust systems would work well in social media, web search, etc., but I've never seen it implemented in a meaningful way. I wonder what I'm missing.

IshKebab 6 hours ago | parent [-]

Lobsters has this I think. But it also means I've never posted there.

ChrisGreenHeur 5 hours ago | parent | prev [-]

Science reduced to people with a phd?

budman1 an hour ago | parent [-]

not a bad first order filter.

can you think of a better one?

OutOfHere 7 hours ago | parent | prev | next [-]

With 300K for the CEO, its enshittification will commence imminently. It will now serve to maximize revenue. Just wait and watch while they issue a premium membership, payment requirements for authors, and other revenue generators to please their investors.

exe34 7 hours ago | parent [-]

they'll just turn into a shitty journal at this point, they just need to introduce peer review and they can start competing with the real journals on price point.

another will need to rise to take its place.

OutOfHere 7 hours ago | parent [-]

> they'll just turn into a shitty journal at this point

To this end, they added an endorsement requirement this year: https://blog.arxiv.org/2026/01/21/attention-authors-updated-...

Peteragain 7 hours ago | parent | prev | next [-]

.. and soon to be dependent on US military funding? Controlled by someone who has run-ins with universities? This'll end in tears.

shevy-java 5 hours ago | parent | prev | next [-]

"Recently arXiv’s growth has accelerated. Since 2022, it has expanded its staff to 27, in large part to deal with a 50% increase in submitted manuscripts."

I am wary of that. IMO the business model is damaged therein. You can say in 2022 we had 27; bankrupt in 2030.

adamnemecek 8 hours ago | parent | prev | next [-]

Good call, ArXiv seems like one of the most important institutions out there right now.

kergonath 3 hours ago | parent | next [-]

The French government put a bit of money on the table to help researchers fulfil their open science requirements for government and EU grants, and funded the HAL repository ( https://hal.science/ ). It’s much smaller than arXiv, but it exists. In other countries like the UK there are clusters of smaller repositories as well, but it’s not as well centralised.

p-e-w 8 hours ago | parent | prev | next [-]

It’s so important, in fact, that there should be more than one such institution.

People keep falling into the same trap. They love monopolies, then are shocked when those monopolies jerk them around.

auggierose 8 hours ago | parent | next [-]

I have been using Zenodo for a while now instead. It is more user friendly, as well.

Al-Khwarizmi 5 hours ago | parent | next [-]

I like it as well, it works great. But I wonder if it would scale if at some point there were a massive exodus from arXiv.

auggierose 4 hours ago | parent [-]

I think it already hosts much more data than arXiv, given that they also host large datasets.

mastermage 7 hours ago | parent | prev [-]

Zenodo is more for IT Papers and also datasets isn't it?

auggierose 6 hours ago | parent [-]

It can host large datasets as well, yes. It is hosted by CERN, so it is not specifically IT in any way. It also allows you to restrict access to the files of your submission. It has no requirements to submit your LaTeX sources, any PDF will be fine. There are also no restrictions on who can publish. You'll get a DOI, of course.

Everything published on arXiv could also be published on Zenodo, but not the other way around.

freehorse 6 hours ago | parent | prev | next [-]

It is just a preprint repository. It is pretty open (the stories where a preprint was rejected or delayed unreasonably are extremely rare). It offers the basic services for a math/compsci/physics themed preprint repository.

I don't see much of a monopoly, nor any "moat" apart from it being recognised. You can already post preprints on a personal website or on github, and there are "alternatives" such as researchgate that can also host preprints, or zenodo. There are even some lesser known alternatives. I do not see anything special in hosting preprints online apart from the convenience of being able to have a centralised place to place them and search for them (which you call "monopoly"). If anything, the recognisability and centrality of arxiv helped a lot in the old, darker days to establish open access to papers. There was a time when many journals would not let you publish a preprint, or had all kinds of weird rules about when you can and when you can't. Probably still the case to some degree.

andbberger 8 hours ago | parent | prev [-]

there is. bioRxiv.

koakuma-chan 7 hours ago | parent | prev [-]

it just hosts pdfs, no?

aragilar 6 hours ago | parent | next [-]

It does do a fair amount of filtering of submissions, and it's a long term archive (e.g. for the next 100+ years). I suspect both (but with the former dominating) are the issue.

bonoboTP 4 hours ago | parent [-]

Just put out a torrent and people of the sort at r/DataHoarder will keep it alive for longer than bureaucrats.

freehorse 6 hours ago | parent | prev | next [-]

Well, technically, it can also compile your tex file if you upload the tex file instead of the pdf directly, which helps a lot in standardizing the stylistic structure between preprints. Most other repositories are wild west and inconsistent. I really appreciate the similarity in style applied to most preprints there. Moreover, this means you can also download not just the pdf, but the source tex file too, which can be very useful.

bonoboTP 4 hours ago | parent [-]

The similarity in style comes from conference and journal templates, not from Arxiv. You can style your paper with latex in any style, Arxiv doesn't care. On Arxiv you mostly see preprints that people submit to conferences and journals and they enforce the style.

pfortuny 6 hours ago | parent | prev | next [-]

Also the sources and has a very tame but useful pre-acceptance process.

IshKebab 6 hours ago | parent | prev [-]

Technically yes, socially no.

ACCount37 3 hours ago | parent | prev | next [-]

Frankly, the only beef I have with arXiv as is: its insistence on blocking AI access.

I had to tell my AI to set up an MCP for "fetch while bypassing arXiv's rate limit" so that it doesn't burn 40k tokens looking for workarounds every time it wants to look at a paper and gets hit with a "sorry, meatbags only" wall.

Very annoying, given how relevant arXiv papers are for ML specifically, and how many papers there are. Can't "human flesh search" through all of them to pick the relevant ones for your work, and they just had to insist on making it harder for AIs to do it too.

davnicwil 7 hours ago | parent | prev [-]

Very unrelated to the article, but I think 'arXiv' as a brand is bad, and really detrimental to what the institution aims to accomplish.

That is, it's not readily parseable, it really gives an insider term vibe - like this isn't for you if you don't already know what it means or how you should read or say it. It sort of reminds me of the overuse of latin and latinate terms generally in the old professions and, well, the academy.

Just always struck me as being somewhat at odds with the goal.

john-titor 7 hours ago | parent | next [-]

I wonder what makes you feel that. I've been publishing preprints on arxiv for close to a decade now and never had any particular feelings about it.

To me it's just a way to get out your work fast, so that there is already a trace of it on the Internets - nothing more and nothing less.

> That is, it's not readily parseable, it really gives an insider term vibe...

Isn't that normal with highly specialized research fields? I agree many papers could benefit from clearer wording, but working in a niche means you sometimes don't reach a broader audience

davnicwil 7 hours ago | parent [-]

It's an opinion, and you feeling no particular way about it is equally valid.

But I did justify and maybe to reword slightly, surely if one of the main drivers is opening up research, the brand name should be something that's less obscure and more accessible / understandable as to what it is on first sight?

Maybe arXiv evoking the word 'archive' with an ancient Greek twist does that for some, but it's clearly a bit cryptic for many, and if the point is to open up probably the brand should just be something much plainer.

aragilar 6 hours ago | parent | next [-]

No, it's to be a pre-print server. If someone doesn't know what that means, then they shouldn't be using arXiv.

davnicwil 6 hours ago | parent [-]

everyone has a first time they see a thing and don't yet know what it is.

Using a brand as a filter where you have to already know what it means to get it is exactly the opposite of what it's supposed to achieve.

Consider the most exclusive (successful) brands that exist. Even there, where exclusivity is a brand goal, none of them have this property of being obscure on first contact.

bonoboTP 4 hours ago | parent [-]

You usually get introduced to it by your academic supervisor or collaborators as a masters or PhD student. If you're a solo researcher who has made a significant contribution on the frontier of science, I'm sure you'll be able to understand how Arxiv works as well. Because I assume you have had some conversations with other experts in the field. If you're a full on autodidact with no contact to any other researchers in the field, well, maybe it's better if you chat with some other people in that field.

It's reasonable to have a tradeoff here to avoid cranks and now AI psychosis slop. You can still post on ResearchGate and academia.edu or your own GitHub page or webhosting.

Cordiali 3 hours ago | parent | prev [-]

I've never even connected the 'X' to the Greek letter chi. I just kinda accepted it as one of many groovy web 2.0 misspellings in search of a domain and trademark.

jltsiren 7 hours ago | parent | prev | next [-]

It's a classic story of someone having to pick a name quickly, which then gets established long before anyone who cares about branding is aware of its existence.

The original service didn't even have a name, only a description, and it was amusingly hosted at xxx.lanl.gov. But LANL wasn't really interested in it, and the founder eventually left for Cornell. At that point, the service needed a domain name, but archive.org was already taken.

And besides, the name has Ancient Greek influences. A similar Latinate term might be something like "archive".

bonoboTP 3 hours ago | parent | next [-]

I thought the X was an allusion to LaTeX.

davnicwil 6 hours ago | parent | prev [-]

Interesting, thanks for the context! Makes it more understandable as a choice.

nixon_why69 7 hours ago | parent | prev | next [-]

> like this isn't for you if you don't already know what it means

Isn't that actually kind of a good brand signal for a repo of very specialized papers? "Fun with learning" in comic sans wouldn't help credibility.

vasco 7 hours ago | parent | prev [-]

This is the type of guy that will suggest paper.ly as a better name with a straight face and then we wonder why the internet is turning to shit