Paper2Agent: Stanford Reimagining Research Papers as Interactive AI Agents (arxiv.org)
152 points by Gaishan 3 days ago | 45 comments
V__ 3 days ago | parent | next [-]

> Conventional research papers require readers to invest substantial effort to understand and adapt a paper's code, data, and methods to their own work [...]

But that's the point! If we take away the effort to understand, to really understand something on a deeper level, even in research, then how can anything useful be built on top of it? Is everything going to lose its depth and become shallow?

IanCal 3 days ago | parent | next [-]

They’re not talking about removing any effort to understand a paper, but to lower it for the same level of understanding.

If requiring more effort to reach the same understanding were a good thing, we should be making papers much harder to read than they currently are.

Why are the specific things they are doing a problem? Automatically building the pipelines and code described in a paper, checking that they match the reported results, and then being able to execute them for the user's own queries - is that a bad thing for understanding?

aleph_minus_one 2 days ago | parent [-]

> They’re not talking about removing any effort to understand a paper, but to lower it for the same level of understanding.

Much more damage is done if the understanding that you get is wrong.

SecretDreams 3 days ago | parent | prev | next [-]

I'm just imagining someone trying to defend their PhD or comprehensive exam with only surface-level knowledge of the shit their AI bot has cited for them.

FYI - this is actually happening right meow. And most young profs are writing their grants using AI. The biggest issue with the latter? It's hard to tell the difference, given how many grants just rehash the same stuff over and over.

ethin 3 days ago | parent | prev | next [-]

Isn't this also a problem given that ChatGPT, at least, is bad at summarizing scientific papers [1]? Idk about Claude or Gemini on that front, though. Still a problem.

Edit: spelling.

[1]: https://arstechnica.com/ai/2025/09/science-journalists-find-...

andai 3 days ago | parent [-]

This study seemed to be before the reasoning models came out. With them I have the opposite problem. I ask something simple and it responds with what reads like a scientific paper.

ijk 2 days ago | parent [-]

Of course "reads like" is part of the problem. The models are very good at producing something that reads like the kind of document I asked for and not as good at guaranteeing that the document has the meaning I intended.

andai a day ago | parent [-]

That is true. What I meant was, I'll ask it for some practical problem I'm dealing with in my life, and it will start talking about how to model it in terms of a cybernetic system with inertia, springs and feedback loops.

Not a bad line of thinking, especially if you're microdosing, but I find myself turning off reasoning more frequently than I'd expected, considering it's supposed to be objectively better.

ijk a day ago | parent [-]

I find that for more "intuitive" evaluations, reasoning tends to hurt more than it helps. In other words, if it can do a one-shot classification correctly, adding a bunch of second guessing just degrades the performance.

This may change as our RL methods get better at properly rewarding correct partial traces and penalizing overthinking, but for the moment there's often a stark difference when a multi-step process improves the model's ability to reason through the context and when it doesn't.

This is made more complicated (for human prompters and evaluators) by the fact that (as Anthropic has demonstrated) the text of the reasoning trace means something very different for the model versus how a human interprets it. The reasoning the model claims it is doing can sometimes be worlds away from the actual calculations (e.g., how it uses helical structures to do addition [1]).

[1] https://openreview.net/pdf?id=CqViN4dQJk

aprilthird2021 3 days ago | parent | prev | next [-]

This is what's so depressing about the Apple Intelligence or Gemini ads for consumer AI. Everything they tell us an AI can do for us: make up a bedtime story for our kids, write a letter from a kid to his/her hero, remember the name of someone you forgot from earlier, or sum up a presentation you forgot to read.

Isn't the point to put the time into those things? At some point aren't those the things one should choose to put time into?

exe34 3 days ago | parent [-]

if you're making up stories for your kid, you're not spending enough time consuming media that apple can profit from.

eric-burel 3 days ago | parent | prev | next [-]

Talk to engineers: many of them just fear research papers. It's important to have alternative ways of consuming research. Then maybe some engineers will jump the fence and pick up the habit of reading papers.

backflippinbozo a day ago | parent | next [-]

AI & ML engineering in particular is very research-adjacent.

That's why we began building agents to source ideas from the arXiv and implement the core-methods from the papers in YOUR target repo months before this publication.

We shared the demo video of it in our production system a while back: https://news.ycombinator.com/item?id=45132898

And we're offering a technical deep-dive into how we built it tomorrow at 9am PST with the AG2 team: https://calendar.app.google/3soCpuHupRr96UaF8

We've built up to 1K Docker images over the past couple months which we make public on DockerHub: https://hub.docker.com/u/remyxai

And we're close to an integration with arXiv that will have these pre-built images linked to the papers: https://github.com/arXiv/arxiv-browse/pull/908

viraptor 2 days ago | parent | prev | next [-]

A lot of them use obscure vocabulary and sciency notation to express very basic ideas. It's like some switch flips: "this is a PAPER, it needs fancy words!"

I'd actually like a change from the other end. Instead of "make agents so good they can implement complex papers", how about "write papers so plainly that current agents can implement a reproduction"?

randomfrogs 2 days ago | parent [-]

Scientific vocabulary is designed to be precise. The reason papers are written the way they are is to try to convey ideas with as little chance of misinterpretation as possible. It is maddeningly difficult to do that - I can't tell you how many times I've gotten paper and grant reviews where I cannot fathom how Reviewer 2 (and it's ALWAYS Reviewer 2) managed to twist what I wrote into what they thought I wrote. Almost every time you see something that seems needlessly precise and finicky, it's probably in response to a reviewer's comment, and the secret subtext is "There - now it's so overspecified that even a rabid wildebeest, or YOU, dear reviewer, couldn't misunderstand it!" Unfortunately, a side effect of that is that a lot of the writing ends up seeming needlessly dense.

3 days ago | parent | prev [-]
[deleted]
zaptheimpaler 3 days ago | parent | prev [-]

Easy to say, but have you ever read a paper and then a summary or breakdown of that paper by an actual person? Or compared a paper that you do understand very well with how you would explain it in a blog post?

The academic style of writing is almost purposefully as obtuse and dense and devoid of context as possible. Academia is trapped in all kinds of stupid norms.

bryanrasmussen 3 days ago | parent | prev | next [-]

Kill me now.

Yes, I will get right on that. I believe that killing you is the right strategy to help you escape from a world where AI takes over every aspect of human existence in such a way that all those aspects are degraded.

I'm still alive.

That is a very good point, and I am sorry.

bookofjoe 3 days ago | parent | next [-]

"I Have No Mouth, and I Must Scream" — Harlan Ellison

https://www.are.na/block/26283461

oersted 2 days ago | parent | prev [-]

Still Alive (seems appropriate)

https://youtu.be/Y6ljFaKRTrI?si=yz8EOHdN8qdEWoH_

andy99 3 days ago | parent | prev | next [-]

Earlier today there was a post about someone submitting an incorrect AI generated bug report. I found one of the comments telling:

https://news.ycombinator.com/item?id=45331233

> Is it that crazy? He's doing exactly what the AI boosters have told him to do.

I think we're starting to see the first real AI "harms" shake out, after some years of worrying it might swear or tell you how to make a molotov cocktail.

People are getting convinced, by hype men and by sycophantic LLMs themselves, that access to a chatbot suddenly grants them polymath abilities in any field, and are acting out their dumb ideas without pushback, until the buck finally stops, hopefully with just some wasted time and reputation damage.

People should of course continue to use LLMs as they see fit - I just think the branding of work like this gives the impression that they can do more than they can, and will encourage the kind of behavior I mention.

simonw 3 days ago | parent | prev | next [-]

I went looking for how they define "agent" in the paper:

> AI agents are autonomous systems that can reason about tasks and act to achieve goals by leveraging external tools and resources [4]. Modern AI agents are typically powered by large language models (LLMs) connected to external tools or APIs. They can perform reasoning, invoke specialized models, and adapt based on feedback [5]. Agents differ from static models in that they are interactive and adaptive. Rather than returning fixed outputs, they can take multi-step actions, integrate context, and support iterative human–AI collaboration. Importantly, because agents are built on top of LLMs, users can interact with agents through human language, substantially reducing usage barriers for scientists.

So more-or-less an LLM running tools in a loop. I'm guessing "invoke specialized models" is achieved here by running a tool call against some other model.
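Roughly this shape, presumably (a minimal sketch with the OpenAI Python client; the single run_pipeline tool and its schema are invented for illustration):

    # Sketch of "an LLM running tools in a loop"; the tool is hypothetical.
    import json
    from openai import OpenAI

    client = OpenAI()

    def run_pipeline(input_path: str) -> str:
        """Stand-in for a paper's analysis pipeline exposed as a tool."""
        return f"pipeline finished on {input_path}"

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "run_pipeline",
            "description": "Run the paper's analysis pipeline on a data file.",
            "parameters": {
                "type": "object",
                "properties": {"input_path": {"type": "string"}},
                "required": ["input_path"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Run the pipeline on data/sample.csv"}]
    while True:
        resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:  # plain-text answer: the loop is done
            print(msg.content)
            break
        messages.append(msg)
        for call in msg.tool_calls:  # run each requested tool, feed the result back
            args = json.loads(call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_pipeline(**args),
            })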

backflippinbozo a day ago | parent | next [-]

Yeah, probably pretty simple compared to the methods we've publicly discussed for months before this publication.

Here's the last time we showed our demo on HN: https://news.ycombinator.com/item?id=45132898

We'll actually be presenting on this tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8

Besides ReAct, we use AG2's 2-agent pattern, with a Code Writer and a Code Executor backed by the DockerCommandLineCodeExecutor.

We also use hardware monitors and LLM-as-a-Judge to assess task completion.
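In rough outline, the writer/executor pairing looks like this (a sketch following AG2's documented code-executor API; the model, image, and prompts are placeholders, not our production settings):

    # Sketch of the Code Writer / Code Executor pattern in AG2 (placeholder config).
    import os
    from pathlib import Path
    from autogen import ConversableAgent
    from autogen.coding import DockerCommandLineCodeExecutor

    Path("workspace").mkdir(exist_ok=True)
    executor = DockerCommandLineCodeExecutor(
        image="python:3.11-slim",  # container the generated code runs in
        timeout=300,
        work_dir="workspace",
    )

    code_writer = ConversableAgent(
        "code_writer",
        system_message="Write Python in ```python blocks to set up and run the paper's quickstart.",
        llm_config={"config_list": [{"model": "gpt-4o", "api_key": os.environ["OPENAI_API_KEY"]}]},
        code_execution_config=False,  # this agent only writes code
    )

    code_executor = ConversableAgent(
        "code_executor",
        llm_config=False,  # no LLM here: it just executes code blocks
        code_execution_config={"executor": executor},  # execution happens inside Docker
        human_input_mode="NEVER",
    )

    code_executor.initiate_chat(
        code_writer,
        message="Install the repo's dependencies and run its quickstart example.",
        max_turns=6,
    )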

It's how we've built nearly 1K Docker images for arXiv papers over the last couple months: https://hub.docker.com/u/remyxai

And how we'll support computational reproducibility by linking Docker images to the arXiv paper publications: https://github.com/arXiv/arxiv-browse/pull/908

eric-burel 3 days ago | parent | prev | next [-]

An LLM running tools in a loop is the core idea of ReAct agents, and it is indeed one of the most effective ways to extract value from generative AI. Ironically, it's not about generation at all: we use the model's classification skills to pick tools and its text-processing skills to take the context into account.

ijk 2 days ago | parent [-]

I tend to find that using LLMs for interpretation and classification is often more useful for a given business task than wholesale generation.
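As a minimal sketch of the kind of thing I mean (the label set, prompt, and model name are just illustrative):

    # Sketch: using an LLM as a classifier rather than a free-form generator.
    from openai import OpenAI

    client = OpenAI()
    LABELS = ["bug_report", "feature_request", "question", "spam"]

    def classify(ticket_text: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": f"Classify the ticket. Reply with exactly one of: {', '.join(LABELS)}."},
                {"role": "user", "content": ticket_text},
            ],
            temperature=0,  # we want a stable label, not prose
            max_tokens=5,
        )
        answer = resp.choices[0].message.content.strip()
        return answer if answer in LABELS else "question"  # fall back on unparseable output

    print(classify("The export button crashes the app on large files."))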

datadrivenangel 3 days ago | parent | prev [-]

With your definition of agents as LLMs running tools in a loop, do you have high hopes for multi-tool agents being feasible from a security perspective? Seems like they'll need to be locked down.

backflippinbozo a day ago | parent | next [-]

No doubt, this toy demo will break your system if the research repo's code is run unsecured.

We thought this through as we built a system that goes beyond running the quickstart to implement the core-methods of arXiv papers as draft PRs for YOUR target repo.

Running the quickstart in a sandbox is practically useless.

To limit the attack surface we added PR#1929 to AG2 so we could pass API keys to the DockerCommandLineCodeExecutor and use egress whitelisting to limit the ability of an agent to reach out to a compromised server: https://github.com/ag2ai/ag2/pull/1929

Been talking publicly about this for at least a month before this publication, and along the way we've built up nearly 1K Docker images for arXiv paper code: https://hub.docker.com/u/remyxai

We're close to seeing these images linked to the arXiv papers after PR#908 is merged: https://github.com/arXiv/arxiv-browse/pull/908

And we're actually doing a technical deep-dive with the AG2 team on our work tomorrow at 9am PST: https://calendar.app.google/3soCpuHupRr96UaF8

simonw 3 days ago | parent | prev | next [-]

I think the rule still applies that you should consider any tools as being under the control of anyone who manages to sneak instructions into your context.

Which is a pretty big limitation in terms of things you can safely use them for!

backflippinbozo 20 hours ago | parent [-]

We built agents to test github repo quickstarts associated with arXiv papers a couple months before this paper was published, wrote about it publicly here: https://remyxai.substack.com/p/self-healing-repos

We've been pushing it farther to implement draft PRs in your target repo, published a month before this preprint: https://remyxai.substack.com/p/paperswithprs

To limit the attack surface we added PR#1929 to AG2 so we could pass API keys to the DockerCommandLineCodeExecutor but also use egress whitelisting to block the ability of an agent to reach a compromised server: https://github.com/ag2ai/ag2/pull/1929

Since then, we've been scaling this with k8s ray workers so we can run this in the cloud to build for the hundreds of papers published daily.

With the code running in Docker, the network interface constrained, deployment in the cloud, and humans ultimately kept in the loop through PR review, it's hard to see where a prompt-injection attack comes into play from testing the code.
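The sandbox layer, stripped to its essentials, looks roughly like this (a sketch with the docker Python SDK; the image name is hypothetical, and the egress whitelist via a custom network is omitted):

    # Sketch of the no-network sandbox only; image name is a placeholder.
    import docker

    client = docker.from_env()
    logs = client.containers.run(
        image="remyxai/paper-quickstart:example",  # hypothetical pre-built paper image
        command="python quickstart.py",
        network_mode="none",  # no egress at all; a whitelist would use a custom network
        environment={},       # no API keys or other secrets inside the container
        mem_limit="2g",
        remove=True,          # clean up the container after it exits
    )
    print(logs.decode())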

Would love to get feedback from an expert on this, can you imagine an attack scenario, Simon?

I'll need to work out a check for the case where someone creates a paper with code instructing my agent to publish keys to a public HF repo for others to exfiltrate.

eric-burel 3 days ago | parent | prev [-]

That's a problem being discussed in the industry. Currently, LLM frameworks don't give enough structure when it comes to agent authorization, sadly. But it will come.

backflippinbozo a day ago | parent | prev | next [-]

Looks just like our post: https://news.ycombinator.com/item?id=45132898

Which we're presenting on tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8

The difference is that we put this out months ago: https://x.com/smellslikeml/status/1958495101153357835

About to get PR #908 merged so anyone can use one of the nearly 1K Docker images we've already built: https://github.com/arXiv/arxiv-browse/pull/908

We've been publishing about this all Summer on Substack and Reddit: https://www.reddit.com/r/LocalLLaMA/comments/1loj134/arxiv2d...

hereme888 3 days ago | parent | prev | next [-]

I tried it a few weeks ago. I wasn't very impressed with the resulting code compared to manually working with an LLM and an uploaded research paper, which takes less time and costs less.

SafeDusk 2 days ago | parent | prev | next [-]

Agents enabled me to quickly implement and test out papers by treating them as LLM workflows with composable tools instead of direct implementation.

I have a blog post detailing this [1].

[1]: https://blog.toolkami.com/alphaevolve-toolkami-style/

CobrastanJorji 3 days ago | parent | prev | next [-]

So that I understand, is the idea that you point this tool at a GitHub repository, it figures out how to install and run it (figures out the build environment, installs any dependencies, configures the app, etc), plus it figures out how to interact with it, and then you send it queries via a chatbot?

Does it take only the repository as input, or does it also consume the paper itself?

backflippinbozo a day ago | parent [-]

It's a toy version of a product we've been building.

We go beyond testing the quickstart to implement the core-methods from arXiv papers as draft PRs for your target repo.

Posted a while ago: https://news.ycombinator.com/item?id=45132898

Feel free to join us for the technical deep-dive tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8

vessenes 3 days ago | parent | prev | next [-]

Notable in that this is research out of the genomics lab at Stanford - it's likely that an ML practitioner could do better with a more hands-on approach - but demonstrating some end-to-end work on genomics implementations, as they do in the paper, is pretty cool. Seems helpful.

backflippinbozo a day ago | parent [-]

Yeah, you might be interested in our post & demo video: https://news.ycombinator.com/item?id=45132898

Which we're presenting on tomorrow at 9am PST https://calendar.app.google/3soCpuHupRr96UaF8

lkey 3 days ago | parent | prev | next [-]

Science is a collaborative process that occurs between humans who already have a specific shared language for discussing their problem domain. Research papers use deliberately chosen language. This transmission is expert to expert. Inserting a generic statistical model between two experts can only have negative effects. It might be useful for a casual observer who wants an overview, but this is what the abstract already is!

IanCal 3 days ago | parent [-]

Asking for a pipeline described in a paper to be run over a new set of inputs is not a tool for a casual observer. I'm not sure what benefit there would be in making people who have expertise in the field, but not in coding, build that themselves.

a-dub 2 days ago | parent | prev | next [-]

interesting - so maybe rather than trying to agree on code and data sharing standards, ship an agent that can reformat the data or code to fit into local systems?

the_real_cher 2 days ago | parent | prev | next [-]

does this just save you from copying and pasting the paper into your LLM client?

trolleski 3 days ago | parent | prev | next [-]

Who shaves the barber then? ;)

abss 3 days ago | parent | prev | next [-]

very good direction! we have to put science in software ASAP. it is interesting to see the pushback, but there is no way we can proceed with the current approach, which ignores that we have computers to help.

lawlessone 3 days ago | parent | prev | next [-]

What if you could sit down, have a beer and shoot the shit with Research Papers?

woolion 3 days ago | parent | prev [-]

A lot of people will dismiss this with some of the usual AI complaints. I suspect they never did real research. Getting into a paper can be a really long endeavor. The notation might not be entirely self-contained, or it might be used in an alien or confusing way. Managing to get into it might finally reveal that the results in the paper are not applicable to your own work, a point that is often obscured intentionally to make it to publication.

Lowering the investment to understand a specific paper could really help focus on the most relevant results, on which you can dedicate your full resources.

Although, as of now I tend to favor approaches that only summarize rather than produce "active systems" -- with the approximate nature of LLMs, every step should be properly human reviewed. So, it's not clear what signal you can take out of such an AI approach to a paper.

Related, a few days ago: "Show HN: Asxiv.org – Ask ArXiv papers questions through chat"

https://news.ycombinator.com/item?id=45212535

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

https://news.ycombinator.com/item?id=43796419