Show HN: ArXiv-txt, LLM-friendly ArXiv papers (arxiv-txt.org)
20 points by jerpint 2 days ago | 9 comments

Just change arxiv.org to arxiv-txt.org in the URL to get the paper info in Markdown.

Example:

Original URL: https://arxiv.org/abs/1706.03762

Change to: https://arxiv-txt.org/abs/1706.03762

To fetch the raw text directly, use https://arxiv-txt.org/raw/abs/1706.03762. This is particularly useful for APIs and agents.
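
For scripts and agents, the URL swap described above could be sketched roughly like this in Python. The helper names `to_arxiv_txt_url` and `fetch_raw` are made up for illustration and are not part of the service; the sketch also assumes the `/raw` prefix works for any `/abs/` path the same way it does in the example, and that the response is UTF-8 text.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def to_arxiv_txt_url(url: str, raw: bool = False) -> str:
    """Rewrite an arxiv.org URL to its arxiv-txt.org equivalent.

    Hypothetical helper: it only swaps the host and, when raw=True,
    prefixes the path with /raw as in the example URL above.
    """
    parts = urlsplit(url)
    if parts.netloc not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arxiv.org URL: {url}")
    path = "/raw" + parts.path if raw else parts.path
    return urlunsplit(("https", "arxiv-txt.org", path, parts.query, ""))


def fetch_raw(url: str, timeout: float = 10.0) -> str:
    """Download the raw text for an arxiv.org URL.

    Assumption: the endpoint returns UTF-8 encoded text.
    """
    with urlopen(to_arxiv_txt_url(url, raw=True), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds the URLs; call fetch_raw() to actually hit the network.
    print(to_arxiv_txt_url("https://arxiv.org/abs/1706.03762"))
    print(to_arxiv_txt_url("https://arxiv.org/abs/1706.03762", raw=True))
```

The URL rewrite is kept separate from the download so it can be reused or tested without network access.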

lgas a day ago | parent | next [-]

It just extracts the abstracts?

jerpint a day ago | parent [-]

For now, yes - abstracts and other metadata

rrekaf 21 hours ago | parent [-]

Do you plan on adding descriptions of figures and tables?

jerpint 17 hours ago | parent [-]

I'll probably focus on getting the text out of the papers first; figures might be a good next step after that.

sbpost a day ago | parent | prev | next [-]

The example you give doesn't seem to work - the raw text does not include the authors.

jerpint 17 hours ago | parent [-]

You're right - I hadn't noticed! I've fixed it now, thanks for pointing it out.

jmartin2683 a day ago | parent | prev | next [-]

This would be awesome wrapped in an MCP server/tool call :)

jerpint a day ago | parent [-]

Whoa - I haven't played with MCP yet - might be a good first project!

westurner a day ago | parent | prev [-]

If you train an LLM on only formally verified code, it should not be expected to generate formally verified code.

Similarly, if you train an LLM on only published ScholarlyArticles ['s abstracts], it should not be expected to generate publishable or true text.

Traceability for Retraction would be necessary to prevent lossy feedback.