I've been making skills from arXiv papers for a while. I have one for multi-object tracking, for example. It has a SKILL.md describing all the important papers (over 30) on the subject and a folder with each paper's full content as reStructuredText.

To feed arXiv papers to LLMs, I found that RST gives the best ratio of fidelity to token count. Markdown lacks precision. LaTeX is too verbose. I have a script with each paper's URL, name and date that downloads the LaTeX source archives from arXiv, extracts them, converts them to RST and then adds them to the right folder. Then I ask an LLM to make a summary from the full text, then I give other LLMs the full paper again with the summary and ask them to improve on it and proofread it. While this goes on I read the papers myself, and at the end I read the summaries and, if I approve them, add them to the skill. For each paper I also add info on how well the algorithms described do on common benchmarks.
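
A minimal sketch of the download-and-convert step described above, not the actual script. It assumes that Pandoc is installed and on the PATH, that the source bundle is served at `https://arxiv.org/e-print/<id>` as a tar archive, and that the main file is the one containing `\documentclass`. The function names (`source_url`, `pandoc_command`, `find_main_tex`, `fetch_and_convert`) and the folder layout are hypothetical.

```python
import subprocess
import tarfile
import urllib.request
from pathlib import Path


def source_url(arxiv_id: str) -> str:
    """Return the source-bundle URL for an arXiv identifier (assumed URL scheme)."""
    return f"https://arxiv.org/e-print/{arxiv_id}"


def pandoc_command(tex_file: Path, rst_file: Path) -> list[str]:
    """Build the Pandoc invocation that converts one LaTeX file to RST."""
    return ["pandoc", "-f", "latex", "-t", "rst", str(tex_file), "-o", str(rst_file)]


def find_main_tex(source_dir: Path) -> Path:
    """Heuristic: the main file is the first .tex file containing \\documentclass."""
    for tex in sorted(source_dir.rglob("*.tex")):
        if "\\documentclass" in tex.read_text(errors="ignore"):
            return tex
    raise FileNotFoundError(f"no main .tex file under {source_dir}")


def fetch_and_convert(arxiv_id: str, name: str, skill_dir: Path) -> Path:
    """Download one paper's source, extract it, and write <skill_dir>/papers/<name>.rst."""
    work = skill_dir / "src" / name
    work.mkdir(parents=True, exist_ok=True)
    archive = work / "source.tar.gz"
    urllib.request.urlretrieve(source_url(arxiv_id), archive)

    # Single-file submissions may not be tar archives; this sketch does not handle them.
    with tarfile.open(archive) as tar:
        # filter="data" rejects unsafe members (Python 3.12+ or patched older releases).
        tar.extractall(work, filter="data")

    main_tex = find_main_tex(work)
    out = (skill_dir / "papers" / f"{name}.rst").resolve()
    out.parent.mkdir(parents=True, exist_ok=True)
    # Run Pandoc from the paper's own directory so relative \input paths resolve.
    subprocess.run(pandoc_command(Path(main_tex.name), out), cwd=main_tex.parent, check=True)
    return out
```

Calling `fetch_and_convert` once per entry in the list of papers would fill the folder. The summary and proofreading passes are left out because they depend on which LLM API is used.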

I highly recommend doing something similar if you're working in a cutting-edge domain. Also I'd like to know if anyone has recommendations to improve what I do.

paulluuk 2 hours ago | parent | next [-]

This sounds like it would work, but honestly, if you've already read all 30 papers fully, what do you still need the LLM to do for you? Just the boilerplate?

	▲	simlevesque an hour ago | parent [-]
		I'm trying to make a Go library that implements a wide range of MOT algorithms and can gather metrics for all of them. Reading all the papers once isn't the same as this. I find it very useful. I can ask an LLM to do the basic implementations, then I can refine them (make the code better and faster, cut memory use), then I can ask the LLM whether I'm still implementing the algorithms as they're described in the papers.

ctoth an hour ago | parent | prev | next [-]

I've been working on ctoth/research-papers-plugin, the pipeline to actually get LLMs to extract the notes. I really like your insight re RST over Markdown! It sounds like we're working on similar stuff and I'll absolutely reach out :)

	▲	simlevesque an hour ago | parent [-]
		I'm gonna look at your plugin. My email is in my profile. Honestly I think that Markdown with LaTeX code blocks would be the most efficient representation, but when doing it with Pandoc I kept having issues with loss of information and sometimes even syntax errors.

satvikpendem an hour ago | parent | prev | next [-]

Does that even fit in the context? It seems like 30 papers' worth of content would just overflow it.

	▲	ctoth an hour ago | parent [-]
		For each paper, have your agent extract a three-sentence description and create a description.md, then concat those with the paper names into an INDEX.md, which it should consult to find appropriate papers. Also: have your agent tag papers, then autogenerate your tagged collection on the filesystem. Then you get nice things like https://github.com/ctoth/Qlatt/tree/master/papers/tagged Then add something to your {CLAUDE,AGENTS}.md that says: "When working on something with relevant context supplied by papers, read the papers before doing the work. You can find all papers plus their descriptions in ./papers/INDEX.md and papers by tag in ./papers/tagged"
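
The index step above could be sketched roughly as follows. This is an illustration only: it assumes a layout of `papers/<paper-name>/description.md`, and the function names `build_index` and `write_index` are made up for the example. The tagging step is not covered.

```python
from pathlib import Path


def build_index(papers_dir: Path) -> str:
    """Concatenate every <paper>/description.md into one INDEX.md body."""
    entries = []
    for description in sorted(papers_dir.glob("*/description.md")):
        name = description.parent.name  # folder name doubles as the paper name
        text = description.read_text(encoding="utf-8").strip()
        entries.append(f"## {name}\n\n{text}\n")
    return "# Paper index\n\n" + "\n".join(entries)


def write_index(papers_dir: Path) -> Path:
    """Write the generated index to <papers_dir>/INDEX.md and return its path."""
    index_path = papers_dir / "INDEX.md"
    index_path.write_text(build_index(papers_dir), encoding="utf-8")
    return index_path
```

Because the index holds only a few sentences per paper, the agent can read it in full and then open just the papers that look relevant, rather than loading all of them into context.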

alex000kim 2 hours ago | parent | prev | next [-]

Sounds similar to "LLM Knowledge Bases": https://xcancel.com/karpathy/status/2039805659525644595

MrLeap 2 hours ago | parent | prev [-]

What is RST?

	▲	simlevesque 2 hours ago | parent | next [-]
		reStructuredText: https://www.sphinx-doc.org/en/master/usage/restructuredtext/...