Launch HN: RunRL (YC X25) – Reinforcement learning as a service (runrl.com)
71 points by ag8 5 days ago | 24 comments
Hey HN, we’re Andrew and Derik at RunRL (https://runrl.com/). We've built a platform to improve models and agents with reinforcement learning. If you can define a metric, we'll make your model or agent better, without you having to think about managing GPU clusters. Here's a demo video: https://youtu.be/EtiBjs4jfCg

I (Andrew) was doing a PhD in reinforcement learning on language models, and everyone kept...not using RL because it was too hard to get running. At some point I realized that someone's got to sit down and actually write a good platform for running RL experiments. Once this happened, people started using it for antiviral design, formal verification, browser agents, and a bunch of other cool applications, so we decided to make a startup out of it.

How it works:

- Choose an open-weight base model (weights are necessary for RL updates; Qwen3-4B-Instruct-2507 is a good starting point)
- Upload a set of initial prompts ("Generate an antiviral targeting Sars-CoV-2 protease", "Prove this theorem", "What's the average summer high in Windhoek?")
- Define a reward function, using Python, an LLM-as-a-judge, or both (see the sketch at the end of this post)
- For complex settings, you can define an entire multi-turn environment
- Watch the reward go up!

For most well-defined problems, a small open model + RunRL outperforms frontier models. (For instance, we've seen Qwen-3B do better than Claude 4.1 Opus on antiviral design.) This is because LLM intelligence is notoriously "spiky": models are often decent-but-not-great at common-sense knowledge, randomly good at a few domains, and mistake-prone on lots of other tasks. RunRL creates spikes precisely on the tasks where you need them.

Pricing: $80/node-hour. Most models up to 14B parameters fit on one node (0.6-1.2 TB of VRAM). We do full fine-tuning, at the cost of parameter efficiency (with RL, people seem to care a lot about the last few percent gains in e.g. agent reliability).

Next up: continuous learning and tool use. Tool use is currently in private beta, which you can join here: https://forms.gle/D2mSmeQDVCDraPQg8

We'd love to hear any thoughts, questions, or positive or negative reinforcement!
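A minimal sketch of what the "define a reward function in Python" step could look like. The function signature RunRL expects is an assumption here, not something stated in the post, and the scoring logic is just a toy for a "Prove this theorem"-style prompt:

    # Hypothetical sketch only: the exact signature RunRL expects is assumed,
    # not documented in this post.
    import re

    def reward(prompt: str, completion: str) -> float:
        """Toy reward: prefer completions that end with a clearly marked QED,
        with a mild length penalty to discourage rambling."""
        score = 0.0
        if re.search(r"\bQED\b|\\qed\b", completion):
            score += 1.0
        score -= 0.001 * len(completion.split())
        return max(score, 0.0)

The point of the sketch is just that any metric you can compute over a completion in plain Python can serve as the reward signal.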
ripbozo 5 days ago
Was excited to see something about reinforcement learning as I'm working on training an agent to play a game, but apparently all reinforcement learning nowadays is for LLMs.
3s 5 days ago
This is really neat! Didn’t realize it could be this simple to run RL on models. Quick question: how would I specify the reward function for tool use? Or is this something you automatically do for me when I specify the available tools and their uses?
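Not an answer from the team, but for illustration: one way a hand-written Python reward could score tool calls today, before the private-beta tool-use support arrives. The "search(...)" call format and the scoring tiers are invented conventions for this sketch, not anything RunRL prescribes:

    # Illustration only, not RunRL's tool-use API.
    import json
    import re

    def reward(prompt: str, completion: str) -> float:
        # Expect the model to emit something like: search({"query": "..."})
        match = re.search(r"search\((\{.*?\})\)", completion, re.DOTALL)
        if not match:
            return 0.0  # no tool call emitted at all
        try:
            args = json.loads(match.group(1))
        except json.JSONDecodeError:
            return 0.3  # right tool, malformed arguments
        return 1.0 if args.get("query") else 0.5  # full credit only with a query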
papadiamantis9 4 days ago
Very neat!

A) If I want to have a different grading rubric per example (and grade with an LLM as a judge), do I do this through the reward function?

B) What's the pricing on the deployed API? (Is it per token?)
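Again not an answer from the team, but one plausible shape for (A), assuming each uploaded example can carry a per-example "rubric" field that gets passed into the reward function, and that a judge model is reachable via the OpenAI Python client; both assumptions go beyond what the post states:

    # Hypothetical sketch: per-example rubric passed as metadata, graded by an
    # LLM judge. The metadata-passing behavior is an assumption about RunRL.
    from openai import OpenAI

    client = OpenAI()  # judge model endpoint

    def reward(prompt: str, completion: str, metadata: dict) -> float:
        rubric = metadata.get("rubric", "Grade for factual accuracy.")
        judge_prompt = (
            f"Rubric:\n{rubric}\n\n"
            f"Response to grade:\n{completion}\n\n"
            "Reply with a single integer score from 0 to 10."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": judge_prompt}],
        )
        try:
            return float(resp.choices[0].message.content.strip()) / 10.0
        except ValueError:
            return 0.0  # unparseable judge output earns nothing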
nextworddev 5 days ago
Is there any credence to the view that these startups are basically DSPy wrappers?
namibj 4 days ago
I'd love to see something that can RL an agent (of sorts) that interacts with an interactive theorem prover (like Lean4, Coq, or Isabelle/HOL), probably via a harness rather than plain shell-like interaction, and actively exploits the fact that exploration itself costs nothing beyond the inference and oracle cost of investigating an abandoned branch. I.e., it's not at all like a typical game, because at no point does "success rate without relying on rollback/savestate-reloading" actually matter. An agent that spends evenly on abandoned (exploratory) branches and on the path that becomes part of the solution the formal verifier confirms, while having a near-100% solve rate on the problems fed to it, is a VERY GOOD agent.

That's because this task, unlike most RL tasks, is one where the agent can use exploration to log an interaction trace that can be trivially, mechanically trimmed to a verifiable proof of the provided problem. The hard part is finding ANY path that solves it, without spending exponential amounts of compute to brute-force the problem over the bounded state size of practical relevance; that would take longer than the heat death of the universe, i.e., it's impractical even in theory. Most RL tasks want an agent that is particularly good at its task; and while the effort spent to find a proof certainly matters (if only because lower cost means the agent can train on more instances with the same training budget), it's much less relevant than the solve rate itself: the fraction of problems for which any verifiably correct proof sequence can be found at some definable level of effort, expressed as e.g. number of shots, total compute budget for the instance, or the ratio of exploration nodes to nodes that end up in the final proof sequence.

Non-benchmark usage would mostly entail semi-synthetic, crowd-sourced datasets: open sub-instances from practical applications of formal verification, plus more synthetic instances derived from very coarse high-level questions (mechanically broken down into more manageable chunks before the RL agent gets to work), like "given these more specific rules of what is _actually_ UB versus what is only UB in ANSI but defined in the specific toolchain we use: does that C program over there contain ANY UB?" or "is there ANY way that input on that file/network socket could get that program to execute arbitrary code?". So there'd be no economic incentive to solve any given instance more than once, beyond what's necessary to keep the RL training process itself stable.

The task also lends itself to semi-online learning, since every supplied instance essentially pays once for a verified solution, and the overall process should deliver solid ROI. Running a single GPU cluster/pile for both training and inference would allow higher utilization, at the cost of some variable amount of latency between rolling out an episode and training on the completed episode's oracle-verified rewards.
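A minimal sketch of the reward shape this comment argues for: verifier acceptance dominates, and exploration cost acts only as a weak tiebreaker. trim_to_proof and the verifier callable are hypothetical stand-ins for a real Lean4/Coq/Isabelle harness, not real library calls:

    # Sketch of the reward structure described above; helpers are placeholders.
    from typing import Callable, List

    def trim_to_proof(trace: List[str]) -> List[str]:
        """Mechanically drop abandoned branches, keeping only steps on the final
        path. The '# abandoned:' tag is an invented convention for this sketch."""
        return [step for step in trace if not step.startswith("# abandoned:")]

    def proof_reward(trace: List[str],
                     verifier_accepts: Callable[[List[str]], bool]) -> float:
        proof = trim_to_proof(trace)
        if not verifier_accepts(proof):
            return 0.0  # solve rate is everything; a near-miss earns nothing
        exploration_ratio = len(trace) / max(len(proof), 1)
        return 1.0 - 0.01 * min(exploration_ratio, 10.0)  # effort is a weak tiebreaker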