Launch HN: Pulse (YC S24) – Production-grade unstructured document extraction
39 points by sidmanchkanti21 4 days ago | 40 comments

Hi HN, we’re Sid and Ritvik, co-founders of Pulse (https://www.runpulse.com/). Pulse is a document extraction system that creates LLM-ready text using hybrid VLM + OCR models.

Here’s a demo video: https://video.runpulse.com/video/pulse-platform-walkthrough-.... Later in this post, you’ll find links to before-and-after examples on particularly tricky cases. Check those out to see what Pulse can really do!

Modern vision language models are great at producing plausible text, but that makes them risky for OCR and data ingestion. Plausibility isn’t good enough when you need accuracy.

When we started working on document extraction, we assumed the same thing many teams do: foundation models are improving quickly, multi-modal systems appear to read documents well, what’s not to like? And indeed, for small or clean inputs, those assumptions mostly give good results.

However, limitations show up once you begin processing real documents in volume. Long PDFs, dense tables, mixed layouts, low-fidelity scans, and financial or operational data expose errors that are subtle, hard to detect, and expensive to correct. Outputs look reasonable even though they contain small but important mistakes, especially in tables and numeric fields.

Running into those challenges got us working. We ran controlled evaluations on complex documents, fine-tuned vision models, and built labeled datasets where ground truth actually matters. There have been many nights where our team stayed up hand-annotating pages, drawing bounding boxes around tables, labeling charts point by point, or debating whether a number was unreadable or simply poorly scanned. That process shaped our intuition far more than benchmarks.

One thing became clear quickly. The core challenge is not extraction itself, but confidence. Vision language models embed document images into high-dimensional representations optimized for semantic understanding, not precise transcription. That process is inherently lossy. When uncertainty appears, models tend to resolve it using learned priors instead of surfacing ambiguity. This behavior can be helpful in consumer settings. In production pipelines, it creates verification problems that do not scale well.

Pulse grew out of our attempt to address this gap through system design rather than prompting alone. Instead of treating document understanding as a single generative step, our system separates layout analysis from language modeling. Documents are normalized into structured representations that preserve hierarchy and tables before schema mapping occurs. Extraction is constrained by schemas defined ahead of time, and extracted values are tied back to source locations so uncertainty can be inspected rather than guessed away.

In practice, this results in a hybrid approach that combines traditional computer vision techniques, layout models, and vision language models, because no single approach handles these cases reliably on its own.

We are intentionally sharing a few documents that reflect the types of inputs that motivated this work. These are representative of cases where we saw generic OCR or VLM-based pipelines struggle.

Here is a financial 10-K: https://platform.runpulse.com/dashboard/examples/example1
Here is a newspaper: https://platform.runpulse.com/dashboard/examples/example2
Here is a rent roll: https://platform.runpulse.com/dashboard/examples/example3

Pulse is not perfect, particularly on highly degraded scans or uncommon handwriting, and we’re working on improvements. However, our goal is not to eliminate errors entirely, but to make them visible, auditable, and easier to reason about.

Pulse is available via usage-based access to the API and platform. You can sign up to try it at https://platform.runpulse.com/login. API docs are at https://docs.runpulse.com/introduction.

We’d love to hear how others here evaluate correctness for document extraction, which failure modes you have seen in practice, and what signals you rely on to decide whether an output can be trusted.

We will be around to answer questions and are happy to run additional documents if people want to share examples. Put links in the comments and we’ll plug them in and get back to you.

Looking forward to your comments!

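To make the design described in the post concrete, here is a minimal sketch of extraction that is constrained by a schema defined ahead of time, with every value tied back to its source location. It is not Pulse's implementation or API. `Span`, `Extracted`, `SCHEMA`, `extract`, and the `min_confidence` threshold are hypothetical names chosen for illustration, and the (label, value) pairs are assumed to come from an upstream layout step.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Recognized text plus where it came from (hypothetical structure)."""
    text: str
    page: int
    bbox: tuple        # (x0, y0, x1, y1) in page coordinates
    confidence: float  # recognizer confidence in [0, 1]


@dataclass
class Extracted:
    name: str
    value: object      # validated text, or None if it failed the schema
    source: Span       # provenance: lets a reviewer jump to the page region
    needs_review: bool


# Schema fixed ahead of time: field name -> (label to find, pattern the value must match).
SCHEMA = {
    "total_revenue": ("Total revenue", re.compile(r"^\$?[\d,]+(\.\d+)?$")),
    "fiscal_year": ("Fiscal year", re.compile(r"^\d{4}$")),
}


def extract(rows, schema=SCHEMA, min_confidence=0.9):
    """rows: list of (label_span, value_span) pairs from an assumed layout step."""
    results = []
    for name, (label, pattern) in schema.items():
        for label_span, value_span in rows:
            if label_span.text.strip().lower() != label.lower():
                continue
            text = value_span.text.strip()
            valid = bool(pattern.match(text))
            results.append(Extracted(
                name=name,
                # Never "repair" a value that fails validation; surface it instead.
                value=text if valid else None,
                source=value_span,
                needs_review=(not valid) or value_span.confidence < min_confidence,
            ))
    return results


rows = [
    (Span("Total revenue", 1, (10, 40, 90, 52), 0.99),
     Span("$1,234.5", 1, (300, 40, 360, 52), 0.95)),
    (Span("Fiscal year", 1, (10, 60, 80, 72), 0.99),
     Span("2O24", 1, (300, 60, 330, 72), 0.97)),  # letter O misread as a digit
]
results = extract(rows)
```

In this sketch the misread year `2O24` fails the schema pattern, so it is returned with `value=None` and `needs_review=True` alongside its page and bounding box, rather than being silently corrected to a plausible number.
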
dang 4 days ago
> happy to run additional documents if people want to share examples

I've got one! The PDF of this out-of-print book is terrible: https://archive.org/details/oneononeconversa0000simo. The text is unreadably faint, and the underlying text layer is full of errors, so copy-paste is almost useless. Can your software extract usable text?

(I'll email you a copy of the PDF for convenience, since the Internet Archive's copy is behind their notorious lending wall.)

bambax 3 days ago
OCR is fascinating; I did some experiments on OCR for an ancient French book that made it to HN last year: https://news.ycombinator.com/item?id=42443022

I found that at the time no LLM was able to properly organize the text and understand the footnote structure, but non-AI OCR works very well, and restructuring (with some manual input) is largely feasible. I would be interested in what you can do with those footnotes (including, for good measure, footnotes-within-footnotes).

Regarding feeding text to LLMs, it seems they are often able to make sense of text when the layout follows the original, which means the OCR phase doesn't necessarily need to properly understand the structure of the source: rendering the text in a proper layout can be sufficient.

I worked on setting up a service that would do just that, but in the end didn't go live with it. Here's the examples page to show what I mean: https://preview.adgent.com/#examples

This approach is very straightforward and rarely fails.

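As an illustration of the "render the text in a proper layout" idea, here is a minimal sketch that places OCR words on a fixed-width character grid so that columns and indentation survive as plain text. It is not the code behind the service linked above. `render_layout`, its parameters, and the `(text, x, y)` word format are assumptions made for the example.

```python
def render_layout(words, page_width, cols=80, line_height=12):
    """Place OCR words on a character grid so the original layout is kept.

    words: iterable of (text, x, y), where x and y are the top-left corner
    of the word in page units (assumed input format).
    """
    lines = {}
    for text, x, y in words:
        row = round(y / line_height)        # snap to the nearest text line
        col = int(x / page_width * cols)    # scale x to a character column
        lines.setdefault(row, []).append((col, text))

    out = []
    for row in sorted(lines):               # rows with no words are skipped
        line = ""
        for col, text in sorted(lines[row]):
            # Pad up to the target column; keep at least one space between words.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + text
        out.append(line)
    return "\n".join(out)


words = [
    ("Item", 0, 0), ("Price", 300, 0),
    ("Tea", 0, 12), ("4.50", 300, 12),
]
rendered = render_layout(words, page_width=600, cols=40)
```

With these inputs both `Price` and `4.50` start at character column 20, so the two-column structure is visible to an LLM without any explicit table detection.
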
think4coffee 4 days ago
Congrats on the launch! You mention that you're SOTA on benchmarks. Can you share your research, or share which benchmark you used?

lajr 4 days ago
Hey, congratulations on the launch.

Just noticed a discrepancy in the financial 10-K example: near the start there is a section with four options: Large accelerated filer, Non-accelerated filer, Accelerated filer, or Smaller reporting company. In the PDF, "Large accelerated filer" is checked, but in the Markdown, "Non-accelerated filer" is checked.

Ishirv 4 days ago
Super interesting stuff. I’m a fan and have been a Pulse customer for a while. However, I’ve found it has trouble with things that need intelligence, such as quotation marks used to mean "repeat the previous line". Is that something you’re working on, or is that not the right use case for Pulse?

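For the ditto-mark case, one option is to resolve the marks in a post-processing pass over the extracted table instead of inside the extraction model. The following is a minimal sketch under that assumption. It is not a Pulse feature; `DITTO_MARKS` and `resolve_dittos` are hypothetical names, and the set of marks is illustrative rather than complete.

```python
# Strings treated as "same as above" (illustrative, not exhaustive).
DITTO_MARKS = {'"', "''", "\u3003", "\u201d", "do.", "ditto"}


def resolve_dittos(rows):
    """Replace ditto marks with the value from the same column in the row above.

    rows: list of lists of cell strings, in reading order.
    """
    resolved = []
    for row in rows:
        new_row = []
        for i, cell in enumerate(row):
            is_ditto = cell.strip().lower() in DITTO_MARKS
            # Only fill when a previous row exists and has this column.
            if is_ditto and resolved and i < len(resolved[-1]):
                # Use the already resolved row so chains of dittos work.
                new_row.append(resolved[-1][i])
            else:
                new_row.append(cell)
        resolved.append(new_row)
    return resolved


table = [
    ["Smith", "Apt 1", "$900"],
    ['"', "Apt 2", '"'],
    ['"', "Apt 3", "$950"],
]
filled = resolve_dittos(table)
```

Because each row is filled from the already resolved row above it, a run of consecutive ditto marks inherits the last real value. A mark in the first row is left unchanged, since there is nothing to copy from.
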
scottydelta 4 days ago
AI models will eventually do this natively. This is one of the ways for models to continue to get better: by doing better OCR and better context extraction.

I am already seeing this trend in the recent releases of the native models (such as Opus 4.5, Gemini 3, and especially Gemini 3 Flash).

It's only going to get better from here.

Another thing to note: if I remember correctly, there are over 5 startups in the YC portfolio right now doing the same thing and going after a similar or overlapping target market.

TZubiri 2 days ago
How does it handle tables with invisible lines and inconsistent justification? (For example, one centered column and one right-justified column.)

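For context on why this case is hard, a common generic technique for borderless tables is to look for vertical gutters: horizontal ranges that no word covers in any row. This works regardless of how each column is justified, as long as a gap between columns exists. The sketch below illustrates that technique only and says nothing about how Pulse handles it. `column_gutters` and `min_gap` are hypothetical names.

```python
def column_gutters(boxes, min_gap=5.0):
    """Find empty horizontal ranges between columns of a borderless table.

    boxes: list of (x0, x1) horizontal extents of every word in the table,
    pooled across all rows (assumed input format).
    Returns a list of (start, end) ranges that no word overlaps.
    """
    gutters = []
    covered_until = None
    for x0, x1 in sorted(boxes):
        # A word starting well past everything seen so far marks a gutter.
        if covered_until is not None and x0 - covered_until >= min_gap:
            gutters.append((covered_until, x0))
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return gutters


boxes = [
    (40, 60), (30, 70), (45, 55),        # centered column around x = 50
    (180, 200), (150, 200), (190, 200),  # right-justified column ending at x = 200
]
gutters = column_gutters(boxes)
```

Here the centered and right-justified columns share no common left edge, yet the single gutter from x = 70 to x = 150 still separates them.
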
aryan1silver 4 days ago
Looks really cool, congrats on the launch! Are you guys using something similar to docling (https://github.com/docling-project/docling)?

throw03172019 4 days ago
Congrats on the launch! We have been using this for a new feature we are building in our SaaS app. In our tests, its results were better than Datalab's, especially in the handwriting category.

DIVx0 4 days ago
Can't sign up with Gmail or "personal" email addresses? What if I want to evaluate but am not ready to be inundated with sales calls?

My 'work' email domain is one that many vendors would love to see in their CRM. I always sign up with disposables first.

I guess I should thank you for saving my time? Plenty of others in this space.

sidcool 4 days ago
Congrats on launching. Seems very interesting.

canadiantim 4 days ago
Can you increase correctness by giving examples to the model? Or by supplying the key terms or nouns to expect?

mikert89 4 days ago
AI models will do all this natively.

asdev 4 days ago
How is this different from Extend (also YC)?