Launch HN: Captain (YC W26) – Automated RAG for Files (runcaptain.com)
34 points by CMLewis 4 hours ago | 14 comments
Hi HN, we’re Lewis and Edgar, building Captain (https://runcaptain.com) to simplify unstructured data search. Captain automates the building and maintenance of file-based RAG pipelines. It indexes cloud storage like S3 and GCS, plus SaaS sources like Google Drive. There’s a quick walkthrough at https://youtu.be/EIQkwAsIPmc.

We also put up a demo site called “Ask PG’s Essays” that lets you ask questions against the corpus of pg’s essays, to get a feel for how it works: https://pg.runcaptain.com. The RAG part of this took Captain about 3 minutes to set up. Here are some sample prompts to get a feel for the experience:

“When do we do things that don't scale? When should we be more cautious?” https://pg.runcaptain.com/?q=When%20do%20we%20do%20things%20...

“Give me some advice, I'm fundraising” https://pg.runcaptain.com/?q=Give%20me%20some%20advice%2C%20...

“What are the biggest advantages of Lisp” https://pg.runcaptain.com/?q=what%20are%20the%20biggest%20ad...

A good production RAG pipeline takes substantial effort to build, especially for file workloads. You have to handle ETL or text extraction, chunking, embedding, storage, search, re-ranking, inference, and often compliance and observability – all while optimizing for latency and reliability. It’s a lot to manage.

grep works well in some cases, but for agents, semantic search performs significantly better. Cursor uses both and reports 6.5%–23.5% accuracy gains from vector search over grep (https://cursor.com/blog/semsearch).

We’ve spent the past four years scaling RAG pipelines for companies, and Edgar’s work at Purdue’s NLP lab directly informed our chunking techniques. In conversations with dozens of engineers, we repeatedly saw DIY pipelines produce inconsistent results, even after weeks of tuning. Many teams lacked clarity on which retrieval strategies best fit their data.
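To make one of the pipeline stages above concrete, here is a minimal sketch of fixed-size chunking with overlap – a common DIY baseline (production systems, Captain included, typically chunk on semantic boundaries instead; the function name and parameters here are illustrative, not Captain’s API):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Overlap between consecutive chunks reduces the chance that an
    answer is cut in half at a chunk boundary.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Even this toy version shows why tuning is hard: chunk size and overlap interact with the embedding model’s context handling, which is part of why DIY results vary so much.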
We realized that a system to provision storage and embeddings, handle indexing, and continuously update pipelines to reflect the latest search techniques could remove the need for every team to rebuild RAG themselves. That idea became Captain.

In practice, one API call indexes URLs, cloud storage buckets, directories, or individual files. Under the hood, we’re converting everything to Markdown. For this, we’ve had good results with Gemini 3 Pro for images, Reducto for complex documents, and Extend for basic OCR.

For embedding models, ‘gemini-embedding-001’ performed reasonably well at first, but we later switched to the contextualized embeddings from ‘voyage-context-3’. It produced more relevant results than even the newer Voyage 4 models because its chunk embeddings are encoded with awareness of the surrounding document context. We then applied Voyage’s ‘rerank-2.5’ as second-stage re-ranking, reducing 50 initial chunks to a final top 15 (configurable in Captain’s API).

Dense embeddings are only half the picture; full-text search, fused via reciprocal rank fusion (RRF), completes our hybrid retrieval. In the Captain API, these techniques are exposed through a single /query endpoint. Access controls can be configured via metadata filters, and page-number citations are returned automatically. The stack is constantly changing, but the Captain API provides a standard interface over it.

You can try Captain free for one month and build your own pipelines at https://runcaptain.com. We’re looking for candid feedback, especially anything that would make it more useful, and look forward to your comments!
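For readers unfamiliar with RRF: it merges the dense (vector) ranking and the full-text ranking by scoring each document as the sum of 1/(k + rank) across the lists it appears in. A minimal sketch (the function and document IDs are illustrative, not Captain’s implementation; k=60 is the constant from the original RRF paper):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: merge several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked highly by multiple retrievers float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a dense (vector) ranking with a full-text (e.g. BM25) ranking.
dense = ["doc_a", "doc_b", "doc_c"]
fulltext = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([dense, fulltext])  # doc_b wins: ranked well in both lists
```

Because RRF works on ranks rather than raw scores, it needs no score normalization between the two retrievers, which is why it’s a popular default for hybrid search.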
ttamslam 4 minutes ago
Congrats on the launch! For Captain Odyssey - how (if at all) do you position yourself with respect to insider trading / material non-public info? Is it all above board because it's private-company only (although the docs say "Private and public company profiles")? Or is there actually no liability associated with the transfer of potential MNPI, just the act of trading on it?
vg_head 3 hours ago
Looks good! I didn't get to watch the video or look at the docs in depth, but do the results trace back to the location of the answers in a document? Let's say it finds an answer in a PDF, and I'd like to know where in that PDF the citation is. Is that possible or intended?
jzig an hour ago
This is an interesting product, thanks for sharing. Can you elaborate on some of your competitors in this landscape and what you might do differently compared to each one? | ||||||||
jamiequint 3 hours ago
This is cool, like qmd as a service with real-time integrations where it matters? How do you handle more structured data like csv/xlsx/json? Would be cool if it were possible to auto-process links to markdown (e.g. youtube, podcast, arbitrary websites, etc) a la https://github.com/steipete/summarize (which can pull full text in addition to summarizing). | ||||||||
mchusma 3 hours ago
Having tried this a bit, I do really like the single API call for all of it. I also appreciate transparent pricing, but I'm not 100% sure of the scale of costs. It could be helpful to give some ballparks for each of the plans; I'm not sure exactly what I could get out of a plan. My guess, trying hard to figure it out, was that if I had about 1,000 pages of new/updated content per month, I would pay $295/month for unlimited queries on top of it. Is that roughly correct?
cleansy an hour ago
Just some unfiltered feedback after checking out the website: from what I understand this is SaaS only? So basically I'm asked to upload ALL company docs to a company that has existed for basically a minute, with some questionable SOC 2 report. SOC 2 is basically dead as a security artefact, and the data I'm asked to upload is sensitive by nature. I don't see that working.
jzig 3 hours ago
> spotty RAG :O | ||||||||
BoorishBears an hour ago
Are you writing the integrations listed there, or are you using something that manages the data connections?
maxperience an hour ago
Interesting to still see solutions being developed for RAG. We developed a solution similar to yours: automatic indexing from GDrive, SharePoint etc., and then advanced hierarchical chunking, context-header-based Markdown conversion etc. – all the tricks that were published last year while RAG was still the "new" kid in town. We finally open sourced everything as the competition from the big players (Notion AI, Google etc.) was daunting. If anyone is interested, this blog post about all the techniques we tried and what actually works is still relevant and up to date: https://bytevagabond.com/post/how-to-build-enterprise-ai-rag...