Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps
107 points by jeffreyip a day ago | 24 comments
Hi HN - we're Jeffrey and Kritin, and we're building Confident AI (https://confident-ai.com). This is the cloud platform for DeepEval (https://github.com/confident-ai/deepeval), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs.

We spent the past year building DeepEval with the goal of providing the best LLM evaluation developer experience, growing it to run over 600K evaluations daily in the CI/CD pipelines of enterprises like BCG, AstraZeneca, AXA, and Capgemini.

But the fact that DeepEval simply runs, and does nothing with the data afterward, isn’t the best experience. If you want to inspect failing test cases, identify regressions, or even pick the best model/prompt combination, you need more than just DeepEval. That’s why we built a platform around it.

Here’s a quick demo video of how everything works: https://youtu.be/PB3ngq7x4ko

Confident AI is great for RAG pipelines, agents, and chatbots. Typical use cases include switching the underlying LLM, rewriting prompts for newer (and possibly cheaper) models, and keeping test sets in sync with the codebase where DeepEval tests are run.

Our platform features a "dataset editor," a "regression catcher," and "iteration insights". The dataset editor allows domain experts to edit datasets while keeping them in sync with your codebase for evaluation. Once DeepEval has finished running evaluations on these datasets, which are pulled from the cloud, we generate shareable LLM testing/benchmark reports. The regression catcher then identifies any regressions in your new implementation, and we use these evaluation results to determine the best iteration based on your metric scores.

Our goal is to make benchmarking LLM applications so reliable that picking the best implementation is as simple as reading the metric values off the dashboard. To achieve this, the quality of curated datasets and the accuracy and reliability of metrics must be the highest possible. This brings us to our current limitations.

Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

To address this, we recently released a DAG (Directed Acyclic Graph) metric in DeepEval. It is a decision-tree-based, LLM-as-a-judge metric that provides deterministic results by breaking a test case into finer atomic units. Each edge represents a decision, each node represents an LLM evaluation step, and each leaf node returns a score. It works best in scenarios where success criteria are clearly defined, such as text summarization.

The DAG metric is still in its early stages, but our hope is that by moving towards better, code-driven, open-source metrics, Confident AI can deliver deterministic LLM benchmarks that anyone can blindly trust.

We hope you’ll give Confident AI a try. Quickstart here: https://docs.confident-ai.com/confident-ai/confident-ai-intr...

The platform runs on a freemium tier, and we've dropped the requirement to sign up with a work email for the next four days.

Looking forward to your thoughts!
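
For readers who want a concrete picture of the decision-tree idea described above, here is a minimal sketch in plain Python. It does not use DeepEval's actual API: `Leaf`, `JudgeNode`, `evaluate`, and the keyword-matching `stub_judge` are hypothetical names made up for illustration, and the stub stands in for what would be an LLM call in a real setup.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# A judge answers one narrow question about an output with a short label.
# Assumption: in a real system this would be an LLM call; here it is any callable.
Judge = Callable[[str, str], str]


@dataclass
class Leaf:
    """Terminal node: returns a fixed score."""
    score: float


@dataclass
class JudgeNode:
    """Non-leaf node: asks one question, then follows the edge matching the verdict."""
    question: str
    edges: Dict[str, Union["JudgeNode", Leaf]]


def evaluate(node: Union[JudgeNode, Leaf], output: str, judge: Judge) -> float:
    """Walk from the root to a leaf, asking the judge one question per node."""
    while isinstance(node, JudgeNode):
        verdict = judge(node.question, output)
        if verdict not in node.edges:
            raise ValueError(f"unexpected verdict {verdict!r} for {node.question!r}")
        node = node.edges[verdict]
    return node.score


def stub_judge(question: str, output: str) -> str:
    """Hypothetical stand-in for an LLM judge: looks for the quoted keyword."""
    keyword = question.split("'")[1]
    return "yes" if keyword.lower() in output.lower() else "no"


# A two-step tree for a summarization check (the criteria are illustrative only).
summary_metric = JudgeNode(
    question="Does the summary contain a 'conclusion' section?",
    edges={
        "no": Leaf(0.0),
        "yes": JudgeNode(
            question="Does the summary contain an 'intro' section?",
            edges={"no": Leaf(0.5), "yes": Leaf(1.0)},
        ),
    },
)
```

Calling `evaluate(summary_metric, some_summary, stub_judge)` walks the tree and returns the score of the leaf it reaches. In this sketch the structure and the scores are fixed in code, so the only variable part is each per-node verdict, which is a much narrower question than "rate this summary from 0 to 1".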

codelion 7 hours ago
The DAG feature for subjective metrics sounds really promising. I've been struggling with the same "good email" problem. Most of the existing benchmarks are too rigid for nuanced evaluations like that. Looking forward to seeing how that part of DeepEval evolves.

nisten a day ago
This looks nice and flashy for an investor presentation, but practically I just need the thing to work off of an API or, if it is all local, to at least have vLLM support so it doesn't take 10 hours to run a bench. The extra-long documentation and abstractions are, for me personally, exactly what I DON'T want to have in a benchmarking repo. I.e. what transformers version is this, will it support TGI v3, will it automatically remove thinking traces with a flag in the code or the run command, will it run the latest models that need a custom transformers version, etc. And if it's not a locally runnable product, it should at least have a publicly accessible leaderboard to submit OSS models to or something. Just my opinion. I don't like it. It looks like way too much docs and code slop for what should just be a 3-line command.

llm_trw 20 hours ago
> This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

Have you done any work on dynamic data generation? I've found that even taking a public benchmark and remixing the order of questions had a deep impact on model performance - ranging from catastrophic for tiny models to problematic for larger models once you get past their effective internal working memory.
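
A rough way to measure the reordering effect described in this comment, sketched in plain Python. Everything here is an assumption for illustration: `shuffled_orders` and `order_sensitivity` are hypothetical helpers, and `run_benchmark` is a placeholder for whatever function feeds an ordered list of questions to a model and returns a score.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

Question = Tuple[str, str]  # (prompt, expected answer)


def shuffled_orders(
    questions: Sequence[Question], n_orders: int, seed: int = 0
) -> List[List[Question]]:
    """Return n_orders reproducible permutations of the same question set."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    orders = []
    for _ in range(n_orders):
        order = list(questions)
        rng.shuffle(order)
        orders.append(order)
    return orders


def order_sensitivity(
    questions: Sequence[Question],
    run_benchmark: Callable[[List[Question]], float],
    n_orders: int = 5,
    seed: int = 0,
) -> Tuple[float, float]:
    """Score each permutation and return (mean score, population std dev)."""
    scores = [
        run_benchmark(order)
        for order in shuffled_orders(questions, n_orders, seed)
    ]
    return mean(scores), pstdev(scores)
```

A standard deviation near zero would suggest the model is insensitive to question order on that benchmark, while a large one would be a sign of the effect described above.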

stereobit 11 hours ago
DAG sounds interesting. Might help me solve my biggest challenge with evals right now, which is testing subjective metrics, e.g. “is this a good email”.

tracyhenry a day ago
This looks great. I would love to know more about what makes Confident AI/DeepEval special compared to the tons of other LLM eval tools out there.

jchiu220 21 hours ago
This is an awesome tool! Been using it since day 1 and will keep using it. Would recommend it to anyone looking for an LLM eval tool.

TeeWEE a day ago
Was also looking at Langfuse.ai or braintrust.dev. Can anybody with experience give me a tip on the best way to:
- evaluate
- manage prompts
- trace calls

fullstackchris a day ago
Congrats guys! Back in the spring of last year I did an initial spike investigating tools that could evaluate the accuracy of responses in our RAG queries where I work. We used your services (tests and test dashboard) as a little demo.

avipeltz 20 hours ago
this is sick, all star founders making big moves ;)

calebkaiser a day ago
<deleted>