michaelbuckbee 2 hours ago
I built a simple (free) eval tool for my own use (GitHub Gists + model outputs) after not being able to find a suitable one on the market. The market seems to be splitting into three camps:

1. Longitudinal LLM observability tooling

Most eval startups have gone down the route of being an observability platform for LLM inference. They want to sit in your stack and run the inference so they can collect data on how it performs. They track things like how often a model returns JSON that's out of spec or values that aren't expected, plus general timing and cost info.

2. Safety limiting / pentesting

Say you're doing something in the medical field, or something that's sensitive in some other way, and you want to figure out which model gives the best outputs for your task without going off the rails.

3. Simple cost + performance + quality swapping

This is what my tool does. It basically lets you test whether you _really_ need to be running that frontier model in a loop across a million records, or whether you'd be better off with an older model or something else.

Example eval: https://giyd8stidy.evvl.io
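To make the "JSON that's out of spec" metric from point 1 concrete, here is a rough sketch of how such a rate could be computed over a batch of raw model outputs. It is only an illustration: the function name `out_of_spec_rate` and the "required keys" notion of a spec are assumptions made for the example, not how any particular observability platform (or the tool linked above) works.

```python
import json


def out_of_spec_rate(outputs, required_keys):
    """Return the fraction of raw outputs that are NOT JSON objects
    containing every key in required_keys (a hypothetical, minimal spec)."""
    if not outputs:
        return 0.0
    bad = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError):
            # Not parseable as JSON at all.
            bad += 1
            continue
        # Parseable, but either not an object or missing required keys.
        if not isinstance(parsed, dict) or not set(required_keys).issubset(parsed):
            bad += 1
    return bad / len(outputs)


# Made-up sample outputs, purely for illustration.
sample_outputs = [
    '{"label": "spam", "score": 0.9}',
    '{"label": "ham"}',
    "not json",
    "[1, 2]",
]
rate = out_of_spec_rate(sample_outputs, {"label", "score"})
```

Running the same check over outputs from two different models on the same prompts gives a crude version of the comparison in point 3; a real setup would more likely validate against a full schema and record latency and cost alongside it.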
jimmypk 15 minutes ago

[flagged]