nisten 2 days ago

This looks nice and flashy for an investor presentation, but practically I just need the thing to work off an API, or, if it's all local, to at least have vLLM support so it doesn't take 10 hours to run a bench.

The extra-long documentation and abstractions are exactly what I personally DON'T want in a benchmarking repo. I.e., what transformers version is this, will it support TGI v3, will it automatically remove thinking traces with a flag in the code or run command, will it run the latest models that need a custom transformers version, etc.

And if it's not a locally runnable product, it should at least have a publicly accessible leaderboard to submit OSS models to, or something.

Just my opinion. I don't like it. It looks like way too much docs and code slop for what should just be a three-line command.

jeffreyip 2 days ago | parent

I see. Although most users come to us for evaluating LLM applications, you're correct that DeepEval also offers academic benchmarking of foundational models, which I'm assuming is what you're referring to.

We actually designed it to work easily off any API. You just create a wrapper around your API and you're good to go. We take care of the async/concurrent handling of the benchmarking, so evaluation speed is really only limited by the rate limit of your LLM API.

This link shows what a wrapper looks like: https://docs.confident-ai.com/guides/guides-using-custom-llm...
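Roughly, the wrapper is just a small subclass of DeepEval's base model class. Here's a minimal sketch, assuming the `DeepEvalBaseLLM` interface described in that guide (the exact import path may vary by version, and `call_my_api` / `MyAPIModel` are placeholder names for your own endpoint code):

```python
import asyncio

from deepeval.models import DeepEvalBaseLLM  # base class from DeepEval's custom-model guide


def call_my_api(prompt: str) -> str:
    # Placeholder: replace with a real HTTP call to your model endpoint.
    return "model output"


class MyAPIModel(DeepEvalBaseLLM):
    """Thin wrapper that routes DeepEval's prompts to your own LLM API."""

    def load_model(self):
        # Nothing to load for a remote API; return a client handle here if you have one.
        return None

    def generate(self, prompt: str) -> str:
        # Synchronous path: forward the prompt to your API and return its text output.
        return call_my_api(prompt)

    async def a_generate(self, prompt: str) -> str:
        # Async path used for concurrent benchmark runs; delegate to the sync call.
        return await asyncio.to_thread(call_my_api, prompt)

    def get_model_name(self) -> str:
        return "my-api-model"
```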

And once you have your model wrapper set up, you can use any benchmark we provide.
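For example, a run against one of the bundled benchmarks looks roughly like this (MMLU shown as one example; the exact constructor options and result fields may differ slightly across DeepEval versions):

```python
from deepeval.benchmarks import MMLU

model = MyAPIModel()      # the wrapper sketched above
benchmark = MMLU()

# DeepEval drives the benchmark against the wrapped API and handles concurrency.
benchmark.evaluate(model=model)
print(benchmark.overall_score)
```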