>This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.

Have you done any work on dynamic data generation?

I've found that even taking a public benchmark and remixing the order of questions had a deep impact on model performance - ranging from catastrophic for tiny models to problematic for larger models once you get past their effective internal working memory.

▲

jeffreyip 5 months ago | parent [-]

Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?

Do do synthetic data generation for custom application use cases. Such as RAG, summarization, text-sql, etc. We call this module the "synthesizer", and you can customize your data generation pipeline however you want (I think, let me know otherwise!).

Docs for synthesizer's here: https://docs.confident-ai.com/docs/synthesizer-introduction, there's a nice "how does it work" section at the bottom explaining it more.

▲

llm_trw 5 months ago | parent [-]

>Interesting, how are you remixing the order of questions? If we're talking about an academic benchmark like MMLU, the questions are independent of one another. Unless you're generating multiple answers in one go?

Short version: if a model can answer a very high proportion of questions from a benchmark accurately then the next step is to ask it two or more questions at a time. On some models the quality of answers varies dramatically with which is asked first.

>Docs for synthesizer's here: https://docs.confident-ai.com/docs/synthesizer-introduction, there's a nice "how does it work" section at the bottom explaining it more.

Very good start, but the statistics of the generated text matter a lot.

As an example on a dumb as bricks benchmark I've designed I can saturate the reasoning capabilities of all non-reasoning models just by varying the names of objects in the questions. A model that could get a normalized score of 14 with standard object strings could get a score as high as 18 with one letter strings standing for objects and as low as zero with arbitrary utf-8 character strings - which turns out mattered a lot since all the data was polluted with international text coming from the stock exchanges.

Feel free to drop me a line if you're interested in a more in depth conversation. LLMs are _ridiculously_ under tested for how many places they show up in.

	▲	jeffreyip 5 months ago \| parent [-]
		Hey yes would definitely love to, my contact info is in my bio, please drop me an email :)