theshrike79 4 days ago

Everyone should have their own "pelican riding a bicycle" benchmark they test new models on.

And it shouldn't be shared publicly so that the models won't learn about it accidentally :)

bluecalm 3 days ago | parent | next [-]

I ask the models to generate an image where fictional characters play chess or Texas Hold'em. None of them can produce a realistic chess position or poker game. Something is always off, like too many pawns, too many cards, or cards face-up when they shouldn't be.

ggsp 4 days ago | parent | prev [-]

Any suggestions for a simple tool to set up your own local evals?

dimava 4 days ago | parent | next [-]

Just ask an LLM to write one on top of OpenRouter, the AI SDK, and Bun: take your .md input file and save the outputs as .md files (or whatever you need). Take https://github.com/T3-Content/auto-draftify as an example.
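
A minimal sketch of what that harness could look like, assuming the Vercel AI SDK with the @openrouter/ai-sdk-provider package and Bun's file APIs; the model IDs, the prompts.md layout (one prompt per "## " section), and the output paths are just placeholders:

```ts
// eval.ts -- run with: bun run eval.ts
// Reads prompts from a markdown file, runs each against a few models
// via OpenRouter, and saves every output as its own .md file.
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { generateText } from "ai";

const openrouter = createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY });

// Placeholder model IDs -- swap in whatever you want to compare.
const models = ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"];

// Assumed input format: one prompt per "## " section of prompts.md.
const input = await Bun.file("prompts.md").text();
const prompts = input.split(/^## /m).filter((p) => p.trim());

for (const model of models) {
  for (const [i, prompt] of prompts.entries()) {
    const { text } = await generateText({ model: openrouter(model), prompt });
    const out = `# ${model} / prompt ${i + 1}\n\n${text}\n`;
    await Bun.write(`out/${model.replaceAll("/", "_")}-${i + 1}.md`, out);
  }
}
```

From there you can diff the output files across models by hand, which is roughly what the auto-draftify repo linked above automates.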

theshrike79 4 days ago | parent | prev | next [-]

My "tool" is just prompts saved in a text file that I feed to new models by hand. I haven't built a bespoke framework on top of it.

...yet. Crap, do I need to now? =)

ggsp 4 days ago | parent | next [-]

Yeah, I've wondered the same thing myself… My evals are also a pile of text snippets, as are some of my workflows. I thought I'd have a look at what's out there and found Promptfoo and Inspect AI. I haven't tried either yet, but I will for my next round of evals.

kedihacker 4 days ago | parent | prev | next [-]

Well, you need to stop them from getting incorporated into the models' training data.

lobsterthief 4 days ago | parent | prev [-]

_Brain backlog project #77 created_
