Oh wow - I recently tried 3 Pro preview and it was too slow for me.

After reading your comment I ran my product benchmark against 2.5 Flash, 2.5 Pro and 3.0 Flash.

The results are better AND the response times have stayed the same. What an insane gain, especially considering the price compared to 2.5 Pro. I'm about to get much better results for 1/3rd of the price. Not sure what magic Google did here, but I would love to read a more technical deep dive comparing what they do differently in the Pro and Flash models to achieve such performance.

Also wondering, how did you get early access? I use the Gemini API quite a lot and have a pretty nice internal benchmark suite for it, so I would love to toy with the new ones as they come out.


lancekey 5 days ago | parent | next [-]

Curious to learn what a “product benchmark” looks like. Is it evals you use to test prompts/models? A third party tool?

Examples from the wild are a great learning tool, anything you’re able to share is appreciated.


thecupisblue 4 days ago | parent | next [-]

It's an internal benchmark that I use to test prompts, models and prompt-tunes. It's nothing but a dashboard that calls our internal endpoints and shows the data, basically going through the prod flow.

For my product, I run a video through a multimodal LLM with multiple steps, combine data and spit out the outputs + score for the video.

I have a dataset of videos that I manually marked for my use case, so when a new model drops, I run it + the last few best benchmarked models through the process, and check multiple things:

- Diff between the outputted score and the manual one
- Processing time for each step
- Input/Output tokens
- Request time for each step
- Price of request

And the classic stats: average score delta, average time, p50, p90, etc. One fun part is finding the edge cases: even if the average score delta is low (meaning it's spot-on), there are usually some videos where the absolute delta is higher, and these usually indicate niche edge cases the model might have.
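For anyone who wants to try the same idea, here is a minimal Python sketch of the stats described above. It is not the actual dashboard code: the `Run` record, the `summarize` function and the `edge_threshold` parameter are all hypothetical names, and the threshold value is an arbitrary assumption. It needs Python 3.9+ and at least two runs, since `statistics.quantiles` cannot interpolate from a single data point on older versions.

```python
from dataclasses import dataclass
from statistics import mean, quantiles


@dataclass
class Run:
    video_id: str        # hypothetical identifier for a labelled video
    manual_score: float  # hand-marked reference score
    model_score: float   # score produced by the model pipeline
    seconds: float       # total processing time for this video


def summarize(runs: list[Run], edge_threshold: float = 2.0) -> dict:
    """Aggregate score deltas and timings, and flag likely edge cases."""
    deltas = [r.model_score - r.manual_score for r in runs]
    times = [r.seconds for r in runs]

    # quantiles(n=10) returns the 9 decile cut points:
    # index 4 is the median (p50), index 8 is p90.
    deciles = quantiles(times, n=10, method="inclusive")

    # Videos whose absolute delta exceeds the (assumed) threshold, worst first.
    outliers = sorted(
        (r for r in runs if abs(r.model_score - r.manual_score) > edge_threshold),
        key=lambda r: abs(r.model_score - r.manual_score),
        reverse=True,
    )

    return {
        "avg_delta": mean(deltas),
        "avg_abs_delta": mean(abs(d) for d in deltas),
        "avg_seconds": mean(times),
        "p50_seconds": deciles[4],
        "p90_seconds": deciles[8],
        "edge_cases": [r.video_id for r in outliers],
    }
```

Reporting the average absolute delta next to the signed average is a deliberate choice here: positive and negative errors cancel out in the signed average, so a model can look spot-on while still missing badly on individual videos.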

Gemini 3 Flash nails it, sometimes even better than the Pro version, with nearly the same times as 2.5 Pro on that use case. Actually, I pushed it to prod yesterday and, looking at the data, it seems to be 5 seconds faster than Pro on average, with my cost-per-user going down from 20 cents to 12 cents.

IMO it's pretty rudimentary, so let me know if there's anything else I can explain.


theshrike79 4 days ago | parent | prev [-]

Everyone should have their own "pelican riding a bicycle" benchmark they test new models on.

And it shouldn't be shared publicly so that the models won't learn about it accidentally :)


bluecalm 3 days ago | parent | next [-]

I ask the models to generate an image where fictional characters play chess or Texas Hold'em. None of them can make a realistic chess position or poker game. Something is always off, like too many pawns or too many cards, or some cards being face-up when they shouldn't be.


ggsp 4 days ago | parent | prev [-]

Any suggestions for a simple tool to set up your own local evals?


dimava 4 days ago | parent | next [-]

Just ask an LLM to write one on top of OpenRouter, AI SDK and Bun, to take your .md input file and save the outputs as .md files (or whatever you need). Take https://github.com/T3-Content/auto-draftify as an example.
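To make the shape of such a tool concrete, here is a rough sketch of the file-in, file-out loop. It is written in Python rather than Bun, and it is not taken from auto-draftify. `run_evals` and its parameters are hypothetical names, and `call_model` is a placeholder you supply yourself (for example a thin wrapper around whatever API client you use), so no particular provider API is assumed.

```python
from pathlib import Path
from typing import Callable


def run_evals(
    prompt_dir: Path,
    output_dir: Path,
    models: list[str],
    call_model: Callable[[str, str], str],
) -> list[Path]:
    """Send every .md prompt to every model and save each reply as its own .md file.

    call_model(model, prompt) is a caller-supplied function that returns the
    model's reply as text; this sketch does not implement any API client.
    """
    written: list[Path] = []
    for prompt_path in sorted(prompt_dir.glob("*.md")):
        prompt = prompt_path.read_text(encoding="utf-8")
        for model in models:
            reply = call_model(model, prompt)
            # Model ids often contain "/", which cannot appear in a directory name.
            target_dir = output_dir / model.replace("/", "_")
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / prompt_path.name
            target.write_text(reply, encoding="utf-8")
            written.append(target)
    return written
```

Writing one output directory per model keeps the replies to the same prompt easy to diff side by side when a new model comes out.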


theshrike79 4 days ago | parent | prev | next [-]

My "tool" is just prompts saved in a text file that I feed to new models by hand. I haven't built a bespoke framework on top of it.

...yet. Crap, do I need to now? =)

ggsp 4 days ago | parent | next [-]

Yeah I’ve wondered about the same myself… My evals are also a pile of text snippets, as are some of my workflows. Thought I’d have a look to see what’s out there and found Promptfoo and Inspect AI. Haven’t tried either but will for my next round of evals.

kedihacker 4 days ago | parent | prev | next [-]

Well, you need to stop them from getting incorporated into its training data.

lobsterthief 4 days ago | parent | prev [-]

_Brain backlog project #77 created_


m00dy 4 days ago | parent | prev [-]

May I ask about your internal benchmark? I'm building a new set of benchmarks and a testing suite for agentic workflows using deepwalker [0]. How do you design your benchmark suite? It would be really cool if you could give more details.

[0] https://deepwalker.xyz


thecupisblue 4 days ago | parent [-]

Shared a bit more here - https://news.ycombinator.com/item?id=46314047.

But it's pretty rudimentary, nothing special. Also, I did not know about deepwalker, it looks quite interesting. Are you building it?

m00dy 3 days ago | parent [-]

I personally know the team who builds the product.