Have you noticed any significant AND consistent differences between them when you switch? I frequently get a better answer from one vs the other, but it feels unpredictable. Your setup seems like a better test of this.

▲

raw_anon_1111 14 hours ago | parent | next [-]

For the most part, I don’t do chatbots except for a couple of RAG-based chatbots. It’s more behind-the-scenes stuff like image understanding, categorization, nuanced sentiment analysis, semantic alignment, etc.

I’ve created a framework that lets me test quality in an automated way across prompt changes and models, and I compare cost, speed, and quality.

Out of all of those, the only thing that requires humans to judge quality is the RAG results.
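To make the idea concrete, here is a minimal sketch of what such a comparison harness could look like. It is not the actual framework described above: the `ModelFn` interface, the stub models, the per-call costs, and the test cases are all hypothetical stand-ins, and exact-match scoring is just one simple way to grade a categorization task.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a "model" is any callable mapping a prompt to
# (answer, cost_in_usd). Real provider calls would sit behind this.
ModelFn = Callable[[str], Tuple[str, float]]


@dataclass
class Result:
    model: str
    accuracy: float       # fraction of cases answered correctly
    total_cost: float     # summed cost reported by the model callable
    avg_latency_s: float  # mean wall-clock seconds per call


def evaluate(models: Dict[str, ModelFn],
             prompt_template: str,
             cases: List[Tuple[str, str]]) -> List[Result]:
    """Run every model over every (input, expected) case and rank the results."""
    results = []
    for name, call in models.items():
        correct, cost, elapsed = 0, 0.0, 0.0
        for text, expected in cases:
            start = time.perf_counter()
            answer, call_cost = call(prompt_template.format(input=text))
            elapsed += time.perf_counter() - start
            cost += call_cost
            # Exact-match scoring after normalizing case and whitespace.
            correct += int(answer.strip().lower() == expected.lower())
        n = len(cases)
        results.append(Result(name, correct / n, cost, elapsed / n))
    # Best accuracy first; cheaper model wins ties.
    return sorted(results, key=lambda r: (-r.accuracy, r.total_cost))


# Stub models standing in for real API calls (made-up costs).
def keyword_model(prompt: str) -> Tuple[str, float]:
    return ("billing" if "invoice" in prompt.lower() else "support", 0.0001)


def constant_model(prompt: str) -> Tuple[str, float]:
    return ("support", 0.00001)


# Made-up test data; in practice this would be the customer's own test set.
CASES = [
    ("My invoice is wrong", "billing"),
    ("The app crashes on login", "support"),
    ("Where is my invoice?", "billing"),
]
PROMPT = "Categorize this call-center message: {input}"

if __name__ == "__main__":
    models = {"keyword": keyword_model, "constant": constant_model}
    for r in evaluate(models, PROMPT, CASES):
        print(r)
```

Changing `PROMPT` or adding another entry to `models` and rerunning gives a like-for-like comparison on the same test data, which is the point of automating it. Tasks without a single correct label, such as RAG answers, would still need a human (or some other grader) in place of the exact-match check.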

▲

biophysboy 14 hours ago | parent [-]

So who is the winner using the framework you created?

▲

raw_anon_1111 14 hours ago | parent [-]

It depends. Amazon’s Nova Lite gave me the best speed vs performance when I needed really quick real-time inference for categorizing a user’s input (think call centers).

One of Anthropic’s models did the best with image understanding, with Amazon’s Nova Pro slightly behind.

For my tests, I used a customer’s specific set of test data.

For RAG, I forget which did best, but that is much more subjective. I just gave the customer the ability to configure the model and modify the prompt so they could choose.

▲

biophysboy 14 hours ago | parent [-]

Your experience matches mine then... I haven't noticed any clear, consistent differences. I'm always looking for second opinions on this (bc I've gotten fairly cynical). Appreciate it

▲

kevstev 13 hours ago | parent | prev [-]

Check out https://poe.com - it does the same thing. I agree with your assessment though: while you can get better answers from some models than others, it is hard to predict in advance which model will give you the better answer.