| ▲ | barrell 7 hours ago |
| I use large language models in http://phrasing.app to format retrieved data in a consistent, skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and follows formatting instructions to the letter. I was (and still am) super super impressed. Even if it doesn't hold up in benchmarks, it has outperformed in practice. I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases, I cannot recommend Mistral enough. Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish in ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models |
|
| ▲ | druskacik 6 hours ago | parent | next [-] |
| This is my experience as well. Mistral models may not be the best according to benchmarks and I don't use them for personal chats or coding, but for simple tasks with a pre-defined scope (such as categorization, summarization, etc.) they are the option I choose. I use mistral-small with the batch API and it's probably the most cost-efficient option out there. |
|
| ▲ | mbowcut2 5 hours ago | parent | prev | next [-] |
| It makes me wonder about the gaps in evaluating LLMs by benchmarks. There is almost certainly overfitting happening, which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena, right? But then people realized that Chatbot Arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models. That seems to have been fruitful for improving coding. |
| |
| ▲ | pants2 5 hours ago | parent | next [-] | | The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job. | | |
| ▲ | airstrike 5 hours ago | parent [-] | | If you and others have any insights to share on structuring that benchmark, I'm all ears. There's a new model seemingly every week, so finding a way to evaluate them repeatedly would be nice. The answer may be that it's so bespoke you have to handroll it every time, but my gut says there's a set of best practices that are generally applicable. | | |
| ▲ | pants2 3 hours ago | parent [-] | | Generally, the easiest: 1. Sample a set of prompts / answers from historical usage. 2. Run that through various frontier models again and if they don't agree on some answers, hand-pick what you're looking for. 3. Test different models using OpenRouter and score each along cost / speed / accuracy dimensions against your test set. 4. Analyze the results and pick the best, then prompt-optimize to make it even better. Repeat as needed. |
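A minimal sketch of steps 3-4 in that workflow, assuming an OpenRouter API key and the OpenAI-compatible Python client; the model IDs, pricing table, and the naive exact-match scoring are placeholders to swap for your own test set and metrics:

```python
# Sketch only: assumes OPENROUTER_API_KEY is set and the `openai` package is installed.
# Model IDs, prices, and the accuracy check are illustrative placeholders.
import os
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# (prompt, expected answer) pairs sampled from historical usage (steps 1-2)
TEST_SET = [
    ("Classify the sentiment of: 'Great service, will return.'", "positive"),
    ("Classify the sentiment of: 'Order arrived broken.'", "negative"),
]

MODELS = ["mistralai/mistral-medium-3", "openai/gpt-5"]  # hypothetical IDs

# $ per 1M tokens (input, output) -- look these up for the models you actually test
PRICE_PER_MTOK = {"mistralai/mistral-medium-3": (0.4, 2.0), "openai/gpt-5": (1.25, 10.0)}

for model in MODELS:
    correct, cost, elapsed = 0, 0.0, 0.0
    for prompt, expected in TEST_SET:
        start = time.time()
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        elapsed += time.time() - start
        answer = resp.choices[0].message.content.strip().lower()
        correct += int(expected in answer)  # naive accuracy check; replace with your scorer
        p_in, p_out = PRICE_PER_MTOK[model]
        cost += resp.usage.prompt_tokens / 1e6 * p_in + resp.usage.completion_tokens / 1e6 * p_out
    print(f"{model}: accuracy={correct / len(TEST_SET):.0%} "
          f"cost=${cost:.4f} avg_latency={elapsed / len(TEST_SET):.1f}s")
```

Re-running the same loop after each prompt tweak covers step 4.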
|
| |
| ▲ | Legend2440 2 hours ago | parent | prev | next [-] | | I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss. The only exception I can think of is models trained on synthetic data like Phi. | |
| ▲ | 5 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | pembrook 4 hours ago | parent | prev [-] | | If the models from the big US labs are being overfit to benchmarks, then we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good). Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP, who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;) |
|
|
| ▲ | mrtksn 7 hours ago | parent | prev | next [-] |
| Some time ago I canceled all my paid subscriptions to chatbots because they are interchangeable, so I just rotate between Grok, ChatGPT, Gemini, Deepseek and Mistral. On the API side of things, my experience is that the model behaving as expected is the greatest feature. There I also switched to OpenRouter instead of paying directly so I can use whatever model fits best. The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge despite what the benchmarks say; users are noticing it and canceling paid plans. Just today OpenAI offered me a 1 month free trial, as if I wasn't using it two months ago. I guess they hope I forget to cancel. |
| |
| ▲ | barrell 6 hours ago | parent | next [-] | | Yep, I spent 3 days optimizing my prompt trying to get gpt-5 to work. Tried a bunch of different models (some on Azure, some on OpenRouter) and got a better success rate with several others without any tailoring of the prompt. It was really plug and play. There are still small nuances to each one, but compared to a year ago prompts are much more portable. | |
| ▲ | barbazoo 6 hours ago | parent | prev | next [-] | | > I guess they hope I forget to cancel. Business model of most subscription based services. | |
| ▲ | giancarlostoro 5 hours ago | parent | prev | next [-] | | Maybe give Perplexity a shot? It has Grok, ChatGPT, Gemini, and Kimi K2; I don't think it has Mistral, unfortunately. | | |
| ▲ | mrtksn 4 hours ago | parent [-] | | I actually like Perplexity but haven't used it in some time. Maybe I should give it a go :) |
| |
| ▲ | 6 hours ago | parent | prev | next [-] | | [deleted] | |
| ▲ | acuozzo 5 hours ago | parent | prev [-] | | > because they are interchangeable What is your use-case? Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics. My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary. My experience is that they're all very, very different from one another. | | |
| ▲ | mrtksn 4 hours ago | parent [-] | | My use case is Google replacement: things that I can do by myself so I can verify, and things that are not important so I don't have to verify. Sure, they produce different output, so sometimes I will run the same thing on a few different models when I'm not sure or happy, but I don't delegate the thinking part; I always give a direction in my prompts. I don't see myself running 30min queries because I will never trust the output and will have to do all the work myself. Instead I like to go step by step together. |
|
|
|
| ▲ | mentalgear 6 hours ago | parent | prev | next [-] |
| Thanks for sharing your use case of the Mistral models, which are indeed top-notch! I had a look at phrasing.app, and while it's a nice website, I found the copy of "Hand-crafted. Phrasing was designed & developed by humans, for humans." somewhat of a false virtue given your statements here about advanced LLM usage. |
| |
| ▲ | barrell 6 hours ago | parent [-] | | I don't see the contention. I do not use llms in the design, development, copywriting, marketing, blogging, or any other aspect of the crafting of the application. I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be. | | |
| ▲ | basilgohar 5 hours ago | parent [-] | | I admire and respect this stance. I have been very AI-hesitant and while I'm using it more and more, I have spaces that I want to definitely keep human-only, as this is my preference. I'm glad to hear I'm not the only one like this. | | |
| ▲ | barrell 5 hours ago | parent [-] | | Thank you :) and you're definitely not the only one. Full transparency, the first backend version of phrasing was 'vibe-coded' (long before vibe coding was a thing). I didn't like the results, I didn't like the experience, I didn't feel good ethically, and I didn't like my own development. I rewrote the application (completely, from scratch: new repo, new language, new framework) and all of a sudden I liked the results, I loved the process, I had no moral qualms, and I improved by leaps and bounds in all the areas I worked on. Automation has some amazing use cases (I am building an automation product at the end of the day), but so does doing hard things yourself. Although the most important thing is just to enjoy what you do; or perhaps to do something you can be proud of. |
|
|
|
|
| ▲ | metadat 7 hours ago | parent | prev | next [-] |
| Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral's gibberish production rate to gpt-5.1's complex task failure rate? Does Mistral even have a Tool Use model? It would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen. |
| |
| ▲ | barrell 6 hours ago | parent [-] | | Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had an above 50% timeout rate (with a 6 minute timeout, mind you), and after retrying they would still return gibberish about 15% of the time (12% on one task, 20% on another). I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral. Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with. This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases. I mean, I'd love to get that 0.1% error rate down, but there have always been more pressing issues XD | | |
| ▲ | data-ottawa 6 hours ago | parent | next [-] | | With gpt-5, did you try adjusting the reasoning level to "minimal"? I tried using it for a very small and quick summarization task that needed low latency, and any level above minimal took several seconds to get a response. Using minimal brought that down significantly. Weirdly, gpt-5's reasoning levels don't map to the OpenAI API's reasoning effort levels. | | |
| ▲ | barrell 5 hours ago | parent [-] | | Reasoning was set to minimal and low (and I think I tried medium at some point). I do not believe the timeouts were due to the reasoning taking too long, although I never streamed the results. I think the model just fails often. It stops producing tokens and eventually the request times out. |
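For reference, a minimal sketch of what the "minimal" setting mentioned above looks like at the API level, assuming the Chat Completions `reasoning_effort` parameter; the exposed values have shifted between model generations, so check the current OpenAI docs:

```python
# Sketch: assumes the `openai` package and OPENAI_API_KEY; the reasoning_effort
# values accepted for gpt-5 may differ from what is shown here.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="minimal",  # trades reasoning depth for lower latency
    messages=[{"role": "user", "content": "Summarize this in one sentence: ..."}],
)
print(resp.choices[0].message.content)
```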
| |
| ▲ | barbazoo 6 hours ago | parent | prev [-] | | Hard to gauge what gibberish is without an example of the data and what you prompted the LLM with. | | |
| ▲ | barrell 6 hours ago | parent [-] | | If you wanted examples, you needed only ask :) These are screenshots from that week: https://x.com/barrelltech/status/1995900100174880806 I'm not going to share the prompt because (1) it's very long, (2) there were dozens of variations, and (3) it seems like poor business practice to share the most indefensible part of your business online XD | | |
| ▲ | barbazoo 5 hours ago | parent | next [-] | | Surely reads like someone's brain transformed into a tree :) Impressive, I haven't seen that myself; I've only used 5 conversationally, not via the API yet. | |
| ▲ | barrell 5 hours ago | parent [-] | | Heh, it's a quote from Archer FX (and admittedly a poor machine translation; it's a very old expression of mine). And yes, this only happens when I ask it to apply my formatting rules. If you let GPT do its own formatting, I would be surprised if this ever happened. |
| |
| ▲ | sandblast 6 hours ago | parent | prev [-] | | XD XD |
|
|
|
|
|
| ▲ | acuozzo 5 hours ago | parent | prev [-] |
| I have a need to remove loose "signature" lines from the last 10% of a tremendous e-mail dataset. Based on your experience, how do you think mistral-3-medium-0525 would do? |
| |
| ▲ | barrell 5 hours ago | parent [-] | | What's your acceptable error rate? Honestly, Ministral would probably be sufficient if you can tolerate a small failure rate. I feel like Medium would be overkill. But I'm no expert; I can't say I've used Mistral much outside of my own domain. | |
| ▲ | acuozzo 4 hours ago | parent [-] | | I'd prefer for the error rate to be as close to 0% as possible under the strict requirement of having to use a local model. I have access to nodes with 8xH200, but I'd prefer to not tie those up with this task. I'd, instead, prefer to use a model I can run on an M2 Ultra. | | |
| ▲ | barrell 3 hours ago | parent [-] | | If I cannot tolerate a failure rate, I do not use LLMs (or any ML models). But in that case, the larger the better. If Mistral Medium can run on your M2 Ultra then it should be up to the task. It should edge out Ministral and be just shy of the biggest frontier models. But I wouldn't even trust GPT-5 or Claude Opus or Gemini 3 Pro to get close to a zero percent error rate, and for a task such as this I would not expect Mistral Medium to outperform the big boys. |
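For the signature-stripping task above, a rough sketch of how the last-10% pass could be framed against a local OpenAI-compatible server (e.g. Ollama or llama.cpp serving a Mistral-family model); the endpoint, model name, and the per-line yes/no framing are assumptions, and the output would still need spot-checking given the error-rate caveats above:

```python
# Sketch only: assumes a local OpenAI-compatible server (Ollama's default
# endpoint shown) serving some Mistral-family model; names are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def strip_signature(email_body: str, tail_fraction: float = 0.1) -> str:
    """Drop lines in the last `tail_fraction` of the email that look like a signature."""
    lines = email_body.splitlines()
    cut = int(len(lines) * (1 - tail_fraction))
    head, tail = lines[:cut], lines[cut:]
    kept = []
    for line in tail:
        resp = client.chat.completions.create(
            model="mistral-small",  # placeholder local model name
            messages=[{
                "role": "user",
                "content": ("Is this email line part of a signature block "
                            "(name, title, phone number, sign-off)? "
                            "Answer only yes or no.\n\n" + line),
            }],
        )
        if resp.choices[0].message.content.strip().lower().startswith("yes"):
            continue  # drop suspected signature lines
        kept.append(line)
    return "\n".join(head + kept)
```

Classifying the whole tail in a single call (and returning only the lines to keep) would cut the request count considerably on a dataset that size.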
|
|
|