rplnt 4 days ago

> Almost all models got almost all my evaluations correct

I find this the most surprising. I have yet to cross the 50% threshold from bullshit to possible truth, in any kind of topic I use LLMs for.

simonw 4 days ago | parent | next [-]

It's useful to build up an intuition for what kind of questions LLMs can answer and what kind of questions they can't.

Once you've done that your success rate goes way up.

Aachen 4 days ago | parent | next [-]

While it's useful to not bother when you know it's unlikely to give good results, it also feels a bit like a cop-out to suggest that the user shouldn't be asking certain (unspecified) things in the first place. If this is the only solution, we should crowdsource a list of topics or types of question it gets wrong more than 50% of the time, so not everyone has to reinvent the wheel.

theshrike79 3 days ago | parent | next [-]

If you ask an LLM to count the r's in "strawberry sherbert", it's completely hit and miss.

But have it create a script or program in any language you want to do the same, and I'm 99% sure it'll get it right the first time.

People use LLMs like graphing calculators; they're not. But you can have one MAKE a calculator, and it'll get it right.
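The distinction above can be sketched in a few lines of Python; the expected count assumes the spellings "strawberry" (three r's) and "sherbert" (two r's):

```python
def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    return text.lower().count(letter.lower())

# Deterministic string counting: a script like this gets the answer right
# every time, even when the LLM itself miscounts in conversation.
print(count_letter("strawberry sherbert", "r"))  # 5
```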

mierz00 4 days ago | parent | prev [-]

It’s not that simple.

I’m making a tool that analyses financial transactions for accountants and identifies things like misallocated expenses. Initially I was getting an LLM to analyse hundreds of transactions in one go. It was correct roughly 40-50% of the time, inconsistent, and hallucinated frequently.

I changed the approach to ask a simple yes/no question and to analyse each transaction individually. Now it is correct 85% of the time and very consistent.

Same model, essentially the same question, but a different way of asking it.

Aachen 3 days ago | parent [-]

I don't see why that issue couldn't be an entry on the "what not to do" or "suboptimal usage" list.

mierz00 2 hours ago | parent [-]

Edit: I see your point. That’s valid.

I’m just not so sure it’s black and white. At least in my experience it hasn’t been.

rplnt 4 days ago | parent | prev | next [-]

Oftentimes I ask simple factual questions that I don't know the answer to. This is something it should excel at, yet it usually fails, at least on the first try. I guess I subconsciously ignore questions that are extremely easy to google (if you ignore the worst AI in existence) or can be found by opening the [insert keyword] wikipedia article. You don't need AI for those.

simonw 4 days ago | parent [-]

Amusingly enough, my rule of thumb for if an LLM is likely to be able to answer a question is "could somebody who just read the relevant Wikipedia page answer this?"

Although that changed this year with o3 (and now GPT-5) getting really good at using Bing for search: https://simonwillison.net/2025/Apr/21/ai-assisted-search/

apwell23 4 days ago | parent | prev [-]

> It's useful to build up an intuition for what kind of questions LLMs can answer and what kind of questions they can't.

Can you put your intuition into words so we can learn from you?

simonw 4 days ago | parent [-]

I can't. That's my single biggest frustration about using LLMs: so much of what they can and cannot do comes down to intuition you need to build up over time, and I can't figure out how to express that intuition in a way that can quickly transfer to other people.

Workaccount2 4 days ago | parent | prev [-]

Would you be willing to share some of those chats?

rplnt 4 days ago | parent [-]

The most recent one I have was not in English. It was a translation question of a slang word between two non-English languages. It failed miserably (just made up some complete nonsense). Google had no trouble finding relevant pages or images for that word (without any extra prompt), so it was rather unique and not that obscure. Disclaimer: I'm not using any extra prompts like "don't make shit up and just tell me you don't know".

The most recent technical one I can remember (and now would be a good time to have the actual prompt) was when I asked whether MySQL has a way to run an UPDATE without waiting for locks, basically ignoring rows that are locked. It (Sonnet 4, IIRC) answered "of course" and gave me an invalid query of the form `UPDATE ... SKIP LOCKED;`
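For reference, MySQL does have SKIP LOCKED (since 8.0), but only on locking reads, not on UPDATE itself; a working pattern (a sketch with made-up table and column names) looks roughly like:

```sql
-- SKIP LOCKED is valid on locking SELECTs in MySQL 8.0+, not on UPDATE.
-- Lock the available rows first, then update them by primary key.
START TRANSACTION;

SELECT id
FROM jobs
WHERE status = 'pending'
FOR UPDATE SKIP LOCKED;

-- Feed the ids returned above into the UPDATE from application code:
UPDATE jobs
SET status = 'running'
WHERE id IN (1, 2, 3);  -- placeholder ids from the SELECT

COMMIT;
```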

I can't imagine what damage this does if people are using it for questions they don't/can't verify. Programming is relatively safe in this regard.

But as I noted in my other reply, there will be a bias on my side, as I probably disregard questions that I know how to easily find answers to. That's not something I'd applaud AI for.