logicprog 5 days ago

I'm really confused by your experience, to be honest. I by no means believe that LLMs can reason, or that they'll replace any human beings any time soon, or any of that nonsense (I think all of that is cooked up by CEOs and the C-suite to justify layoffs and devalue labor). I'm very much on the side that's ready for the AI hype bubble to pop, though I'm also terrified by how big it is. But at the same time, I experience LLMs as infinitely more competent and useful than you seem to, to the point that it feels like we're living in different realities.

I regularly use LLMs to change the tone of passages of text, make them more concise, reformat them into bullet points, turn them into markdown, and so on. I only have to tell them once, alongside the content, and they do an admirably competent job. I've almost never (maybe once that I can recall) seen them add spurious details, which is in line with most benchmarks I've seen (https://github.com/vectara/hallucination-leaderboard). They execute simple text-transformation commands like that on the first try, and usually I can paste in further material without any explanation and they'll apply the same transformation, so it's the complete opposite of your multiple-prompts-to-get-one-result experience. It's to the point where I sometimes use local LLMs as a replacement for regex, because they're so consistent and accurate at basic text transformations, and in some ways more powerful for me.
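To make the regex-replacement bit concrete, here's a minimal sketch of how I'd wire that up, assuming an OpenAI-compatible local server (Ollama here) and a placeholder model name; swap in whatever endpoint and model you actually run.

```python
# Minimal sketch of using a local LLM for a one-off text transformation.
# Assumes an OpenAI-compatible server (e.g. Ollama) at localhost:11434 and a
# model called "llama3.1" -- both are placeholders for your own local setup.
import requests

def transform(text: str) -> str:
    """Ask the local model to reformat prose into markdown bullet points."""
    resp = requests.post(
        "http://localhost:11434/v1/chat/completions",
        json={
            "model": "llama3.1",
            "messages": [
                {"role": "system",
                 "content": "Rewrite the user's text as concise markdown bullet "
                            "points. Do not add, remove, or invent any facts."},
                {"role": "user", "content": text},
            ],
            "temperature": 0,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(transform("The meeting moved to Thursday. Alice now owns the rollout. QA starts Friday."))
```

Pinning temperature to 0 and putting the "don't add or invent facts" constraint in the system prompt is most of what keeps transformations like this repeatable.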

They're also regularly able to one-shot fairly complex jq commands for me, or even infer the jq commands I need just from reading the TypeScript schemas that describe the JSON an API endpoint produces. I don't have to prompt multiple times, and they don't hallucinate. I'm regularly able to have them one-shot simple Python programs, with no hallucinations at all, that come close enough to what I want that it only takes adjusting a few constants here and there, or asking them to add a feature or two.
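Just to show the shape of that jq task: the TypeScript interface and the filter below are invented for illustration (and running this needs jq on your PATH), but they're representative of what I mean by inferring a filter from a schema.

```python
# Hypothetical illustration of the jq-from-TypeScript workflow. Imagine the API
# response is documented with a schema like:
#
#   interface User { id: number; name: string; teams: { slug: string }[] }
#   type UsersResponse = { data: User[] }
#
# and you ask for "each user's name plus their team slugs".
import json
import subprocess

# The kind of filter you'd want back for that request (made up for this example):
FILTER = '.data[] | {name, teams: [.teams[].slug]}'

sample = {"data": [{"id": 1, "name": "Ada", "teams": [{"slug": "infra"}, {"slug": "ml"}]}]}

# Pipe the sample JSON through jq with compact output.
result = subprocess.run(
    ["jq", "-c", FILTER],
    input=json.dumps(sample),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # -> {"name":"Ada","teams":["infra","ml"]}
```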

> And then the broken tape recorder mode! Oh god!

I don't even know what you mean by this, to be honest.

I'm really not trying to play the "you're holding it wrong / use a bigger model / etc." card, but I'm genuinely confused; I see comments like yours regularly, and they make me feel like I'm legitimately going crazy.

crossroadsguy 5 days ago | parent

I have replied in another comment about the tape recorder thingie.

No, that's okay - as I said, I might be holding it wrong :) At least you engaged kindly and in detail in your comment. Thank you.

More than what it can and can't do, it's a lot about how easily it can do it, how reliable that is or can be, how often it frustrates you even at simple tasks, and how consistently it fails to say "I don't know this" or "I don't know this well or with certainty", which is not only difficult to deal with but dangerous.

The other day Gemini Pro told me that `--keep-yearly 1` in `borg prune` means one archive for every year. Luckily I knew better: it actually keeps just one yearly archive in total, not one per year. So I grilled it, and it stood its ground until I told it (lied to it) "I lost my archives beyond 1 year because you gave an incorrect description of keep-yearly", and bang, it said something like "Oh, my bad.. it actually means this..".

I mean, one can look at it however one wants at the end of the day. Maybe I'm not looking at the things it does great, or maybe I just don't use it for those "big" and meaningful tasks. I was just sharing my experience, really.

logicprog 5 days ago | parent

Thanks for responding! I wonder if one of the differences between our experiences is that, for me, if the LLM doesn't give me a correct answer (or at least something I can build on), and fast, I just ditch it completely and do it myself. These things aren't worth arguing with or fiddling with, and if it isn't quick I run out of patience :P

crossroadsguy 5 days ago | parent

My experience is not what you indicated. I was talking about evaluating it; that's what I was discussing in my first comment. I've been seeing how it works, and my experience so far has been pretty abysmal. In my coding work (which I haven't done a lot of over the last ~1 year) I have not "moved to it" for help/assistance, and the reason is what I've mentioned in these comments: it has not been reliable at all. By "at all" I don't mean 100% unreliable, of course, but not 75-95% reliable either. I mean, I ask it 10 questions and it screws up often enough that I can't fully trust it, and it takes equal or more work from me to verify what it does, so why wouldn't I just do it myself, or check sources that are trustworthy? I don't really know when it's not "lying", so I'm always second-guessing it and spending/wasting my time trying to verify it. And how do you factually verify a large body of output that it produced as inference, summary, or some mix of the two? It gets frustrating.

I'd rather try an LLM that I can throw some sources at, or point to them by some kind of ID, and ask it to summarise or give me examples based on those sources (e.g. man pages), and have it give me just that with near-100% accuracy. That would be more productive imho.

logicprog 5 days ago | parent

> I'd rather try an LLM that I can throw some sources at, or point to them by some kind of ID, and ask it to summarise or give me examples based on those sources (e.g. man pages), and have it give me just that with near-100% accuracy. That would be more productive imho.

That makes sense! Maybe an LLM with web search enabled, or Perplexity, or something like AnythingLLM that lets it reference docs you provide, would be more to your taste.
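For the man-pages case specifically, even without a dedicated tool, pasting the doc into the context and telling the model to answer only from it gets you most of the way there. Here's a rough sketch of that pattern, reusing the same placeholder local endpoint/model as in my earlier snippet, and assuming borg's man pages are installed:

```python
# Sketch of "ground the answer in a source you provide": shove the man page into
# the prompt and constrain the model to it. The endpoint and model name are
# placeholders for whatever local (or hosted) model you actually use.
import subprocess
import requests

# Grab the doc to ground on (assumes borg's man pages are installed).
manpage = subprocess.run(["man", "borg-prune"], capture_output=True, text=True).stdout

prompt = (
    "Answer strictly from the documentation below. "
    "If the answer is not in it, say that you don't know.\n\n"
    "--- DOCUMENTATION ---\n" + manpage + "\n--- END ---\n\n"
    "Question: what exactly does --keep-yearly 1 keep?"
)

resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3.1",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Tools like AnythingLLM or Perplexity essentially automate that retrieve-then-constrain step for you.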