crossroadsguy 5 days ago

The more I use Gemini (paid, Pro) and ChatGPT (free), the more I think my job isn't going anywhere yet. At least not once the CxOs have all collected their cost-saving-millions bonuses and the work has to be done again.

My goodness, it just hallucinates and hallucinates. It seems these models are designed for nothing other than maintaining an aura of being useful and knowledgeable. Yeah, to my non-AI-expert human eyes that's what it looks like - these tools have been polished to project this flimsy aura, and they start acting desperate the moment their limits are used up, which happens very fast.

I have tried to use these tools for coding, and for commands for well-known CLI tools like borg, restic, jq and what not, and they can't bloody do simple things there. Within minutes they are hallucinating and then doubling down. I give them a block of text to work on, and in the next input I ask them something related to that block of text, like "give me this output in raw text; like in MD", and they give me back "Here you go: like in MD". It's ghastly.

These tools can't remember simple instructions like "shorten this text and return the output as raw md text", or "return the output in raw md text". I literally have to go back and forth 3-4 times to finally get raw md text.

I have absolutely stopped asking them for even small coding tasks. It's just horrible. Often I end up spending more time - because first I have to verify what they give me, and second I have to change/adjust what they have given me.

And then the broken tape recorder mode! Oh god!

But all this also kinda worries me - because I see these triple-digit-billion valuations and jobs getting lost left, right and centre while in my experience they act like this - so I worry that I am missing some secret sauce that others have access to, or maybe that I am not getting "the point".

energy123 5 days ago | parent | next [-]

Hallucinating all the way to gold medals in IOI and IMO?

crossroadsguy 5 days ago | parent [-]

Maybe I just need a small canoe to go from one place to another? Not a bloody aircraft carrier, if that is an aircraft carrier?

energy123 5 days ago | parent [-]

The models you're using are on the low compute end of the frontier. That's why you're getting bad results.

At the high-compute end of the frontier, by next year, systems should be better than any human at competition coding and competition math. They're basically already there now.

Play this out for another 5 years. What happens when compute becomes 4-20x more abundant and these systems keep getting better?

That's why I don't share your outlook that our jobs are safe. At least not on a 5-8 year timescale. At least not in their current form of actually writing any code by hand.

crossroadsguy 5 days ago | parent [-]

And I don’t share your implied optimism that it’s wise to look beyond 5-8 years in any geopolitical/social/economic climate, let alone in today’s.

logicprog 5 days ago | parent | prev | next [-]

I'm really confused by your experience, to be honest. I by no means believe that LLMs can reason, or that they will replace any human beings any time soon, or any of that nonsense (I think all of that is cooked up by CEOs and the C-suite to justify layoffs and devalue labor), and I'm very much on the side that's ready for the AI hype bubble to pop, but also terrified by how big it is. At the same time, I experience LLMs as infinitely more competent and useful than you seem to, to the point that it feels like we're living in different realities.

I regularly use LLMs to change the tone of passages of text, make them more concise, reformat them into bullet points, turn them into markdown, and so on. I only have to tell them once, alongside the content, and they do an admirably competent job — I've almost never (maybe once that I can recall) seen them add spurious details, which is in line with most benchmarks I've seen (https://github.com/vectara/hallucination-leaderboard). They always execute such simple text-transformation commands first time, and usually I can paste in further material for them to manipulate without explanation and they'll apply the same transformation; so, the complete opposite of your multiple-prompts-to-get-one-result experience. It's to the point where I sometimes use local LLMs as a replacement for regex, because they're so consistent and accurate at basic text transformations, and in some ways more powerful for me.
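To make that concrete, here's a rough sketch of the regex-replacement workflow (purely illustrative; it assumes Ollama as the local runner and whatever model you happen to have pulled, so swap in your own setup):

```sh
# Ask a local model (via Ollama's REST API) to reformat a blob of text
# as a raw markdown bullet list. "llama3.1" is just a placeholder model.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Rewrite the following as a raw markdown bullet list. Output only the markdown:\n\nMilk, two dozen eggs, and a loaf of bread.",
  "stream": false
}' | jq -r '.response'
```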

They're also regularly able to one-shot fairly complex jq commands for me, or even infer the jq commands I need just from reading the TypeScript schemas that describe the JSON an API endpoint will produce. I don't have to prompt multiple times or anything, and they don't hallucinate. I'm regularly able to have them one-shot simple Python programs with no hallucinations at all, that do close enough to what I want that it only takes adjusting a few constants here and there, or asking them to add a feature or two.
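For instance (the schema and endpoint here are invented for illustration), the round trip looks roughly like this:

```sh
# Hypothetical TypeScript schema describing an endpoint's JSON:
#   interface User { id: number; name: string; tags: string[]; active: boolean }
#   type UsersResponse = { users: User[] }
#
# The kind of jq one-liner I'd ask for: "names of active users, tags comma-joined"
curl -s https://api.example.com/users \
  | jq -r '.users[] | select(.active) | "\(.name): \(.tags | join(", "))"'
```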

> And then the broken tape recorder mode! Oh god!

I don't even know what you mean by this, to be honest.

I'm really not trying to play the "you're holding it wrong / use a bigger model / etc" card, but I'm genuinely confused; I see comments like yours regularly, and it makes me feel like I'm legitimately going crazy.

crossroadsguy 5 days ago | parent [-]

I have replied in another comment about the tape recorder thingie.

No, that's okay - as I said, I might be holding it wrong :) At least you engaged in a kind and detailed manner. Thank you.

More than what it can do and what it can't do - it's a lot about how easily it can do it, how reliable that is or can be, how often it frustrates you even at simple tasks, and how consistently it fails to say "I don't know this" or "I don't know this well or with certainty" - which is not only difficult but dangerous.

The other day Gemini Pro told me `--keep-yearly 1` in `borg prune` means one archive for every year. Luckily I knew better. So I grilled it and it stood its ground, until I told it (lied to it) "I lost my archives beyond 1 year because you gave an incorrect description of keep-yearly" - and bang, it says something like "Oh, my bad.. it actually means this..".
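For anyone curious, what the flag actually does: `--keep-yearly 1` keeps a single yearly archive (the last archive of the most recent year), not one archive for every year in the repo. Here's a sketch with a placeholder repo path and an assumed retention policy - a dry run shows what would be kept without deleting anything:

```sh
# Preview the prune: --dry-run deletes nothing, --list shows keep/prune decisions.
# --keep-yearly 1 retains one yearly archive in total, NOT one per year of history.
borg prune --dry-run --list \
    --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --keep-yearly 1 \
    /path/to/repo
```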

I mean one can look at it in any way one wants at the end of the day. Maybe I am not looking at the things that it can do great, or maybe I don't use it for those "big" and meaningful tasks. I was just sharing my experience really.

logicprog 5 days ago | parent [-]

Thanks for responding! I wonder if one of the differences between our experiences is that for me, if the LLM doesn't give me a correct answer (or at least something I can build on) — and fast! I just ditch it completely and do it myself. Because these things aren't worth arguing with or fiddling with, and if it isn't quick then I run out of patience :P

crossroadsguy 5 days ago | parent [-]

My experience is not quite what you indicated. I was talking about evaluating it - that's what I was discussing in my first comment. I've been seeing how it works, and my experience so far has been pretty abysmal. In my coding work (which I haven't done a lot of in the last ~1 year) I have not "moved to it" for help/assistance, and the reason is what I've mentioned in these comments: it has not been reliable at all. By "at all" I don't mean 100% unreliable of course, but not 75-95% reliable either. I mean I ask it 10 questions and it screws up too often for me to fully trust it, and it requires equal or more work from me to verify what it does - so why wouldn't I just do it myself, or verify from sources that are trustworthy? I don't really know when it's not "lying", so I am always second-guessing and spending/wasting my time trying to verify it. And how do you factually verify a large body of output that it produced as inference/summary/mix? It gets frustrating.

I'd rather try an LLM that I can throw some sources at, or point to them by some kind of ID, and ask it to summarise or give me examples based on those sources (e.g. man pages), and have it give me just that with near-100% accuracy. That would be more productive imho.

logicprog 5 days ago | parent [-]

> I'd rather try an LLM that I can throw some sources at, or point to them by some kind of ID, and ask it to summarise or give me examples based on those sources (e.g. man pages), and have it give me just that with near-100% accuracy. That would be more productive imho.

That makes sense! Maybe an LLM with web search enabled, or Perplexity, or something like AnythingLLM that lets it reference docs you provide, might be more to your taste.

PaulStatezny 5 days ago | parent | prev | next [-]

> And then the broken tape recorder mode! Oh god!

Can you elaborate? What is this referring to?

crossroadsguy 5 days ago | parent [-]

It does/says something wrong. You give it feedback and then it's a loop! Often it just doesn't get it. You supply it webpages (text-only webpages - which it can easily read, or so I hope). It says it got it, and in the next line the output is the old wrong answer again.

There are worse examples; here is one (I am "making this up" :D to give you an idea):

> To list hidden files you have to use "ls -h", you can alternatively use "ls --list".

Of course you correct it, try to reason with it, and then supply a good old man page URL, and after a few rounds it concedes - and then it gives you the answer again:

> You were correct in pointing the error out. to list the hidden files you indeed have to type "ls -h" or "ls --list"

Also - this is just really a mild example.
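(And for the record, so nobody copies the made-up flags above, the boring correct ones are:)

```sh
ls -a    # list everything, including . and ..
ls -A    # list hidden files, but skip . and ..
ls -la   # long listing that includes hidden files
```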

weitendorf 5 days ago | parent | next [-]

I suspect you are interacting with LLMs in a single, long conversation corresponding to your "session" and prompting fixes/new info/changes in direction between tasks.

This is a very natural and common way to interact with LLMs but also IMO one of the biggest avoidable causes of poor performance.

Every time you send a message to an LLM you actually send the entire conversation history. Most of the time a large portion of that information will no longer be relevant, and sometimes it will be wrong-but-corrected later, both of which are more confusing to LLMs than to us because of the way attention works. The same applies to changes in the current task/objective or instructions: the more outdated, irrelevant, or inconsistent they are, the more confused the LLM becomes.
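A rough sketch of what that looks like at the API level (using the standard chat-completions request shape; the model name and messages are placeholders) - note that the `messages` array grows every turn and is resent in full, stale corrections and all:

```sh
# Every request carries the whole conversation so far; the model has no
# memory beyond what is inside this array.
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Write a borg prune command for my backups."},
      {"role": "assistant", "content": "(earlier, possibly wrong, answer)"},
      {"role": "user", "content": "That flag is wrong, please fix it."}
    ]
  }'
```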

Also, LLMs are prone to the Purple Elephant problem (just like humans): the best way to get them to not think about purple elephants is to not mention them at all, as opposed to explicitly instructing them not to reference purple elephants. When they encounter errors, they are biased towards the assumptions/approaches they have already laid out earlier in the conversation.

I generally recommend using many short, per-task conversations to interact with LLMs, with each one having as little irrelevant/conflicting context as possible. This is especially helpful for fixing non-trivial LLM-introduced errors, because it reframes the task and eliminates the LLM's bias towards the "thinking" that caused it to introduce the bug in the first place.

logicprog 5 days ago | parent | prev [-]

Hi from the other thread :P

If you'll forgive me putting my debugging hat on for a bit (because solving problems is what most of us do here), I wonder if it's not actually reading the URL, and maybe that's the source of the problem, because I've had a lot of success feeding manuals and such to AIs and then asking them to synthesize commands or answer questions about them. Also, I just tried asking Gemini 2.5 Flash this and it did a web search, found a source, answered my question correctly (ls -a, or -la for more detail), and linked me to the precise part of the source it referenced: https://kinsta.com/blog/show-hidden-files/#:~:text=If%20you'... (this is the precise link it gave me).

crossroadsguy 5 days ago | parent [-]

Well, in one case (it was a borg or restic doc) I noticed it actually picked something correctly from the URL/page and then still messed up the answer.

My guess is that maybe it read the URL and used it for one part of that answer/output, but for the other part it relied on the training it already had. Maybe it doesn't learn "on the go". I don't know - it could be a safeguard against misinformation or spamming the model, or something like that.

As I said in my comment, I hadn't actually asked it the "ls -a" question but rather other things - different commands at different times, which I don't recall now except for the borg and restic ones, which I did recently. "ls -a" is just the example I picked to show one of the things I was "cribbing" about.

logicprog 5 days ago | parent [-]

Yeah, my bad - I was responding late at night and had a reading comprehension failure.

bongodongobob 5 days ago | parent | prev [-]

There's no way this isn't a skill issue or you are using shitty models. You can't get it to write markdown? Bullshit.

Right now, Claude is building me an AI DnD text game that uses OpenAI to DM. I'm at about 5k lines of code, about a dozen files, and it works great. I'm just tweaking things at this point.

You might want to put some time into how to use these tools. You're going to be left behind.

crossroadsguy 5 days ago | parent [-]

> You can't get it to write markdown? Bullshit.

Please f off! Just read the comment again and see whether I said I "can't get it to write MD". Or better yet, just please f off?

By the way, judging by your reading comprehension - I am not sure now who is getting left behind.

bongodongobob 4 days ago | parent [-]

That's crazy bro.