jakewins 2 days ago

I’m deeply sceptical. Every time a major announcement comes out saying so-and-so model is now a triple Ph.D programming triathlon winner, I try using it. Every time it’s the same - super fast code generation, until suddenly staggering hallucinations.

If anything the quality has gotten worse, because the models are now so good at lying when they don't know that it's really hard to review. Is this a safe way to make that syscall? Is the lock structuring here really deadlock safe? The model will tell you with complete confidence that its code is perfect, and it'll either be right or lying; it never says "I don't know".

Every time OpenAI or Anthropic or Google announce a "stratospheric leap forward" and I go back and try it and find it's the same, I become more convinced that the lying is structural somehow, that the architecture they have is fundamentally unable to capture "I need to solve the problem I'm being asked to solve" instead of "I need to produce tokens that are likely to come after these other tokens".

The tool is incredible, I use it constantly, but only for things where truth is irrelevant, or where I can easily verify the answer. So far I have found programming, other than trivial tasks and greenfield "write some code that does x", much faster without LLMs.

NotOscarWilde 2 days ago | parent | next [-]

> Is the lock structuring here really deadlock safe? The model will tell you with complete confidence its code is perfect

Fully agree; in fact, this literally happened to me a week ago -- ChatGPT was confidently incorrect about its simple lock structure for my multithreaded C++ program, and wrote paragraphs upon paragraphs about how it works, until I pressed it twice about a (real) possibility of some operations deadlocking, and then it folded.
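For illustration, here is a minimal sketch of the shape of bug it kept waving away (not my actual code; the mutex and function names are made up): two threads acquiring the same pair of mutexes in opposite order, which can deadlock if they interleave.

    #include <mutex>
    #include <thread>

    std::mutex a, b;

    void worker1() {
        std::lock_guard<std::mutex> la(a);  // takes a first...
        std::lock_guard<std::mutex> lb(b);  // ...then b
    }

    void worker2() {
        std::lock_guard<std::mutex> lb(b);  // takes b first...
        std::lock_guard<std::mutex> la(a);  // ...then a: can deadlock against worker1
    }

    int main() {
        std::thread t1(worker1), t2(worker2);
        t1.join();
        t2.join();
    }

A consistent lock order (or std::scoped_lock(a, b), which acquires both without deadlock) fixes this shape of bug; the model simply asserted its ordering was fine instead of checking it.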

> Every time a major announcement comes out saying so-and-so model is now a triple Ph.D programming triathlon winner, I try using it. Every time it’s the same - super fast code generation, until suddenly staggering hallucinations.

As a university assistant professor trying to keep up with AI while doing research/teaching as before, this also happens to me, and I am dismayed by it. I am certain there are models out there that can solve IMO problems and generate research-grade papers, but the ones I can get easy access to as a customer routinely mess up stuff, including:

* Adding extra simplifications to a given combinatorial optimization problem, so that its dynamic programming approach works.

* Claiming some inequality is true when, on reflection, it had derived A >= B from A <= C and C <= B.
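Spelled out, with A, B and C as in the model's own claim, the only valid inference from those two premises is transitivity:

    A <= C and C <= B  =>  A <= B

so A >= B does not follow at all; it would hold only in the degenerate case A = B.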

(This is all ChatGPT 5, thinking mode.)

You could fairly counterclaim that I need to get more funding (tough) or invest much more of my time and energy to get access to models closer to what Terence Tao and other top people trying to apply AI in CS theory are currently using. But at least the models cheap enough for me to access as a private person are not on par with what the same companies claim to achieve.

empiricus 2 days ago | parent | prev [-]

I agree that the current models are far from perfect. But I am curious how you see the future. Do you really think/feel they will stop here?

jakewins 2 days ago | parent [-]

I mean, I'm just some guy, but in my mind:

- They are not making progress, currently. The elephant-in-the-room problem of hallucinations is exactly the same as it was 3 years ago or, as I said above, worse

- It's clearly possible to solve this, since we humans exist and our brains don't have this problem

There are then two possible paths: either the hallucinations are fundamental to the current architecture of LLMs, and there's some other aspect of the human brain's configuration that they've yet to replicate, or the hallucinations will go away with better and more training.

The latter seems to be the bet everyone is making; that's why all these data centers are being built, right? So the bet is that larger-scale training will solve the problem, and that there's enough training data, silicon and electricity on earth to perform training at that scale.

There are 86B neurons in the human brain. Each one is a stand-alone living cell, something like a biological microcontroller. It has constantly mutating state and memory: short-term through the presence or absence of RNA and proteins, long-term through chromatin formation enabling and disabling parts of its own DNA over time, and in theory permanently through DNA rewriting via transposable elements (TEs). Each one also has a vast array of input modes: direct electrical stimulation, chemical signalling through a wide array of signalling molecules, and electrical field effects from adjacent cells.

Meanwhile, GPT-4 reportedly has around 1.1T parameters. No billions of interacting microcontrollers, just a fixed network topology described by static floating-point weights.

The complexity of the neural networks that run our minds is spectacularly higher than that of the simulated neural networks we're training on silicon.

That's my personal bet. I think 86B interconnected, stateful microcontrollers are so much more capable than 1T static floating-point weights, and the 1T static weights are already nearly impossibly expensive to run. So I'm bearish, but of course I don't actually know. We will see. For now all I can conclude is that the frontier model developers lie incessantly in every press release, just like their LLMs.

xmcqdpt2 a day ago | parent | next [-]

The complexity of actual biological neural networks became clear to me when I learned about the different types of neurons.

https://en.wikipedia.org/wiki/Neural_oscillation

There are clock neurons, ADC-like neurons that transform the analog intensity of a signal into counts of digital spikes, neurons that integrate signals over time, neurons that synchronize with one another, and so on. Transformer models have none of this.
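As a toy illustration of just one of those behaviours (my own sketch, not anything from the link above): a leaky integrate-and-fire neuron, which accumulates an analog input over time, leaks charge each step, and converts it into discrete spikes -- roughly the "ADC" behaviour mentioned above.

    #include <cstdio>

    // Toy leaky integrate-and-fire neuron: internal potential persists between
    // inputs, leaks a little each step, and a discrete spike fires on threshold.
    struct LifNeuron {
        double v = 0.0;          // membrane potential (running state)
        double leak = 0.9;       // fraction of potential retained per step
        double threshold = 1.0;  // spike threshold

        bool step(double input) {
            v = v * leak + input;
            if (v >= threshold) {
                v = 0.0;         // reset after firing
                return true;     // spike: the "digital" output
            }
            return false;
        }
    };

    int main() {
        LifNeuron n;
        double inputs[] = {0.2, 0.2, 0.5, 0.4, 0.1, 0.6, 0.0, 0.9};
        int t = 0;
        for (double x : inputs) {
            bool spike = n.step(x);
            std::printf("t=%d input=%.1f spike=%d\n", t++, x, spike ? 1 : 0);
        }
    }

Even this toy keeps mutable state across inputs, which is the property being pointed at; real neurons layer many more mechanisms on top of that.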

empiricus 2 days ago | parent | prev [-]

Thanks, that's a reasonable argument. Some critique: based on this argument it is very surprising that LLMs work so well, or at all. The fact that even small LLMs do something useful suggests that the human substrate is quite inefficient for thinking. Compared to LLMs, it seems to me that 1. some humans are more aware of what they know, and 2. humans have very tight feedback loops to regulate and correct themselves. So I imagine we do not need much more scaling, just slightly better AI architectures. I guess we will see how it goes.