| ▲ | mycentstoo 4 days ago |
| I believe choosing a well-known problem space in a well-known language certainly influenced a lot of the behavior. An AI's usefulness is strongly correlated with its training data, and there has no doubt been a significant amount of data about both this problem space and Python. I'd love to see how this compares when either the problem space or the language/ecosystem is different. It was a great read regardless! |
|
| ▲ | dazzawazza 3 days ago | parent | next [-] |
| I think you are correct. I work in game dev, where almost all code is in C/C++ (with some in Python and C#). LLMs are nothing more than rubber ducking in game dev. The code they generate is occasionally useful as a starting point, or it lightens the mood because it's so bad you get a laugh. Beyond that it's broadly useless. I put this down to the relatively small number of people who work in game dev, resulting in a relatively small number of blogs from which to "learn" game dev. Game dev is a conservative industry with a lot of magic sauce hidden inside companies for VERY good reasons. |
|
| ▲ | Lerc 4 days ago | parent | prev | next [-] |
| One of my test queries for AI models is to ask for an 8-bit asm function to do something invented recently enough that there is unlikely to be an existing implementation: multiplying two 24-bit posits on 8-bit AVR, for instance. No model has succeeded yet, usually because they try to put more than 8 bits into a register. Algorithmically they seem to be on the right track, but they can't hold onto the idea that registers are only 8 bits wide through the entirety of their response. |
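To make the register constraint concrete, here is a rough sketch of the limb-by-limb discipline the models keep dropping (in Haskell rather than AVR assembly, with Word8 standing in for the 8-bit registers; the type and names are illustrative, not the actual test):

    import Data.Word (Word8)

    -- A 24-bit value as three 8-bit limbs, low byte first: the same
    -- shape it must take in AVR's 8-bit register file.
    type U24 = (Word8, Word8, Word8)

    -- 24-bit addition with explicit carries. Every intermediate value
    -- fits in 8 bits, which is exactly the invariant the models violate.
    add24 :: U24 -> U24 -> U24
    add24 (a0, a1, a2) (b0, b1, b2) = (s0, s1, s2)
      where
        s0 = a0 + b0                  -- wraps mod 256, like ADD
        c0 = if s0 < a0 then 1 else 0 -- carry out of the low limb
        s1 = a1 + b1 + c0             -- add with carry, like ADC
        c1 = if s1 < a1 || (c0 == 1 && s1 == a1) then 1 else 0
        s2 = a2 + b2 + c1             -- like ADC

Posit multiplication needs the same discipline, just with more partial products and carries; the failure mode described here is a model quietly widening one of these limbs past 8 bits partway through.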
| |
| ▲ | bugglebeetle 3 days ago | parent [-] | | Do you provide this context or just ask the model to one-shot the problem? | | |
| ▲ | Lerc 3 days ago | parent [-] | | A clear description of the problem, but one-shot. Something along the lines of: "Can you generate 8-bit AVR assembly code to multiply two 24-bit posit numbers?" You get some pretty funny results from the models that have no idea what a posit is; it's usually easy to tell whether they know what they're supposed to be doing. I haven't had a success yet (though I haven't tried for a while). Some have come pretty close, but usually it's trying to squeeze more than 8 bits of data into a register that brings them down. | | |
| ▲ | bugglebeetle 3 days ago | parent [-] | | Yeah, so it'd be interesting to see whether, provided the correct context and your understanding of its error pattern, it can accomplish this. One thing you learn quickly when working with LLMs is that they have these kinds of baked-in biases, some of which are very fixed and tied to their very limited ability to engage in novel reasoning (cf. François Chollet), while others are far more loosely held and correctable. If it sticks with the errant pattern even when provided the proper context, it probably isn't something an off-the-shelf model can handle. |
|
| ▲ | Insanity 4 days ago | parent | prev | next [-] |
| 100% this. I tried Haskelling with LLMs and the performance was worse than with Go. Although in fairness, this was a year ago on GPT-3.5, IIRC. |
| |
| ▲ | diggan 4 days ago | parent | next [-] | | > Although in fairness, this was a year ago on GPT-3.5, IIRC
GPT-3.5 was impressive at the time, but today's SOTA models (like GPT-5 Pro) are almost a night-and-day difference, both in producing better code for a wider range of languages (I mostly do Rust and Clojure; it handles those fine now, but was awful with 3.5) and, more importantly, in following the instructions in your user/system prompts. So it's easier to get higher-quality code from it now, as long as you can put into words what "higher quality code" means for you. | |
| ▲ | ocharles 4 days ago | parent | prev | next [-] | | I write Haskell with Claude Code and it's gotten remarkably good recently. We have some code at work that uses STM to implement what is essentially a mutable state machine. I needed to split a state transition apart, and it did an admirable job. I had to intervene once or twice when it was going down a valid but undesirable approach. This almost-one-shot performance was already a productivity boost, but the result didn't quite build. What I find most impressive now is that the "fix" here is to literally have Claude run the build and see the errors. While GHC errors are verbose and not always the best, it got everything building in a few more iterations. When it later hit a test failure, I suggested we add a bit more logging - so it logged all state transitions, spotted the unexpected transition, and got the test passing. We really are a LONG way away from 3.5 performance. | |
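For readers unfamiliar with the pattern, the shape being described is roughly the following (a minimal sketch, not the actual work code; the states and transition table are invented for illustration):

    import Control.Concurrent.STM

    -- Invented states for illustration.
    data Phase = Idle | Running | Draining | Stopped
      deriving (Eq, Show)

    -- The "mutable state machine": the current state lives in a TVar,
    -- and a transition is validated and applied in one atomic step.
    step :: TVar Phase -> Phase -> STM ()
    step var next = do
      cur <- readTVar var
      if allowed cur next
        then writeTVar var next
        else retry -- block until another transaction makes the edge legal
      where
        allowed Idle     Running  = True
        allowed Running  Draining = True
        allowed Draining Stopped  = True
        allowed _        _        = False

A caller runs atomically (step var Running); splitting a transition apart, as described above, then amounts to adding a state and an edge to the table.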
| ▲ | r_lee 4 days ago | parent | prev | next [-] | | I'm not sure I'd say "100% this" if I was talking about GPT 3.5... | | |
| ▲ | verelo 4 days ago | parent | next [-] | | Yeah, 3.5 was good when it came out, but frankly anyone reviewing AI for coding who isn't using Sonnet 4.1, GPT-5, or equivalent really isn't aware of what they've been missing out on. | |
| ▲ | Insanity 4 days ago | parent | prev [-] | | Yeah, that's a fair point. I had assumed it'd remain relatively similar, given that the training data would be smaller for languages like Haskell versus languages like Python and JavaScript. |
| |
| ▲ | danielbln 4 days ago | parent | prev | next [-] | | Post-training in all the frontier models has improved significantly with respect to programming-language support. Take Elixir, which LLMs could barely handle a year ago, but now support has gotten really good. | |
| ▲ | computerex 4 days ago | parent | prev | next [-] | | 3.5 was a joke in coding compared to Sonnet 4. | | |
| ▲ | Insanity 4 days ago | parent | next [-] | | Yup, fair point, it's been some time. Although vibe coding is more "miss" than "hit" for me. | |
| ▲ | pizza 3 days ago | parent | prev [-] | | It's so thrilling that this is actually true in just a year |
| |
| ▲ | johnisgood 4 days ago | parent | prev [-] | | I wrote some Haskell using Claude. It was great. |
|
|
| ▲ | SatvikBeri 4 days ago | parent | prev | next [-] |
| I've had a lot of good luck with Julia on high-performance data pipelines. |
| |
|
| ▲ | jszymborski 4 days ago | parent | prev [-] |
| ChatGPT is pretty useless at Prolog IME |