simonw 2 days ago

Opus 4.5 really is something else. I've been having a ton of fun throwing absurdly difficult problems at it recently and it keeps on surprising me.

A JavaScript interpreter written in Python? How about a WebAssembly runtime in Python? How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?

And these are mostly just casual experiments, often run from my phone!

krackers 2 days ago | parent | next [-]

>A JavaScript interpreter written in Python?

I'm assuming this refers to the Python port of Bellard's MQJS [1]? It's impressive and very useful, but leaving out the "based on MQJS" part is misleading.

[1] https://github.com/simonw/micro-javascript?

simonw 2 days ago | parent [-]

That's why I built the WebAssembly one - the JavaScript one started with MQJS, but for the WebAssembly one I started with just a copy of the https://github.com/webassembly/spec repo.

I haven't quite got the WASM one into a shareable shape yet, though - the performance is pretty bad, which makes the demos not very interesting.

dvrp 2 days ago | parent [-]

Isn’t that telling though?

krackers 2 days ago | parent [-]

A good test might be to provide it only about a third of the tests, then, when it says it's done, run it against the held-out two thirds and see how well it did. Of course it may have already seen the other tests during training, but that's not relevant here, since the goal is to find out whether it's just "brute force bumbling" its way through the task, relying heavily on the test suite as bumper rails for feedback, or whether it's actually writing generalizable, bug-free code with active awareness of pitfalls and corner cases. (Then again, it might be invalidated if this specific project was part of the RL training process. Which it may well have been - it's low-hanging fruit to convert any repo with a comprehensive test suite into training data.)
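Something like this - a hypothetical layout where each spec test lives in its own JSON file and run_test.py is a made-up runner, just to sketch the idea:

    import pathlib
    import random
    import subprocess

    # Hypothetical layout: one spec test per JSON file under tests/.
    tests = sorted(pathlib.Path("tests").glob("*.json"))
    random.seed(42)
    random.shuffle(tests)

    cut = len(tests) // 3
    visible, holdout = tests[:cut], tests[cut:]

    # Let the model iterate against the visible third only
    # (agent loop elided), then score it on the unseen two thirds.
    passed = sum(
        subprocess.run(["python", "run_test.py", str(t)]).returncode == 0
        for t in holdout
    )
    print(f"holdout: {passed}/{len(holdout)} passed")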

Either way, most tasks don't have the luxury of a thorough test suite - the test suite itself is the product of arduous effort in debugging and identifying corner cases.

burntsushi 2 days ago | parent | prev | next [-]

> How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?

How did it do? :-)

simonw 2 days ago | parent [-]

Alarmingly well! https://gisthost.github.io/?1bf98596a83ff29b15a2f4790d71c41d...

It couldn't quite beat the Rust implementation on everything, but it managed to edge it out on at least some of the benchmarks it wrote for itself.

(Honestly it feels like a bit of an affront to the natural order of things.)

That said... I'm most definitely not a Rust or C programmer. For all I know it cheated at the benchmarks and I didn't spot it!

burntsushi 2 days ago | parent | next [-]

Nice. Yeah, I'd have to actually look at what it did. For the task of substring search, it's extremely easy to fall into a local optimum. The `memchr` crate has oodles of benchmarks, and some of them are very much in tension with others. It's easy to do well on one at the expense of the others.
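As a toy illustration of that tension (pure Python, nothing to do with the crate's actual code): a prefilter keyed off the needle's first byte flies when that byte is rare in the haystack and crawls when it's common:

    # Toy substring search with a first-byte prefilter.
    def find(haystack: bytes, needle: bytes) -> int:
        first = needle[0]
        i = haystack.find(first)  # stand-in for a memchr-style scan
        while i != -1:
            if haystack[i:i + len(needle)] == needle:
                return i
            i = haystack.find(first, i + 1)
        return -1

    rare = b"x" * 1_000_000 + b"needle"    # 'n' appears only at the match
    common = b"n" * 1_000_000 + b"needle"  # 'n' is every single byte

    # find(rare, b"needle") verifies one candidate;
    # find(common, b"needle") re-verifies at every position.

A benchmark suite tuned on the first kind of input would look amazing right up until it hit the second.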

But still, very neat.

simonw 2 days ago | parent [-]

Here's the C code. It pretty much lifted every optimization trick it could find directly from your Rust code, as far as I can tell: https://github.com/simonw/research/blob/main/memchr-c-wrappe...

aizk 2 days ago | parent | prev [-]

What are you using to easily share the conversation as its own webpage? Very nice and tidy.

simonw 2 days ago | parent [-]

A Python tool called claude-code-transcripts that I had Claude Code help me write last month: https://simonwillison.net/2025/Dec/25/claude-code-transcript...

aizk 2 days ago | parent [-]

Very cool

falloutx 2 days ago | parent | prev | next [-]

I have tried giving it extreme problems, like creating a slime mold pathing algorithm or inventing completely new shoe-lacing patterns, and it starts struggling with problems that rely on visual reasoning and have very little consensus on how to solve them.

Loocid 2 days ago | parent | prev | next [-]

I'm not super surprised that these examples worked well. They are complex and a ton of work, but the problems are relatively well defined, with tons of documentation online. Sounds ideal for an LLM, no?

simonw 2 days ago | parent [-]

Yes, that's a point I've been trying to emphasize: if a problem is well specified a coding agent can crunch for hours on it to get to a solution.

Even better if there's an existing conformance suite to point at - like html5lib-tests or the WebAssembly spec tests.
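The harness is the easy part - something like this (hypothetical paths and fixture format, but this is the shape of the loop):

    import json
    import pathlib
    import subprocess

    # Hypothetical: conformance tests checked out as JSON fixtures,
    # my_runtime.py is the interpreter under test.
    failures = []
    for case_file in pathlib.Path("spec-tests").glob("*.json"):
        for case in json.loads(case_file.read_text()):
            result = subprocess.run(
                ["python", "my_runtime.py", case["input"]],
                capture_output=True, text=True,
            )
            if result.stdout.strip() != case["expected"]:
                failures.append((case_file.name, case["input"]))

    print(f"{len(failures)} failures")  # the agent iterates until this hits 0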

ronsor 2 days ago | parent | prev | next [-]

One of my first tests with it was "Write a Python 3 interpreter in JavaScript."

It produced tests, then wrote the interpreter, then ran the tests and worked until all of them passed. I was genuinely surprised that it worked.

Calavar 2 days ago | parent | next [-]

There are multiple Python 3 interpreters written in JavaScript that were very likely included in the training data - for example [1], [2], and [3].

I once gave Claude (Opus 3.5) a problem that I thought was surely too difficult for an LLM, and much to my surprise it spat out a very convincing solution. The surprising part was that I was already familiar with the solution - because it was almost a direct (uncredited) copy/paste from a blog post I had read only a few hours earlier. If I hadn't read that blog post, I would have been none the wiser that copy/pasting Claude's output could be potential IP theft. I have to imagine that LLMs solve a lot of in-training-set problems this way and people never realize they are dealing with a copyright/licensing minefield.

A more interesting and convincing task would be to write a Python 3 interpreter in JavaScript that uses register-based bytecode instead of stack-based, supports optimizing the bytecode via procedure inlining and constant folding, and never allocates memory (all work is done in a single user-provided preallocated buffer). That would require integrating multiple disparate coding concepts rather than regurgitating prior art from the training data.
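Roughly the stack-vs-register difference, for anyone who hasn't looked at bytecode formats (hand-written toy instruction sets here, not CPython's or any real VM's):

    # Stack-based: operands flow through an implicit stack.
    #   LOAD a; LOAD b; ADD; STORE x
    # Register-based: each instruction names its operands explicitly.
    #   ADD r2, r0, r1   (meaning r2 = r0 + r1)

    # A toy register VM: instructions are (op, dst, src1, src2).
    def run(code, regs):
        for op, dst, a, b in code:
            if op == "add":
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":
                regs[dst] = regs[a] * regs[b]
        return regs

    regs = [2, 3, 0, 0]  # r0 = a, r1 = b
    run([("add", 2, 0, 1), ("mul", 3, 2, 2)], regs)
    print(regs[3])  # (a + b) ** 2 = 25

Constant folding then just means evaluating an instruction at compile time whenever both of its sources are known constants.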

[1] https://github.com/skulpt/skulpt

[2] https://github.com/brython-dev/brython

[3] https://github.com/yzyzsun/PyJS

wubrr 2 days ago | parent | prev [-]

Its ability to test, iterate, and debug issues is pretty impressive.

Though it seems to work best when context is minimized. Once the code passes a certain complexity/size it starts making very silly errors quite often - the exact same code it wrote in a smaller context will come out with random obvious typos, like missing spaces between tokens. At one point it started writing the code backwards (first line at the bottom of the file, last line at the top) :O

troupo 2 days ago | parent | prev | next [-]

On the other hand, when I tried it just yesterday I couldn't really see a difference. As I wrote elsewhere: same crippled context window, same "I'll read 10 irrelevant lines from a file", same random changes, etc.

Meanwhile, half a year to a year ago I could already point whatever model was in vogue at the time at pychromecast and tell it repeatedly "just convert the rest of the functionality to Swift", and it did it. No idea about the quality of the code, but it worked, alongside implementations for mDNS and SwiftUI - see gif/video here: https://mastodon.nu/@dmitriid/114753811880082271 (the chromecast info isn't shown in the video).

I think agents have become better, but the models themselves have likely all but plateaued.

Krei-se 2 days ago | parent | prev [-]

Insanely difficult to you, maybe, because you stopped learning. What you cannot create, you do not understand.

simonw 2 days ago | parent [-]

Are you honestly saying that building a new spec-compliant WebAssembly runtime from scratch isn't an absurdly difficult project?
