kibwen 4 days ago

Last week I wanted to generate test data for some unit tests of a particular function in a C codebase. It's an audio codec library, so I could have modified the function to dump its inputs to disk, run the library on any audio file, and hardcoded the inputs into the unit tests. Instead, I decided to save a few bytes and look into generating the dummy data dynamically. I wanted to try out Claude for writing the code that would generate that data, so to keep the context manageable I extracted the function and all its dependencies into a self-contained C program (less than 200 lines altogether) and asked it to write a function that would generate dummy data, in C.
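For the curious, the shape I had in mind was roughly the following (names are hypothetical here, since the real function's inputs are codec-specific): a tiny deterministic PRNG filling a buffer with fake samples, so the test data is reproducible without shipping an audio file.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch: fill a buffer with deterministic pseudo-random
       16-bit samples so the unit tests don't need a real audio file.
       (xorshift32; the real codec's input layout will differ.) */
    static uint32_t next_rand(uint32_t *state)
    {
        uint32_t x = *state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        return *state = x;
    }

    void fill_dummy_samples(int16_t *buf, size_t count, uint32_t seed)
    {
        uint32_t state = seed ? seed : 1u; /* xorshift state must be nonzero */
        for (size_t i = 0; i < count; i++)
            buf[i] = (int16_t)(next_rand(&state) & 0xFFFFu);
    }

Seeding it with a constant keeps the hardcoded expectations in the tests stable across runs.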

Impressively, it recognized the structure of the code and correctly identified it as a component of an audio codec library, and provided a reasonably complete description of many minute details specific to this codec and the work that the function was doing.

Rather less impressively, it decided to ignore my request and write a function that used C++ features throughout, such as type inference and lambdas, or should I say "lambdas", because it was actually just a function defined within a function that tried to access and mutate variables outside its own scope, as if we were writing JavaScript or something. Even apart from that, the code was rife with the sorts of warnings that even a default invocation of gcc would flag.
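What I actually wanted was the boring plain-C version of that: the would-be lambda hoisted into a static helper, with the state it tried to capture passed in explicitly. A hypothetical sketch of the shape (made-up names again):

    #include <stdint.h>

    /* Hypothetical sketch: the state the "lambda" tried to capture and
       mutate lives in a struct and is passed in explicitly instead. */
    struct gen_ctx {
        uint32_t seed;     /* would-be captured variable */
        unsigned frames;   /* would-be mutated counter */
    };

    static int16_t next_dummy_sample(struct gen_ctx *ctx)
    {
        ctx->seed = ctx->seed * 1664525u + 1013904223u; /* simple LCG step */
        ctx->frames++;
        return (int16_t)(ctx->seed >> 16);
    }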

I can see why people would be wowed by this on its face. I wouldn't expect the average developer to have the depth of knowledge and breadth of pattern-matching ability needed to identify the specific task that this specific function in this specific audio codec was performing.

At the same time, this is clearly not a tool that's suitable for letting loose on a codebase without EXTREME supervision. This was a fresh session (no prior context to confuse it) using a tightly crafted prompt (a small, self-contained C program doing one thing) with a clear goal, and it still required constant handholding.

At the end of the day, I got the code working by editing it manually, but in an honest retrospective I would have to admit that the overall process actually didn't save me any time at all.

Ironically, despite how they're sold, these tools are infinitely better at going from code to English than going the other way around.

angusturner 4 days ago | parent | next [-]

I feel this. I've had a few tasks now where, in honest retrospect, I find myself asking "did that really speed me up?". It's a bit demoralising because not only do you waste time, you also come away with a worse mental model of the resulting code and less sense of ownership over the result.

Brainstorming, ideation, and small, well-defined tasks where I can quickly vet the solution: these feel like the sweet spot for current frontier-model capabilities.

(Unless you're pumping out some sloppy React SPA where you don't care about anything except getting it working as fast as possible; in that case, fine, have Claude Code one-shot it.)

Filligree 4 days ago | parent | prev | next [-]

There’s been a lot of noise about Claude performance degradation, and the current best option is probably Codex, but this still surprises me. It sounds like it succeeded on the hard part, then stumbled on the easy bit.

Just two questions, if you don’t mind satisfying my curiosity.

- Did you tell it to write C? Or better yet, what was the prompt? You can use claude --resume to find it easily.

- Which model (Sooner or Opus)? Though I'd have expected either one to work.

chrisweekly 4 days ago | parent [-]

Sooner -> Sonnet

walleeee 4 days ago | parent | prev [-]

> Ironically, despite how they're sold, these tools are infinitely better at going from code to English than going the other way around.

Yes. Decently useful (and reasonably safe) to red team yourself with. But extremely easy to red queen yourself otherwise.