kccqzy 8 hours ago

But do you actually treat LLMs as glorified autocomplete or treat them as puzzle solvers where you give them difficult tasks beyond your own intellect?

Recently I wrote a data transformation pipeline and I added a note that the whole pipeline should be idempotent. I asked Claude to prove it or find a counterexample. It found one after 25 minutes of thinking; I reasonably estimate that it would take me far longer, perhaps one whole day. I couldn’t care less about using Claude to type code I already knew.

CoolGuySteve 8 hours ago | parent | next [-]

"give them difficult tasks beyond your own intellect?"

Lol no, I've yet to find a model with those properties. Sounds like a fast track to AI psychosis.

The domain I work in doesn't have enough public documentation for these models to be particularly helpful without a lot of handholding though.

hombre_fatal 8 hours ago | parent [-]

I've been working on a luks+btrfs+systemd tool (for managing an encrypted raid1 pool). While I have worked with each individually, it's not obvious what kind of cases you have to handle when composing them together. A lot of it is simply emergent, and the status quo has been to do your best and then see what actually happens at runtime.

Documentation is helpful to describe high-level intentions, but the beauty is when you have access to source code. Now a good model can derive behavior from implementation instead of docs which are inherently limited.

I implemented the luks+btrfs part by hand a few years ago, and I resurrected the project a couple months ago. Using source code for local reference, Claude discovered so many major cases I missed, especially in the unhappy-path scenarios. Even in my own hand-written tests. And it helped me set up an amazing NixOS VM test system, including reproduction tests on the libraries to see what they do in weird undocumented cases.

So I think "tasks beyond our intellect (and/or time and energy)" can be fitting. Otherwise I'd only be capable of polishing this project if luks+btrfs+systemd were specifically my day job. I just can't fit so much in my head and working memory.

zekica 7 hours ago | parent [-]

And it can fail in great ways. Latest example: I asked Claude for a non-trivial backup and recovery script using restic. I gave it the whole restic repo and it still made up parameters that don't exist in the code (but do exist in a pull request that's been sitting unmerged for 10+ months).

hombre_fatal 7 hours ago | parent [-]

Interesting. I don't think I've seen hallucinations at that level when it's referencing source code.

Though my workflow always starts in plan mode where Claude is clearly more thorough (which is the reason it takes 10x as long as going straight to impl). I rarely skip it.

shimman 8 hours ago | parent | prev [-]

This says more about you than the "intellect" of these nondeterministic probability programs.

Can you provide actual context on what was beyond your ability and how you were able to determine that the solution was correct?

I'm finding that all these comments that reference the "magical incantation" tend to be full of hot air. Maybe yours is different.

kccqzy 7 hours ago | parent [-]

> how you're able to determine if the solution was correct

I had hundreds of unit tests that did not trigger an assertion I added for idempotency. Claude wrote one that triggered an assertion failure. Simple as that. A counterexample suffices.
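
To make that concrete, here is a minimal Python sketch of what such a check might look like. Everything in it is a hypothetical stand-in (the `normalize` and `escape` steps, the `pipeline` function, the sample records), not the actual pipeline described above. It only illustrates the shape of the argument: running the pipeline on its own output should change nothing, and a single input that breaks this is enough to disprove idempotency.

```python
def normalize(record: dict) -> dict:
    """Hypothetical idempotent step: trim and lowercase the name."""
    return {**record, "name": record["name"].strip().lower()}


def escape(record: dict) -> dict:
    """Hypothetical step that is NOT idempotent: it re-escapes text
    that has already been escaped."""
    return {**record, "name": record["name"].replace("&", "&amp;")}


def pipeline(records: list) -> list:
    """Hypothetical pipeline: normalize, then escape, each record."""
    return [escape(normalize(r)) for r in records]


def is_idempotent(f, x) -> bool:
    """True if applying f a second time leaves the first result unchanged."""
    once = f(x)
    return f(once) == once


# Ordinary inputs never contain "&", so the check passes on them.
ordinary = [{"name": "  Alice "}, {"name": "BOB"}]

# One input with "&" is a counterexample: the second run turns
# "&amp;" into "&amp;amp;".
counterexample = [{"name": "Tom & Jerry"}]
```

Any number of passing inputs like `ordinary` cannot prove the property, while one failing input like `counterexample` settles the question, which is why the correctness of the counterexample is easy to verify even when finding it was hard.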