Remix.run Logo
weitendorf 4 hours ago

One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.

I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model version or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).

The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…

movpasd 2 hours ago | parent | next [-]

I've not done particularly rigorous testing, but I've done this a lot with Claude to get a feel. What I've noticed is for certain open-ended tasks, Claude is extremely primeable: it will pick up on minor differences in wording in your prompt and run with them hard.

It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.

These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.

evntdrvn 22 minutes ago | parent | prev | next [-]

One thing that I learned when doing raw API LLM usage is how drastically the results can vary call per call with exactly the same input. I think that on average, people using agents underestimate the variation in results from a given turn command are, and so overindex on "X technique worked well" or "if I do Y then this will happen" or even "it did Z task well last time so it will this time too" or "{Model} is great at {thing}"

dotancohen 2 hours ago | parent | prev | next [-]

  > We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
Any chance you could share some of these? Seems like something we could all benefit from.
weitendorf 2 hours ago | parent [-]

Sure, my company has been working on a broad swathe of infrastructure projects and developer tools, which requires prompting models to seek out other tools/apis/docs/examples but in a way where we can't just dump all the context on the model up front. We also need the models to oftentimes look up technical documentation and specs, and sometimes build custom parsers for specific documentation websites that only make the data available embedded across 200+ pages of html.

First, I almost always try to seed every new project or context/domain with canonical technical specifications or examples I found elsewhere. When I set up this project recently, I linked to a bunch of the official Apple docs for sysctl, and told it to use a specific technique for calling assembly code from Go, that from experience it almost never realizes it can do or knows about (and similarly for sysctl, I knew it kinda sorta knew about it, but not in its entirely): https://github.com/accretional/sysctl/commit/da52438233e5b33...

The other thing I did was tell it to enumerate all the test cases ahead of time rather than to just directly implement them; again this is something where you have to explicitly tell it to go digging for information where it has blind spots and get it to set up properly grounded self-eval in a way that it can test against. I usually tell it to take notes as it works or commit notes to itself that will persist over sessions: https://github.com/accretional/sysctl/blob/main/FINDINGS_2.m...

Once we get back to working on this project we'll just have it implement / validate the rest of the sysctl feature support against the full inventory we had it uncover: https://github.com/accretional/sysctl/blob/main/cmd/darwin-n...

Another thing we do is have it specify an API that it can produce against; then in other projects we have them consume the API via reflection (and our special sauce we've been working on is the ability to discover and integrate against these automatically across thousands of APIs from many providers, which we've got working and can share if you're interested in using it as an early customer): https://github.com/accretional/sysctl/blob/main/proto/sysctl... This isn't the greatest example because it doesn't actually fully specify the sysctl keys yet. But I did have it create a knowledge base trying to cover the 1000+ keys as best as it could, to reference as it continued: https://github.com/accretional/sysctl/tree/main/macos-sysctl...

We have a better example in eg https://github.com/accretional/proto-sqlite/tree/main/lang where we were able to encode the entire sqlite grammar into a grpc interface so that you could eg find the exact structure (and sanitize) of a select statement: https://github.com/accretional/proto-sqlite/blob/main/lang/p... This way integration and discovery becomes a matter of telling it "use reflection against this endpoint to discover the sql interface, then implement against it" and we can model formats/input validation as formal grammars via EBNF (all magic words) vs just adhoc

We also tell it to set up and use a browser automation toolkit/testing and always run it at the end of testing workflows (often in a way that auto-opens screenshots on our local machines + commits them to git) via tools like https://github.com/accretional/chromerpc#headlessbrowser-aut... so that whenever we produce UIs it can evaluate its own output and iterate without direct human intervention. This is another case where the knowledge-discovery problem becomes a problem so we tell the models to use reflection to discover the browser automation apis. That ends up giving us things like this where it records user journeys through sites and creates visualizations without us having to debug them or do them ourselves: https://github.com/accretional/proto-css/tree/main/chrome-te...

mnicky 2 hours ago | parent | prev [-]

If the benefits of using the model you've come to know well outweigh the disadvantages, you can continue using it even after the release of a successor model, right?