XenophileJKO 5 days ago

Personally my biggest piece of advice is: AI First.

If you really want to understand what the limitations are of the current frontier models (and also really learn how to use them), ask the AI first.

By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said.. you also need to understand how they fail and build an intuition for why they fail.

Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people who have elaborate context setups they crafted for less capable models; those setups are largely unnecessary with GPT-5-Codex and Sonnet 4.5.

imiric 5 days ago | parent | next [-]

> By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests.

Unfortunately, it doesn't quite work out that way.

Yes, you will get better at using these tools the more you use them, which is the case with any tool. But you will not learn what they can do as easily, or at all.

The main problem with them is the same one they've had since the beginning. If the user is a domain expert, then they will be able to quickly spot the inaccuracies and hallucinations in the seemingly accurate generated content, and, with some effort, coax the LLM into producing correct output.

Otherwise, the user can be easily misled by the confident and sycophantic tone, and potentially waste hours troubleshooting without being able to tell whether the error is on the LLM side or their own. In most of these situations, they would probably have been better off reading the human-written documentation and code, and doing the work manually. Perhaps with minor assistance from LLMs, but never relying on them entirely.

This is why these tools are most useful to people who are already experts in their field, such as Filippo. For everyone else who isn't, and actually cares about the quality of their work, the experience is very hit or miss.

> That being said.. you also need to understand how they fail and build an intuition for why they fail.

I've been using these tools for years now. The only intuition I have for how and why they fail is in domains I'm already familiar with. But I had that without LLMs as well, whenever someone talks about a subject I know. It's impossible to build that intuition for domains you have little familiarity with. You can certainly get there through traditional learning, and LLMs can help with that, but most people use them for what you suggest: throwing things over the wall and running with it, which is a shame.

> I work with people who have elaborate context setups they crafted for less capable models; those setups are largely unnecessary with GPT-5-Codex and Sonnet 4.5.

I haven't used GPT-5-Codex, but have experience with Sonnet 4.5, and it's only marginally better than the previous versions IME. It still often wastes my time, no matter the quality or amount of context I feed it.

XenophileJKO 5 days ago | parent | next [-]

I guess there are several unsaid assumptions here. The article is by a domain expert working on their domain. Throw work you understand at it, see what it does. Do it before you even work on it. I kind of assumed based on the audience that most people here would be domain experts.

As for the building intuition, perhaps I am over-estimating what most people are capable of.

Working with and building systems using LLMs over the last few years has helped me build a pretty good intuition about what is breaking down when the model fails at a task. While having an ML background is useful in some very narrow cases (like: 'why does an LLM suck at ranking...'), I "think" a person can get a pretty good intuition purely based on observational outcomes.

I've been wrong before, though. When we first started building LLM products, I thought, "Anyone can prompt, there is no barrier for this skill." That was not the case at all. Most people don't do well trying to quantify ambiguity, specificity, and logical contradiction when writing a process or set of instructions. I was REALLY surprised at how I became a "go-to" person to "fix" prompt systems, all based on linguistics and systematic process decomposition. Some of this was understanding how the auto-regressive attention system benefits from breaking the work down into steps, but really most of it was just "don't contradict yourself and be clear".
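
A toy illustration of the kind of fix that usually meant (my own example here, not a real prompt from those systems):

    Before (self-contradictory):
      "Always answer in a single short sentence. Explain your reasoning in
       detail and include every relevant caveat."

    After (decomposed, no contradictions):
      1. Give the answer in one sentence.
      2. Then list caveats as short bullet points.
      3. Do not restate the answer inside the caveats.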

Working with them extensively has also helped me home in on how the models get "better" with each release. Though most of my expertise is with the OpenAI and Anthropic model families.

I still think most engineers "should" be able to build intuition generally on what works well with LLMs and how to interact with them, but you are probably right. It will be just like most ML engineers who see something work in a paper and then just paste it onto their model with no intuition about what that structurally changes in the model's dynamics.

fn-mote 5 days ago | parent [-]

> I kind of assumed based on the audience that most people here would be domain experts.

No take on the rest of your comment, but it’s the nature of software engineering that we work on a breadth of problems. Nobody can be a domain expert in everything.

For example: I use a configurable editor every day, but I'm not a domain expert in its configuration. An LLM wasted an hour of my day pointing me in "almost the right direction" when after 10 minutes I should have just RTFM.

I am a domain expert in some programming languages, but now I need to implement a certain algorithm… I'm not an expert in that algorithm. There are lots of traps for the unwary.

I just wanted to challenge the assumption that we are all domain experts in the things we do daily. We are, but … with limitations.

imiric 5 days ago | parent [-]

Exactly.

A typical programmer works within unfamiliar domains all the time. It's not just about being familiar with the programming language or tooling. Every project potentially has new challenges you haven't faced before, new APIs to evaluate and design, new tradeoffs to consider, etc.

The less familiar you are with the domain or API, the less instinct and influence you have to steer the LLM in the right direction, and the more inclined you are to trust the tool over yourself. So when the tool is wrong, as it often still is, you can spend a lot of time fighting with it to produce the correct output.

The example in the article is actually the best case scenario for these tools. It's essentially pattern matching using high quality code, from someone who's deeply familiar with the domain and the code they've written. The experience of someone unfamiliar trying to implement the same algorithm from scratch by relying on LLMs would be vastly different.

XenophileJKO 5 days ago | parent [-]

I mean, I "understand" your point. However, this isn't any different from being a technical lead on a system of any significant complexity: you are constantly reviewing work that you are not an expert on. It is a very similar practice.

I'm constantly reviewing things that I am not a domain expert on. I have to identify what is risky, what I don't know, etc. Throwing to the AI first is no different than throwing to someone else first. I have the same requirements. Now I can choose how much I "trust" the person or LLM. I have had coworkers I trust less than LLMs... I'll put it that way.

So, just like when reviewing a co-worker: pay attention to areas where you are not sure what the right method is and maybe double-check it. This just isn't a "new" thing.

hitarpetar 4 days ago | parent | next [-]

> Throwing to the AI first is no different than throwing to someone else first

except in all the ways that it is obviously different

imiric 5 days ago | parent | prev [-]

Well, you're right that reviewing someone else's work isn't new, but interacting with these tools is vastly different from communicating with a coworker.

A competent human engineer won't confidently make claims that aren't based in reality. They can be wrong about practical ways of accomplishing something, but they won't suggest using APIs that don't exist, or go off on wild tangents because a certain word was mentioned. They won't give a different answer every time you ask them the same question. Most importantly, conversations with humans can be productive in ways where both parties gain a deeper understanding of the topic and respect for each other. Humans can actually think and reason about topics and ideas, they can actually verify their claims and yours, and they won't automatically respond with "You're right!" to any counterargument or suggestion.

Furthermore, the marketing around "AI" is strongly based on promoting their superhuman abilities. If we're led to believe that these are superintelligent machines, we're more inclined to trust their output. We have people using them as medical professionals, thinking that they're talking to a god, and being influenced by them. Trusting them to produce software is somewhere on that scale. All of this is highly misleading and potentially dangerous.

Any attempt at anthropomorphizing "AI" is a mistake. You can get much more out of them by using them as what they are: excellent pattern matching probabilistic tools.

Razengan 5 days ago | parent | prev [-]

I did ask the AI first, about some things that I already knew how to do.

It gave me horribly inefficient or long-winded ways of doing it. In the time it took for "prompt tuning" I could have just written the damn code myself. It decreased my confidence in anything else it suggested about things I didn't already know.

Claude still sometimes insists that iOS 26 isn't out yet. Sigh... I suppose I just have to treat it as an occasional alternative to Google/StackOverflow/Reddit for now. No way would I trust it to write an entire class, let alone an app, and be able to sleep at night (not that I sleep at night, but that's beside the point).

I think I prefer Xcode's built-in local model approach, where it just offers sane autocompletions based on your existing code, e.g. if you already wrote a Dog class it can make a Cat class and change `bark()` to `meow()`.

theshrike79 5 days ago | parent | next [-]

You can write the "prompt tuning" down in AGENTS.md and then you only need to do it once. This is why you need to keep working with different models, to get a feel for what they're good at and how you can steer them closer to your style and preferences without having to reiterate from scratch every time.

I personally have a git submodule built specifically for shared instructions like that; it contains the assumptions and defaults for my specific style of project across 3 different programming languages. When I update it in one project, all my projects benefit.

This way I don't need to tell whatever LLM I'm working with to use modernc.org/sqlite for database connections, for example.
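
For example, one entry in a shared instructions file like that might look roughly like this (an illustrative sketch, not the literal contents of my submodule):

    ## Go defaults
    - Use modernc.org/sqlite (pure Go, no cgo) for SQLite access; do not add mattn/go-sqlite3.
    - Keep database access behind a small interface so the driver can be swapped in tests.
    - Run `go vet ./...` and the project linter before considering a task done.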

Razengan 4 days ago | parent [-]

> You can write the "prompt tuning" down in AGENTS.md and then you only need to do it once.

Yeah, I just mean: I know how to "fix" the AI for things that I already know about.

But how would I know if it's wrong or right about the stuff I DON'T know?? I'd have to go Google shit anyway to verify it.

This is me asking ChatGPT 5 about ChatGPT 5: https://i.imgur.com/aT8C3qs.png

Asking about Nintendo Switch 2: https://i.imgur.com/OqmB9jG.png

Imagine if AI was somebody's first stop for asking about those things. They'd be led to believe they weren't out when they in fact were!

theshrike79 3 days ago | parent [-]

There's your problem right there.

Don't use it as a knowledge machine, use it as a tool.

Agentic LLMs are the ones that work. The ones that "use tools in a loop to achieve a goal"[0]. I just asked Claude to "add a release action that releases the project as a binary for every supported Go platform" on one of my GitHub projects. I can see it worked because the binaries appeared as a release. It didn't "hallucinate" anything, nor was it a "stochastic parrot". It applied a well-known pattern to a situation perfectly. (OK, it didn't use a build matrix, but that's just me nitpicking.)
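
The pattern it applies there is nothing exotic; a minimal sketch of that kind of workflow (placeholder binary names, not the exact file Claude generated, and again without a matrix) looks roughly like:

    name: release
    on:
      release:
        types: [published]
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-go@v5
            with:
              go-version: 'stable'
          # Cross-compile for each target platform
          - run: |
              mkdir -p dist
              GOOS=linux   GOARCH=amd64 go build -o dist/myapp-linux-amd64 .
              GOOS=darwin  GOARCH=arm64 go build -o dist/myapp-darwin-arm64 .
              GOOS=windows GOARCH=amd64 go build -o dist/myapp-windows-amd64.exe .
          # Attach the binaries to the GitHub release
          - uses: softprops/action-gh-release@v2
            with:
              files: dist/*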

In your cases the LLM should have seen that you were asking about current events or news and used a tool that fetches that information. Instead it just defaulted to whatever was baked into its training data and failed spectacularly.

AIs have a branding issue, because AI != AI which isn't AI. There are so many flavours that it's hard to figure out what people are talking about when they say "AI slop is crap" when I can see every day how "AI" makes my life easier by automating away the mundane crap.

[0] https://simonwillison.net/2025/Sep/18/agents/

simonw 5 days ago | parent | prev [-]

> Claude still sometimes insists that iOS 26 isn't out yet.

How would you imagine an AI system working that didn't make mistakes like that?

iOS 26 came out on September 15th.

LLMs aren't omniscient or constantly updated with new knowledge. Which means we have to figure out how to make use of them despite them not having up-to-the-second knowledge of the world.

Razengan 5 days ago | parent [-]

> How would you imagine an AI system working that didn't make mistakes like that?

I mean, if the user says "Use the latest APIs as of version N" and the AI thinks version N isn't out yet, then it should CHECK the web first, it's right there, before second-guessing the user. I didn't ask it whether 26 was out or not. I told it.

Oh, but I guess AIs aren't allowed to have free use of Google's web search or to scrape other websites, eh.

> iOS 26 came out on September 15th.

It was in beta all year and the APIs were publicly available on Apple's docs website. If I told it to use version 26 APIs then it should just use those instead of gaslighting me.

> LLMs aren't omniscient or constantly updated with new knowledge.

So we shouldn't use them if we want to make apps with the latest tech? Despite what the AI companies want us to believe.

You know, on a more general note, I think all AIs should have a toggle between "Do as I say" (Monkey Paw) and "Do what I mean"

simonw 5 days ago | parent [-]

Was this Claude Code or Claude.ai or some other tool that used Claude under the hood?

Different harnesses have different search capabilities.

If I'm doing something that benefits from search I tend to switch to ChatGPT because I know it has a really good search feature available to it. I don't trust Claude's as much.

Razengan 4 days ago | parent [-]

I used the Claude website and Mac desktop app for a relatively standard iOS SwiftUI project.

I used Claude Code with VS Code for some Godot stuff, and even there it sometimes gave outdated or outright made-up APIs (functions that seemed like they should exist but did not, etc.).

simonw 4 days ago | parent [-]

Unfortunately LLMs mostly suck at Swift and SwiftUI from what I've heard: they still change pretty often, and as a result there aren't enough fresh examples in the training data.

As primarily a Python/JavaScript programmer I don't have that problem!

Razengan 4 days ago | parent [-]

They're terrible at anything new, including knowing about THEMSELVES and their latest versions.

This is me asking ChatGPT 5 about ChatGPT 5: https://i.imgur.com/aT8C3qs.png

Asking about Nintendo Switch 2: https://i.imgur.com/OqmB9jG.png

This could be solved, and LLMs could be a lot more useful, if they could be a wrapper around live web search: just search for this shit, scrape the top few results, and summarize the info for me.

But that's a stillborn dream, crippled because Google won't let third-party AIs use their search willy-nilly and websites don't want to be scraped :(

Don't get me wrong: I see the potential in AIs/LLMs and I think they could be amazing for everything, but like every awesome thing, they're hampered by corporate (and government) idiocy.

simonw 4 days ago | parent [-]

Claude Code has a neat fix for that - it knows to look at its own documentation if you ask it questions about itself: https://simonwillison.net/2025/Oct/24/claude-code-docs-map/

I've had great results from ChatGPT running the "GPT-5 Thinking" model since that almost always opts to run a search before it attempts to answer a question.

Here's what I got from that for your Switch 2 question: https://chatgpt.com/share/69089028-db8c-8006-b238-1d6946e791...

Screenshot of the searches it ran here: https://gist.github.com/simonw/048ffb895dd6b94419f0b4e066143...

Razengan 4 days ago | parent [-]

A month ago, when I asked Claude (on the website) about its privacy options and such, it always pointed me to the Anthropic website to look it up myself.

Another annoying example: I thought Google's Gemini would be search-first since, well, they're Google.

I asked Gemini to search for Airbnb rooms in an area and give me a summarized list.

It told me it can't and I could do it myself.

I told it again.

Again it told me it can't, but here's how I could do it myself.

I told it it sucks and that ChatGPT etc. can do it for me.

Then it went and, I don't know, scraped Airbnb or used a previous search it must have had, to pull up rooms with an Airbnb link for each.

This could actually be THE absolute killer app for a lot of people, if AI could plan your trip from a single sentence: "I'm free next week. I'd like to go to A, B, or C for a couple days. What's a cheap flight and a room within this budget near X area?" and if it could also go and make a booking through your accounts it would be orgasmic. Finally we would have what people in the 1960s thought computers would be doing in 2000 :')

But as it is, in their current state you have to wade through quite a bit of dumbassery.