> For me LLM:s are just a computer interface you can program using natural language.

I wish they were, but they're not that yet because LLMs aren't very good at logical reasonsing. So it's more like an attempt to program using natural language. Sometimes it does what you ask, sometimes not.

I think "programming" implies that the machine will always do what you tell it, whatever the language, or reliably fail and say it can't be done because the "program" is contradictory, lacks sufficient detail, or doesn't have the necessary permissions/technical capabilities. If it only sometimes does what you ask, then it's not quite programming yet.

> Now - I can offload the most boring task to LLM and spend my mental energy on the interesting stuff!

I wish that, too, were true, and maybe it will be someday soon. But if I need to manually review the agent's output, then it doesn't feel like offloading much aside from the typing. All the same concentration and thought are still required, even for the boring things. If I could at least trust the agent to tell me if it did a good job or is unsure that would have been helpful, but we're not even there yet.

That's not to say the tools aren't useful, but they're not yet "programming in a natural language" and not yet able to "offload" stuff to.

▲ keeda 6 days ago | parent | next [-]

> ... LLMs aren't very good at logical reasonsing.

I'm curious about what experiences led you to that conclusion. IME, LLMs are very good at the type of logical reasoning required for most programming tasks. E.g. I only have to say something like "find the entries with the lowest X and highest Y that have a common Z from these N lists / maps / tables / files / etc." and it spits out mostly correct code instantly. I then review it and for any involved logic, rely on tests (also AI-generated) for correctness, where I find myself reviewing and tweaking the test cases much more than the business logic.

But then I do all that for all code anyway, including my own. So just starting off with a fully-fleshed out chunk of code, which typically looks like what I'd pictured in my head, is a huge load off my cognitive shoulders.

▲ pron 6 days ago | parent | next [-]

The experience was that I once asked an LLM to write a simple function and it produced something very wrong that nothing with good reasoning abilities should ever do. Of course, a drunk or very tired human could have done the same mistake, but they would have at least told me that they were impaired and unsure of their work.

I agree that most of the time it does most simple tasks mostly right, but that's not good enough to truly "offload" my mental effort. Again, I'm not saying it's not useful, but more than working with a junior developer it's like working with a junior developer who may or may not be drunk or tired and doesn't tell you.

But mostly my point is that LLMs seem to do logical reasoning worse than other things they do better, such as generating prose or summarising a document. Of course, even then you can't trust them yet.

> But then I do all that for all code anyway, including my own

I don't, at least not constantly. I review other people's code only towards the very end of a project, and in between I trust that they tell me about any pertinent challenge or insight, precisely so that I can focus on other things unless they draw my attention to something I need to think about.

I still think that working with a coding assistant is interesting and even exciting, but the experience of not being able to trust anything, for me at least, is unlike working with another person or with a tool and doesn't yet allow me to focus on other things. Maybe with more practice I could learn to work with something I can't trust at all.

▲

darkerside 6 days ago | parent | next [-]

> working with a junior developer who may or may not be drunk or tired and doesn't tell you.

Bad news, friend.

Overall though, I think you're right. It's a lot like working with people. The things you might be missing are that you can get better at this with practice, and that once you are multiplexing multiple Claudes, you can become hyper efficient. These are things I'm looking into now.

Do I know these for a fact? Not yet. But, like any tool, I'm sure that the investment won't pay off right away.

▲

kenjackson 6 days ago | parent | prev [-]

What was the simple function?

▲

throwaway31131 6 days ago | parent | next [-]

I’m not sure what their simple function was but I tried to use Claude to recreate C++ code to implement the algorithms in this paper as practice for me in LLM use and it didn’t go well. But I’ll be the first to admit that I’m probably holding it wrong.

https://users.cs.duke.edu/~reif/paper/chen/graph/graph.pdf

▲

pron 6 days ago | parent | prev [-]

Can't remember, but it was something very basic - a 10/15-line routine that a first-year student would write in 3 minutes if they knew the relevant API. The reason I asked the model in the first place is because I didn't know the API. If memory serves, the model inverted an if or a loop condition.

▲

p1esk 6 days ago | parent | next [-]

Did you use one of the latest frontier reasoning models? If not, how is your experience relevant?

▲

totallykvothe 6 days ago | parent | next [-]

In what world is this an appropriate thing to say to someone?

	▲	p1esk 6 days ago \| parent \| next [-]
		In the world where you do not claim that LLMs suck today based on your attempt to use some shitty model three years ago.
	▲	guappa 6 days ago \| parent \| prev [-]
		In the creed of "AI is perfect, if you claim otherwise you're broken" that so many here embrace.

▲

6 days ago | parent | prev [-]

[deleted]

▲

jama211 6 days ago | parent | prev [-]

So you tried it once and then gave up?

▲

pron 6 days ago | parent [-]

I didn't give up, I just know that I can only use a model when I have the patience to work with something I can't trust at all on anything. So that's what I do.

	▲	jama211 5 days ago \| parent [-]
		Sounds like the spirit of my question remains intact

▲ foobarbecue 6 days ago | parent | prev | next [-]

In your example, you didn't ask the LLM to do any logic. You asked it to translate your logic into code.

Asking an LLM to do logic would be saying something like: "I have a row of a million light switches. They all start off. I start at the beginning and flip on every fourth one. Then I flip on every eighth one, then sixteen, and all the powers of two until I'm over a million. Now I do the same for the powers of three, then four, then five, and so on. How many light switches are on at the end? Do not use any external coding tools for this; use your own reasoning."

Note that the prompt itself is intentionally ambiguous -- a human getting this question should say "I don't understand why you started with every fourth instead of every second. Are you skipping the first integer of every power series or just when the exponent is two?"

When I asked GPT5 to do it, it didn't care about that; instead it complimented me on my "crisp statement of the problem," roughly described a similar problem, and gave a belivable but incorrect answer 270,961 .

I then asked it to write python code to simulate my question. It got the code correct, and said "If you run this, you’ll see it matches the 270,961 result I gave earlier." except, that was a hallucination.

Running the code actually produced 252711.

I guess it went with 270,961 because that was a lexically similar answer to some lexically similar problems in the training data.

▲ cma 6 days ago | parent | next [-]

I would posit that most randomly selected AGIs (people) you ask this of with no tools allowed (pencil and paper) won't get closer on average. This doesn't prove people don't use logic

And allowing python shells for both I think the randomly selected human would do worse on average.

And I also think e.g. Google's IMO model would be able to pass but I have no way to verify that.

▲

pron 6 days ago | parent | next [-]

At work, you can trust people to either get the right answer or tell you they may not have the right answer. If someone is not trustworthy, you don't work with them again. The experience of trying to work with something that is completely not trustworthy on all fronts is novel and entirely dissimilar to working with either people or tools.

	▲	keeda 5 days ago \| parent [-]
		People themselves don't know when they are wrong, and that is why high-functioning organizations have all sorts of guardrails in place. Trivial example, code reviews. Now, code reviews are multi-purpose, and their primary benefit is not just catching bugs, but they do catch bugs pretty often (there are actual studies about this.) So my experience in working with AI is actually much more similar to working with people, except I have to correct the AI much less frequently. I always say, AI is technology that behaves like people, and so the trick to working with it effectively is to approach it like working with a colleague, with all their specific quirks and skillsets, rather than a tool.

▲

foobarbecue 6 days ago | parent | prev [-]

That's true about people with pencils and paper. But this is a computer, and my point is that with LLMs, you have a computer program that doesn't compute or do logic -- and for some reason people act like it does.

I mean, this is a super basic python task that I could write in 2 minutes. The naiive implementation is a couple of basic for loops. I wouldn't hire a junior dev that couldn't write it.

My point here is that for someone who can program, it's easier and better to write this in python than to write it in english and have an LLM translate it to python (and LLMs themselves are incapable of logic or reasoning, so the only option is for them to translate it to real code).

	▲	cma 6 days ago \| parent [-]
		I added this line to the end of your prompt so it knew it could clarify and didn't need an immediate answer: "You can ask any clarifying questions first before starting on the problem." Here's what it asked (Claude Opus, web search off so it didn't find your post): > Looking at this problem, I want to make sure I understand the process correctly before diving into the solution. Let me clarify a few things: > When you say "flip on every fourth one" for the powers of two, do you mean: > We flip switches at positions 4, 8, 12, 16, 20, 24, ... (every 4th position) Then flip switches at positions 8, 16, 24, 32, 40, 48, ... (every 8th position) And so on? > When you move to "powers of three", do you mean: > Start with every 3rd switch (3, 6, 9, 12, ...) Then every 9th switch (9, 18, 27, 36, ...) Then every 27th switch, etc.? > Or does "powers of three" mean we start with 3² = 9 as the first increment, similar to how we started with 2² = 4 for powers of two? > When you say "then four, then five, and so on" - do you mean we continue this process for all integers k ≥ 2, using powers of k as our increments? > The final state of each switch depends on whether it gets flipped an odd or even number of times total, so I want to make sure I understand the flipping pattern correctly before calculating the answer.

▲ keeda 5 days ago | parent | prev [-]

That may be the case, but how often do you have to provide these kinds of puzzles as programming tasks? And even then, by your own example, it did product correct code, so it did "reason" through the problem statement!

▲ foobarbecue 5 days ago | parent [-]

The point I'm making is that translating from english to python does not involve logical reasoning or any form of problem solving. You essentially replace english phrases with their python equivalents; it's a fuzzy version of looking things up in a dictionary.

It can translate, but it cannot reason or calculate. The python can calculate.

	▲	keeda 4 days ago \| parent [-]
		I've seen countless examples where it does do a lot of logical reasoning from a very high-level requirement. I only gave that example because that is the more common form of programming logic required, but there is a huge range of logical and coding tasks it handles seamlessly. It figures out the problem, the high-level approach, the appropriate algorithms, any intermediate book-keeping data structures required, any edge cases, what functions and classes from the contextual code matters, and how it should integrate with the rest of the codebase. Here is one recent exchange (some proprietary details redacted) with Claude Sonnet 3.7 via the Cody VSCode Extension (RIP): --- Me: At this point, "segments" represents a list of bounding boxes that can overlap. Write a function to coalesce only those boxes that are substantially (i.e. within some tolerance) horizontally aligned along one or both edges or their centers, either: 1) overlapping completely or partially; or 2) very closely spaced like the words of a sentence should be (you can assume the threshold for spacing could be a reasonable fraction of the height of the box). Claude: I'll create a function to coalesce horizontally aligned boxes based on your requirements. This function will identify boxes that are either overlapping or closely spaced and horizontally aligned (by top edge, bottom edge, or vertical center). ```javascript function coalesceHorizontallyAlignedSegments(segments, options = {}) { // Default options const { verticalAlignmentTolerance = 0.1, // 10% tolerance for vertical alignment horizontalSpacingThreshold = 0.5, // Max horizontal spacing as a fraction of box height overlapThreshold = 0.2 // Minimum overlap required (20% of the smaller box) } = options; if (!segments \|\| segments.length <= 1) { return segments; } // Sort segments by x position (left to right) const sortedSegments = [...segments].sort((a, b) => a.x - b.x); // Track which segments have been merged const merged = new Array(sortedSegments.length).fill(false); const mergedSegments = []; for (let i = 0; i < sortedSegments.length; i++) { // Skip if this segment has already been merged if (merged[i]) continue; const currentSegment = sortedSegments[i]; let mergedSegment = { ...currentSegment }; merged[i] = true; // Check for potential merges with subsequent segments let didMerge = true; while (didMerge) { didMerge = false; for (let j = 0; j < sortedSegments.length; j++) { if (merged[j]) continue; const nextSegment = sortedSegments[j]; // Check if horizontally aligned (top, bottom, or center) const topAligned = Math.abs(mergedSegment.y - nextSegment.y) <= verticalAlignmentTolerance * Math.min(mergedSegment.height, nextSegment.height); const bottomAligned = Math.abs((mergedSegment.y + mergedSegment.height) - (nextSegment.y + nextSegment.height)) <= verticalAlignmentTolerance * Math.min(mergedSegment.height, nextSegment.height); const centerAligned = Math.abs((mergedSegment.y + mergedSegment.height/2) - (nextSegment.y + nextSegment.height/2)) <= verticalAlignmentTolerance * Math.min(mergedSegment.height, nextSegment.height); const isVerticallyAligned = topAligned \|\| bottomAligned \|\| centerAligned; if (!isVerticallyAligned) continue; // Check for overlap or close spacing const mergedRight = mergedSegment.x + mergedSegment.width; const nextLeft = nextSegment.x; // Calculate horizontal spacing const horizontalSpacing = nextLeft - mergedRight; const maxAllowedSpacing = horizontalSpacingThreshold * Math.min(mergedSegment.height, nextSegment.height); // Check for overlap const isOverlapping = mergedRight >= nextLeft; // Calculate overlap percentage if overlapping let hasSignificantOverlap = false; if (isOverlapping) { const overlapWidth = Math.min(mergedRight, nextSegment.x + nextSegment.width) - Math.max(mergedSegment.x, nextSegment.x); const smallerWidth = Math.min(mergedSegment.width, nextSegment.width); hasSignificantOverlap = overlapWidth / smallerWidth >= overlapThreshold; } // Merge if overlapping significantly or closely spaced if (hasSignificantOverlap \|\| (horizontalSpacing >= 0 && horizontalSpacing <= maxAllowedSpacing)) { // Create a new bounding box that encompasses both segments const newX = Math.min(mergedSegment.x, nextSegment.x); const newY = Math.min(mergedSegment.y, nextSegment.y); const newRight = Math.max(mergedSegment.x + mergedSegment.width, nextSegment.x + nextSegment.width); const newBottom = Math.max(mergedSegment.y + mergedSegment.height, nextSegment.y + nextSegment.height); mergedSegment = { x: newX, y: newY, width: newRight - newX, height: newBottom - newY, // Preserve the virtual path from the leftmost segment imagePath: mergedSegment.x <= nextSegment.x ? mergedSegment.imagePath : nextSegment.imagePath }; merged[j] = true; didMerge = true; break; } } } // Add the topLeft and bottomRight properties for consistency mergedSegment.topLeft = { x: mergedSegment.x, y: mergedSegment.y }; mergedSegment.bottomRight = { x: mergedSegment.x + mergedSegment.width, y: mergedSegment.y + mergedSegment.height }; mergedSegments.push(mergedSegment); } return mergedSegments; } ``` This function works as follows: 1. It sorts segments from left to right to process them in reading order 2. For each segment, it looks for potential merges with other segments based on: - Vertical alignment (top edges, bottom edges, or centers within tolerance) - Either significant overlap (controlled by overlapThreshold) or - Close horizontal spacing (controlled by horizontalSpacingThreshold as a fraction of box height) 3. When segments are merged, it creates a new bounding box that encompasses both 4. It continues merging until no more merges are possible You can integrate this function into your segmentation pipeline by adding it after the segment extraction: ```javascript function segmentImage(ctx, imageData, options = {}) { // ... existing code ... `const segments = boundingRects.map(toSegment); // Coalesce horizontally aligned segments const coalescedSegments = coalesceHorizontallyAlignedSegments(segments, options.coalesceOptions); return { segments: coalescedSegments, contourTrees };` } ``` Me: Do not write any code yet, but would it help if the segments were first grouped into approximate "rows" and then coalesced, rather than simply sorted by the x-coordinate? Claude: Yes, grouping segments into approximate "rows" before coalescing would be significantly more effective than simply sorting by x-coordinate. Here's why: 1. Better handling of multi-column layouts: When a page has multiple columns of text or elements, sorting purely by x-coordinate would attempt to merge segments across different columns if they happen to be horizontally aligned. 2. Respecting natural reading flow: Text typically flows in rows, and grouping by rows first would respect this natural organization of content. 3. More accurate merging decisions: By first establishing which segments belong to the same "row" or line of content, we can make more informed decisions about which segments should be considered for merging. 4. Handling of varying vertical positions: Text lines on a webpage aren't always perfectly aligned. Pre-grouping into rows would allow us to handle slight variations in vertical positioning within what should be considered the same line. 5. Efficiency: By limiting the coalescing operations to segments within the same row, we reduce the number of comparisons needed. The approach would work like this: 1. Group segments into rows based on vertical overlap or proximity 2. Within each row, sort segments by x-coordinate 3. Apply the coalescing logic only to segments within the same row This approach would better preserve the semantic structure of the content while still merging segments that represent parts of the same logical unit (like words in a sentence). Looking at the codebase, this approach would align well with how visual content is typically structured and would produce more meaningful segment groupings for downstream processing. --- And then it proceeded to update the code as discussed. Sure this is not a very novel problem and it required a bit of back and forth, but look at the ratio of prompt to code. This exchange took a couple of minutes; I'd estimate it would have taken me an hour to get that code down with all edge cases handled. Look at the exposition, the quality of code, the choice to use optional parameters for tolerances, and edge-case handling. It's very, very hard for me to not see this as reasoning. I mean, how is this not mind-blowing?

▲ __MatrixMan__ 6 days ago | parent | prev | next [-]

I'm not who you're replying to but I had a scenario where I needed to notice that a command had completed (exit code received) but keep listening for any output that was still buffered and only stop processing tokens after it had been quiet for a little bit.

Trying to get Claude to do this without introducing a deadlock and without exiting too early and leaving valuable output in the pipe was hellish.

It's very good at some kinds of reasoning and very bad at others. There's not much it's mediocre at.

	▲	keeda 5 days ago \| parent [-]
		> It's very good at some kinds of reasoning and very bad at others. There's not much it's mediocre at. Yes, one of the terms for this phenomenon is "jagged frontier" from here: https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the... This is also the main reason people have such diverging experiences and the discussion around AI is so bimodal in nature.

▲ jpfromlondon 5 days ago | parent | prev [-]

https://arstechnica.com/ai/2025/08/researchers-find-llms-are...

▲ yomismoaqui 7 days ago | parent | prev | next [-]

You can view Claude Code as a non-deterministic compiler where you input english and get functioning code on the other end.

The non-determinism is not as much as a problem because you are reading over the results and validating that what it is created matches what you tell it to do.

I'm not talking about vibe-coding here, I'm grabbing the steering wheel with both hands because this car allows me to go faster than if I was driving myself, but sometimes you have to steer or brake. And the analogy favors Claude Code here because you don't have to react in milliseconds while programming.

TL;DR: if you do the commit you are responsible for the code it contains.

▲

pron 7 days ago | parent [-]

Sure, and that may be valuable, but it's neither "programming" nor "offloading mental effort" (at least not much).

Some have compared it to working with a very junior programmer. I haven't done that in a long while, but when I did, it didn't really feel like I was "offloading" much, and I could still trust even the most junior programmer to tell me whether the job was done well or not (and of any difficulties they encountered or insight they've learnt) much more than I can an agent, at least today.

Trust is something we have, for the most part, when we work with either other people or with tools. Working without (or with little) trust is something quite novel. Personally, I don't mind that an agent can't accomplish many tasks; I mind a great deal that I can't trust it to tell me whether it was able to do what I asked or not.

	▲	fsloth 6 days ago \| parent \| next [-]
		”it's neither "programming" Sure it is. Modern ecosystem is sadly full of API:s like WPF on Windows that are both verbose and configuration heavy. Now, some people may be able to internalize xaml with little effort but not all us - and then you basically move forward iteratively, looking for code example, trying this or that … basically random walking towards something usable. Or you use an agentic LLM and it does this peeking and poking for you, and with decades old APIs like WPF likely has enough context to do the thing you asked it to do far more competently than you could train yourself to program WPF in a few days. Of course in the context of this example WPF was your main duty, you _would_ learn the ins and outs of it. In quite many jobs a task like this may not infact be even among your top ten duties but you need to do it. In these sort of situations a LLM is really nice. The worst it provides is a good first guess how to do something. If your contex is something like radiology treatment then no, don’t use LLM! But there are thoushands of miserable non-critical but necessary components in production for which LLM is just fine.
	▲	gf000 6 days ago \| parent \| prev [-]
		There is probably a significant factor of what domain you work in, that might be part of the reason why you have different experiences. I don't know your current domain, but stuff like Loom requires very complex reasoning capabilities, most of which lives outside the actual codebase itself. Business code on the other hand often has a more direct mapping from idea to code, so LLMs might show more promise there.

▲ nostrademons 6 days ago | parent | prev [-]

So here's a transcript of a recent Claude Code interaction I had, trying to build an MP3 organizer that relies on heuristics and fuzzy matching to make sense out of all the random MP3s I downloaded in the 90s. This all happened in-between prompts: I gave it a correct example and an incorrect example and asked it why it was giving the incorrect example.

● Let me trace through what happens with the "Jordan Rudess" example:

  1. Directory: Jordan Rudess/
  2. file_rel_path would be Path("Jordan Rudess") (after removing MP3 Disk prefix)
  3. If Jordan Rudess matches in path component 0, scope_path would be Path("Jordan Rudess")
  4. _path_is_in_scope checks if "Jordan Rudess".startswith("Jordan Rudess") → True

  This looks correct. Let me check the inference logic:

● Read(mp3_organizer_v2.py) ⎿ Read 15 lines (ctrl+r to expand)

● The inference priority looks correct - path_component_match is indeed the highest priority. But I think I see the issue now. Let me look at the actual logic flow in the main organize method:

● Read(mp3_organizer_v2.py) ⎿ Read 10 lines (ctrl+r to expand)

● The flow looks right. Let me examine the actual problem more carefully. Looking back at the _apply_path_component_matching method, I think there might be an issue with when the matching happens. Let me check the logic:

● Update Todos ⎿ Check _path_is_in_scope logic for correctness Verify artist inference priority logic works with path_component_match Test and fix any remaining issues with artist propagation

● I think I see the issue now. The problem is that the current code only looks for matches in files that already have file_id3 tags, but it needs to propagate those matches to all files in the directory. However, the current implementation is still only applying matches to files that have matching candidates, not ALL files. Let me check the exact logic:

● Read(mp3_organizer_v2.py) ⎿ Read 8 lines (ctrl+r to expand)

● Actually, looking at this more carefully, the logic should work. The matched_artists dictionary contains the matches found from ANY file that had a file_id3 tag matching a path component, and then it applies that match to ALL files in the file_infos list via the loop at line 737.

  Wait, let me understand the scoping issue better. Let me check what file_rel_path and scope_path would actually be in your example:

● Read(mp3_organizer_v2.py) ⎿ Read 5 lines (ctrl+r to expand)

● Ah! I think I see a potential issue. The file_rel_path is calculated as the directory path, not the individual file path. So for all files in the same directory, file_rel_path will be the same. This should actually work correctly.

  But let me double-check the scope matching logic. If the directory is "Jordan Rudess" and the scope_path is also "Jordan Rudess", then _path_is_in_scope should return
  True.

  Let me debug this by adding some debug output to see what's actually happening. But first, let me look at a potential issue: maybe the match isn't being found in the first place?

It's doing a pretty complex chain of step-by-step reasoning here, looking in detail at the code. It ended up by printing out more debug info and having me re-run it again, then paste in the debug info, then add even more debug info and run it again. But it did eventually get the bug, which was non-trivial to identify (it was an aliasing problem where Claude was mutating state on a list and that list was shared with all the other files in the directory).

▲

Applejinx 6 days ago | parent [-]

Huh. Alternate explanation: there's a layer of indirection, drawing upon the unthinkable size of the source data, so rather than 'issue forth tokens as if there is a person answering a question', you've got 'issue forth tokens as if there is a person being challenged to talk about their process', something that's also in the training data but in different contexts.

I'm not sure statements of 'aha, I see it now!' are meaningful in this context. Surely this is just the em-dash of 'issue tokens to have the user react like you're thinking'?

	▲	nostrademons 5 days ago \| parent [-]
		I wonder if something else is going on, and perhaps Claude is using the LLM to identify the likely culprits within the codebase, sending the code around them to execute with an actual Python interpreter on their servers, feeding both the code and the result as the context window to another LLM query with a system prompt something like "What is this code doing, when it runs on this input and this output?", feeding the result of that back to the user, and then repeating as long as the overall bug remains unsolved. I've found that feedback is a very effective technique with LLMs, asking them to extract some data, testing that data through out-of-band mechanisms, then feeding the test results and the original context back into the LLM to explain its reasoning and why it got the result. The attention mechanisms in the transformer model function very well when they're prompted with specifics and asked to explain their reasoning. Only an Anthropic engineer would know for sure. I'm pretty sure that it was making multiple queries on my behalf during the chat transcript - each "Read ... mp3organizer_v2.py" is a separate network round-trip.