IgorPartola 2 days ago

As an LLM skeptic who got a Claude subscription, I can say the free models are both much dumber and configured for low latency and short, shallow replies.

No, it won't replace my job this year or the next, but what Sonnet 4.5 and GPT-5 can do compared to e.g. Gemini Flash 2.5 is incredible. They certainly have their limits and do hallucinate quite a bit once the context they are holding gets messy enough, but with careful guidance and context resets you can get some very serious work done with them.

I will give you an example of what it can't do and what it can: I am working on a complicated financial library in Python that requires understanding nuanced parts of tax law. A best-in-class LLM cannot correctly write the core library code because the algorithm is just not intuitive. But it can:

1. Update all invocations of the library when I add non-optional parameters that in most cases have static values. This includes updating over 100 lengthy automated tests.

2. Refactor the library to be more streamlined and robust. In my case I was using dataclasses as the base interface into and out of it, and it helped me split one set of classes into three (input, intermediate, and output) while fully preserving functionality. This was a pattern it suggested after a changed requirement made the original interface make far less sense.

3. Point me to the root cause of failing unit tests after I changed the code.

4. Suggest and implement a suite of new automated tests (though its performance tests were useless enough that I tossed them out in the end).

5. Create a mock external API for me to use based on available documentation from a vendor so I could work against something while the vendor contract is being negotiated.

6. Create comprehensive documentation on library use, with examples of edge cases, based on the code and its comments. Also generate solid docstrings for every function and method that didn't have one.

7. Research thorny edge cases and compare my solutions to commercial ones.

8. Act as a rubber ducky when I had to make architectural decisions, helping me choose the best option.
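To make (2) concrete, here is a minimal sketch of the input/intermediate/output dataclass split. All names and the flat tax rate are hypothetical, invented for illustration; the actual library isn't shown anywhere in the thread:

```python
from dataclasses import dataclass

# Hypothetical pattern: one overloaded dataclass split into three,
# one per pipeline stage, so callers never touch internal state.

@dataclass(frozen=True)
class TaxInput:            # what the caller supplies
    gross_income: float
    deductions: float

@dataclass(frozen=True)
class TaxIntermediate:     # internal working state, not part of the public API
    taxable_income: float

@dataclass(frozen=True)
class TaxResult:           # what the caller gets back
    tax_owed: float

def assess(inp: TaxInput) -> TaxResult:
    # Illustrative computation; real tax law is far messier than a flat rate.
    mid = TaxIntermediate(taxable_income=max(0.0, inp.gross_income - inp.deductions))
    return TaxResult(tax_owed=mid.taxable_income * 0.22)
```

Freezing the dataclasses keeps each stage's values immutable, which makes it easier to preserve behavior through a refactor like this.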
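And for (5), a mock vendor API can be as simple as a hand-rolled stand-in class mirroring the documented endpoints. Everything here (class name, methods, fields) is invented for illustration, since the real vendor docs aren't shown:

```python
class MockVendorAPI:
    """Fake client built from vendor docs; swap in the real client later."""

    def __init__(self):
        self._filings = {}

    def submit_filing(self, filing_id: str, payload: dict) -> dict:
        # Pretend the real API returns a status envelope like this.
        self._filings[filing_id] = payload
        return {"id": filing_id, "status": "accepted"}

    def get_filing(self, filing_id: str) -> dict:
        return self._filings.get(filing_id, {"status": "not_found"})
```

The point is that integration code can be written and tested against this object now, then pointed at the real client once the contract is signed.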

It did all of the above without errors or hallucinations. It's not that I am incapable of doing any of it myself, but it would have taken me longer and tested my patience for most of it. Manipulating boilerplate, or documenting the semantic meaning of a dozen new parameters that control edge-case behavior relevant only in very specific situations, is not my favorite thing to do, but an LLM does a great job of it.

I do wish LLMs were better than they are, because for as much as the above worked well for me, I have also seen them do some really dumb stuff. But they are already far better than they have any right to be. Here is a short list of other, non-code-related things I have tried with them that worked incredibly well:

- explaining pop culture phenomena. For example, I had never understood why Dr Who fans take a goofy, campy show aimed (in my opinion) at 12-year-olds as seriously as if it were War and Peace. An LLM let me ask all the dumb questions I had about it and answered them well.

- have a theological discussion on the problem of good and evil as well as the underpinnings of Christian and Judaic mythology.

- analyze my taste in rock and roll in depth and help fill in the gaps in its evolution. It actually helped me identify why I like the music I like despite my tastes spanning a ton of genres, and, specifically for rock, it created one of the best and most well-curated playlists I have ever seen. This is high praise from me, since I pride myself on creating really good thematic playlists.

- help answer my questions about woodworking and vintage tool identification and restoration. This stuff would have taken ages to research on forums, and the answers would still be filled with purism and biased opinions. The LLM was able to cut through the bullshit with some clever prompting (asking it to act as two competing master craftsmen).

- act as a writing critic. I occasionally like to write essays on random subjects. I would never trust an LLM to write an original essay for me, but I do trust it to tell me when I am using repetitive language, when pacing and transitions are off, and, crucially, how to improve my writing style to take it from B-level college student to what I consider close to professional writing in a variety of styles.

Again, I want to emphasize that I still very much believe there is a marketing and investment bubble and that what LLMs can do is way overhyped. But at the same time, over the last few months I have been able to do all of the above just out of curiosity (the first coding example aside). These are things I would never have had the time or energy to get into otherwise.

boggsi2 2 days ago | parent [-]

You seem very thoughtful and careful about all this, but I wonder how you feel about the emergence of these abilities in just three years of development. What do you anticipate they will be capable of in the next three?

With no disrespect, I think you are about 6-12 months behind SOTA here; the majority of recent advances have come from long-running task horizons. I would recommend trying some kind of IDE integration or CLI tool. It feels a bit unnatural at first, but once you adapt your style a bit, it is transformational. A lot of context-sticking issues get solved on their own.

IgorPartola 2 days ago | parent [-]

Oh, I am very much catching up. I am using Claude Code primarily, and I have also been playing a bit with all the latest API goodies from OpenAI and Anthropic: custom tools, memory use, and creating my own continuous compaction algorithm for a specific workflow I tried. There is a lot happening here very fast.
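The commenter's actual compaction algorithm isn't shown; a toy version of the general idea (fold everything but the recent tail into a summary entry, with the summary stubbed out where a real workflow would call the model) might look like:

```python
def compact(messages: list[str], keep_last: int = 4) -> list[str]:
    """Replace all but the last keep_last messages with one summary stub,
    bounding how much history gets fed back into the context window."""
    if len(messages) <= keep_last:
        return messages
    dropped = messages[:-keep_last]
    # Stand-in for an actual LLM-generated summary of the dropped turns.
    summary = f"[summary of {len(dropped)} earlier messages]"
    return [summary] + messages[-keep_last:]
```

Running this after every turn gives "continuous" compaction: the history never grows past keep_last messages plus one rolling summary.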

One thing that struck me: these models were all trained on data that ends 1-2 years back. I think the training cutoff for Sonnet 4.5 is around May 2024, so I can only imagine what is being trained and tested currently. Also, for the types of semi-complex non-coding tasks I have tried (like interpreting my calendar events), these models are so far ahead of the likes of Qwen and Llama that it isn't even close.
