| ▲ | microtonal 10 hours ago |
| Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models. I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic: I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict/specific instructions. I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap. |
|
| ▲ | Brendinooo 9 hours ago | parent | next [-] |
| Yeah, there's a real opportunity for one of these companies to invest time in a model that's tuned for, to use your term, agent-assisted development. Trouble is, everyone inside their buildings seems to believe that no one will be working like that in a year or two. |
| |
| ▲ | everforward 9 hours ago | parent | next [-] | | There’s no way to justify their valuations if they get downgraded to a pair programming tool. They need fully agentic stuff to work and replace human engineers to even come close. Offhand, I’m not even certain whether a model like that could justify the constant retraining we’re doing on the agentic models. It doesn’t make a lot of sense to spend millions or billions on training to reduce hallucinations by 0.3% if your model assumes a human is in the loop to course-correct them. | | |
| ▲ | keeda 8 hours ago | parent | next [-] | | Some napkin math -- total global labor compensation is about 50% of the GDP, which puts it in the USD 50 - 60 Trillion range: https://ourworldindata.org/grapher/labor-share-of-gdp This source claims that knowledge workers alone (probably because they are paid much more) account for 35 - 50 Trillion of that: https://github.com/danielmiessler/Substrate/blob/main/Data/K... If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value annually. Even if the AI industry can capture a fraction of that, that is a huuuge monetization opportunity. Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.) Now, I don't think AGI will happen soon (or has already happened, depending on how you define it) but I do think humans will be a much smaller part of the loop and large-scale job displacement will happen once companies figure out how to properly use AI. At this point, the financial upside for the AI industry is extremely high but will be limited by the social turmoil that will inevitably ensue (which we're already seeing brewing in the data center backlash.) | | |
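The napkin math in the comment above can be reproduced with a short sketch. It assumes, as the comment implicitly does, that compensation is a usable proxy for the value of knowledge workers' output; the 35 - 50 trillion range and the 5% and ~30% boosts are the commenter's figures, and the function name is only illustrative. With these inputs the low end at 5% comes out slightly above the quoted ~1.5T.

```python
# Inputs are the rough figures quoted in the comment, not independent data.
KNOWLEDGE_WORKER_COMP_TRILLIONS = (35.0, 50.0)  # low/high estimate, USD trillions per year


def value_from_boost(comp_range, boost):
    """Annual value (USD trillions) of a fractional productivity boost.

    Assumption: output value scales linearly with compensation.
    """
    low, high = comp_range
    return (low * boost, high * boost)


for boost in (0.05, 0.30):
    low, high = value_from_boost(KNOWLEDGE_WORKER_COMP_TRILLIONS, boost)
    print(f"{boost:.0%} boost: {low:.2f}T - {high:.2f}T per year")
```

This only sizes the total value created; how much of it any vendor could capture is a separate question that the replies take up.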
| ▲ | e9 8 hours ago | parent | next [-] | | I want to propose alternative reality where 1.5-2.5T in value doesn't go to a handful of companies. Instead it turns out to be like restaurants where this gets distributed to lots and lots of small, local, mostly interchangeable teams. There will of course be some super star "chefs" leading the industry and setting trends and some "restaurant chain" like big businesses and supply chain for all of this. | | |
| ▲ | keeda 5 hours ago | parent | next [-] | | FWIW I do think that availability of competitive open weight and other non-frontier models, along with improvements in harnesses that can get good results out of these models, will result in less concentration and a healthier marketplace. However, these frontier labs are also making moves that could let them capture a disproportionate share of the upside. One possibility is a situation analogous to the smartphone manufacturing space, where there are dozens of players but just a handful (e.g. Apple, Samsung in smartphones) capture the lion's share of the revenue. | | |
| ▲ | skeptic_ai 3 hours ago | parent [-] | | With Apple you can’t exit the ecosystem. Samsung is the same, and it is the best Android device. If a Nokia OS arrived tomorrow it would be dead in the water: it has no apps. But with a new LLM that doesn’t matter. There is nothing sticky about typing Gemini, Claude or codex in a CLI. | | |
| ▲ | keeda an hour ago | parent [-] | | There's nothing sticky today but you can bet they're working maniacally to fix that. These companies will make most of their money in the enterprise space and there are probably unlimited ways to engineer stickiness in an enterprise setting. Like, MSFT still rakes in those billions despite pretty much every one of their products having commodity competitors. The AI labs are also making moves to secure long-term enterprise presence, such as their Forward Deployed Engineer strategy. I think that is a trojan horse play that could make enterprises dependent on them forever, much like so many companies are still dependent on IBM's mainframes. As an extreme example, you could imagine a company's core business logic encoded in the weights of a proprietary model custom-trained and hosted by one of these model providers, something even more inscrutable and sticky than ancient COBOL codebases. |
|
| |
| ▲ | xxpor 7 hours ago | parent | prev | next [-] | | The world is not zero sum. Value is created, not just preserved. Anthropic and OpenAI creating value does not imply that smaller guys can not also create value. | | |
| ▲ | afavour 7 hours ago | parent [-] | | But marketplaces also exist and big players in a marketplace are often able to manipulate the market such that they are advantaged and small players are not able to break in. | | |
| ▲ | mpyne 6 hours ago | parent [-] | | This is true of every market that has ever existed, and that's not stopped small players from finding niches. |
|
| |
| ▲ | bdamm 7 hours ago | parent | prev | next [-] | | How? Training and operating models seems to naturally focus on those willing to invest quite significantly in these operations. | | |
| ▲ | nish__ 6 hours ago | parent [-] | | If RAM prices come down, running your own models will be relatively affordable. |
| |
| ▲ | actionfromafar 7 hours ago | parent | prev [-] | | Sysco is pretty big. |
| |
| ▲ | ricardobayes 7 hours ago | parent | prev | next [-] | | I am deeply surprised by the silence of philosophers, sociologists, liberal arts majors, economists. Where are the think tanks who contemplate and debate the societal aspects? The tech is advancing full steam but the "other side" doesn't feel anywhere nearly ready. | | |
| ▲ | bloppe 7 hours ago | parent | next [-] | | Idk why you're perceiving silence. Feels to me like this is the main thing people talk about nowadays. | | |
| ▲ | scarmig 7 hours ago | parent [-] | | It has to do with the scope of what they're discussing. It seems extraordinarily small: e.g. what if AI increases productivity growth by 0.4%? Do data centers use too much water? Are AIs racist when reviewing resumes? The frontier labs, on the other hand, are thinking about replacing all human labor, ending death, and the risk of it causing human extinction. Most of the apparatus we're talking about approach it very parochially; it's almost like they're embarrassed to take the grander ideas even a little seriously, for being too nerdy/sci-fi. | | |
| ▲ | freejazz 6 hours ago | parent [-] | | The public would happily string up any of these CEOs if given the chance |
|
| |
| ▲ | bdamm 7 hours ago | parent | prev | next [-] | | Because the "other side" is busy trying to anthropomorphise AI into solving the trolley problem, while being mostly clueless about the actual problems. They'll show up after the fact and whinge endlessly about how they should have been involved. | | |
| ▲ | DrewADesign 4 hours ago | parent [-] | | I guess the real problems are things like people not being allowed to post AI-generated images in digital drawing, painting, and photography communities, because I see a lot of boosters ceaselessly whining about that abject “discrimination”, despite having plenty of places where people post all kinds of that garbage all the time. Or maybe every cultural group has its own set of whiners and we always think the ones we disagree with are the loudest. |
| |
| ▲ | digitaltrees 7 hours ago | parent | prev | next [-] | | Reid Blackmun has written several books and has a consulting agency to guide ethical implementation of AI | |
| ▲ | freejazz 6 hours ago | parent | prev | next [-] | | Silence? Even the pope has come out against AI? Who hasn't? Diplo?? | |
| ▲ | DrewADesign 4 hours ago | parent | prev [-] | | Sometimes the great algorithmic gods give us a glimpse of our own bubble. | | |
| |
| ▲ | everforward 6 hours ago | parent | prev | next [-] | | > Note, at 5% productivity boost, humans are not just in the loop, they are the loop. AGI or large-scale replacement of humans is not even needed, but the financial opportunity is already immense, and it scales with how much human productivity can be improved (i.e. how much work can be offloaded to LLMs.) The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed, which for me would probably translate to something like a 3% increase in productivity. I spend a lot more time on things like getting agreement between teams, documenting approaches to things that don't exist on the wiki, etc, that LLMs are significantly less effective at. Or just can't do; no one will be happy if I send an LLM instead of me to meetings. I suspect a lot of roles are like that. They give a 10-30% boost to the core role function, but that core role is still only 30-50% of what you do. > that is ~1.5 - 2.5T in value annually That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap. And this is before we consider that they need to do it for cheaper or why would anyone bother. Realistically, potential revenue is probably half that at best. It's also before cutthroat pricing really kicks in. People are willing to pay for Claude right now; I still suspect that as time goes on people will start looking towards Deepseek/GLM/etc models that provide 95% of the performance at 10% of the price. That'll cut the market even further. The question is how much demand for knowledge work swells as prices fall, and whether that's a soft landing or a crash. | | |
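The "10% faster at coding is about 3% more productive" estimate above is an instance of Amdahl's law: only the accelerated share of the job benefits. A minimal sketch, assuming the role splits cleanly into an accelerated core and an unaffected remainder (the function and parameter names are illustrative, not from the thread):

```python
def overall_gain(core_share, core_speedup):
    """Overall throughput gain when only part of the work gets faster.

    core_share:   fraction of total time spent on the accelerated task (0..1)
    core_speedup: fractional speed increase on that task (0.10 = 10% faster)
    """
    # Time for the same amount of work, with the original total normalized to 1.
    new_time = (1.0 - core_share) + core_share / (1.0 + core_speedup)
    return 1.0 / new_time - 1.0


# Ranges mentioned in the comment: core work is 30-50% of the job,
# and tools speed it up by 10-30%.
for share in (0.3, 0.5):
    for speedup in (0.10, 0.30):
        gain = overall_gain(share, speedup)
        print(f"core share {share:.0%}, speedup {speedup:.0%}: overall +{gain:.1%}")
```

With coding at 30% of the job and a 10% speedup, this gives a little under 3%, in line with the commenter's estimate.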
| ▲ | keeda 3 hours ago | parent [-] | | > That seems really large, but it's ~2-3x Walmart's yearly revenue, and OpenAI and Anthropic both have estimated valuations that compare to Walmart's market cap.
... It's also before cutthroat pricing really kicks in. Right, that's more of an estimate on the value proposition of the overall AI industry, rather than valuations of the industry or specific players. While I don't think OpenAI and Anthropic will capture all of the potential upside, I do suspect they will do much better than other players despite the competition (https://news.ycombinator.com/item?id=48740472) > And this is before we consider that they need to do it for cheaper or why would anyone bother. Typically yes, but there are reasons companies may be willing to pay the same amount or even more, such as "AI doesn't need sleep, holidays, insurance, or benefits" and "AI is easier to procure and replace than humans." > The studies I've seen recently (at least in the software space) put it at something like a 10% increase in coding speed... Curious to see which studies you're looking at, the studies I'm thinking of (some here: https://news.ycombinator.com/item?id=45379452) are from 2024 - 2025, so already old and before agents really took off. However, your point about meetings and agreements and documenting is much more germane. My theory is that the largest productivity gains -- and subsequent labor displacement -- will come from reducing coordination overhead: https://news.ycombinator.com/item?id=48040999 |
| |
| ▲ | danenania 7 hours ago | parent | prev | next [-] | | I’d also point out that LLM inference revenue already totals more than 100B annually based on publicly reported numbers. Almost none of that is replacing knowledge workers. Almost all is increasing their productivity. So empirically what you describe is already happening to a nontrivial degree. | |
| ▲ | hedora 7 hours ago | parent | prev | next [-] | | You’re trying to apply value-based pricing (infinite margin upside) to a commodity. Pre-bubble pricing: $1400 gets a 128GiB iGPU optimized for inference. GLM and Kimi need 800-1000GiB. Call it 1TiB. The $1400 boxes could be ganged into sets of 4-8, with a switch. Call the switch $1000. Each box has a TDP of 250W. 8 x 250W / 120V = 16.67A, or one household circuit in the US, so no new power infrastructure is needed. $1400 x 8 + $1000 = $12,200. Assuming standard five year depreciation, that’s $2440 a year. There are a billion knowledge workers alive today. So that’s $2.4T annual revenue. Average net profit margins on computer hardware are 4.3%. That works out to $105B net income, globally. So, I guess the question is whether the (currently #2) open weight models provide $1.4-2.4T less value per year than the #1 and #3 models, and, if so, if customers can measure this, or are willing to spend 2x more and deal with censorship, data theft, intentional enshittification, sabotage, ads, product placement, etc, to get the slightly “better” model. Also, note that my numbers assume Moore’s law stopped for all time in 2024, but we’ve seen HW improvements since then. | | |
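The hardware estimate above is easy to check step by step. The sketch below uses only the commenter's figures; the one-cluster-per-knowledge-worker reading is implied by the comment rather than stated, and the constant names are illustrative.

```python
# All inputs are the commenter's figures, not measured prices.
BOX_PRICE_USD = 1400          # one 128 GiB inference box
BOX_MEMORY_GIB = 128
BOXES = 8
SWITCH_USD = 1000
BOX_TDP_W = 250
MAINS_V = 120                 # US household voltage
DEPRECIATION_YEARS = 5
KNOWLEDGE_WORKERS = 1_000_000_000
HARDWARE_NET_MARGIN = 0.043

total_memory_gib = BOXES * BOX_MEMORY_GIB                # 1024 GiB, i.e. 1 TiB
cluster_cost = BOX_PRICE_USD * BOXES + SWITCH_USD        # purchase price of one cluster
annual_cost = cluster_cost / DEPRECIATION_YEARS          # straight-line depreciation
current_amps = BOXES * BOX_TDP_W / MAINS_V               # draw at full TDP
annual_revenue = annual_cost * KNOWLEDGE_WORKERS         # assumes one cluster per worker
net_income = annual_revenue * HARDWARE_NET_MARGIN

print(f"memory:         {total_memory_gib} GiB")
print(f"cluster cost:   ${cluster_cost:,}")
print(f"per year:       ${annual_cost:,.0f}")
print(f"current draw:   {current_amps:.2f} A")
print(f"annual revenue: ${annual_revenue / 1e12:.2f}T")
print(f"net income:     ${net_income / 1e9:.0f}B")
```

The arithmetic holds under the stated inputs; the conclusion is sensitive mainly to the per-box price and the assumption that every knowledge worker gets a dedicated cluster.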
| ▲ | keeda 2 hours ago | parent [-] | | Right, that number is more of an estimate of the value proposition of the entire AI industry rather than projections of revenue or valuations... it's essentially an estimate on how much the market could theoretically bear. Whether the companies can capture that value is, to your point, rightly a different question. I do think open weight and other competitor models, especially with better harnesses, will play a significant role in the equation and will result in less concentration in the market. However, I do also think the big AI companies will capture a lot of that value. Partially for the same reasons that the cloud industry has been growing like gangbusters, even pre-AI, despite on-prem being much cheaper: companies will outsource anything that is not deemed a "core competency" for their business. A lot of the problems you mentioned will be relegated to the consumer market and won't apply to enterprise contracts -- which is where the real money is. |
| |
| ▲ | parineum 7 hours ago | parent | prev | next [-] | | > If LLMs can boost their productivity even by an average of 5% (studies from ~2024 put it in the ~30% range depending on task) that is ~1.5 - 2.5T in value Minus the cost of inference, that might not be the boon you're making it out to be. I hear what people around here are spending on their api and I'm skeptical that these tools are making me that much more productive. Personally, for assisted development, I haven't seen much progress in a while. | |
| ▲ | 4rf 3 hours ago | parent | prev [-] | | What a load of nonsense lmao. Pls stop posting you are creating noise. |
| |
| ▲ | overgard 9 hours ago | parent | prev | next [-] | | That's a really good point. I think if there wasn't the insane amount of money involved and these were treated as tools instead, they would probably be MORE productive. I think a person working hand in hand with an AI instead of delegating is the sweet spot of making things fast while also not losing understanding or control of the system. You are absolutely right that these companies can't justify their valuations if they do that though. I just got a new mac to run models locally, and so far the results have been positive with some small hiccups. I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx" | | |
| ▲ | everforward 5 hours ago | parent | next [-] | | > I'm thinking the future of this tech will likely be better tooling with better IDE integrations rather than "Claude plz make me a SaaS kthx" I think this sort of thinking is a trap, because it presumes that all software has the same constraints. There's a spectrum of requirements between "chuck this over the wall at Claude, it only has to work once" and "this is a literal rocket ship, formally verify the whole thing". I've made some things with Claude I don't understand and don't control. It's fine, they're still useful to me. Things for the house that I wasn't going to build manually, some dashboarding stuff and scripts for work, stuff that can crash and burn and I'll be fine. They won't justify trillions in investment, but they are useful. Equally, I do agree with you on some things. Sometimes I hand-hold the LLM or forgo it entirely because I want to be 100% sure I know how something works, and can justify a decision if it causes a production outage. I think the future is probably multiple different tools with different goals. Better IDE integration for some uses, an entirely separate "LLM herd controller" kind of thing for when you're okay with vibe-coding, and the most interesting is something in the middle where you're more in the loop than pure vibe-coding, but don't see the full context like in an IDE. Something where it surfaces changes to key components, but hides things like test changes. | | |
| ▲ | balder1991 4 hours ago | parent [-] | | It’s what’s called “casual software” in software engineering, as distinct from “business software” and “critical software”. Not all types need a high bar of quality, and most software engineering practices are tailored for business applications that will be made available to multiple users. As you said, a script that only you use personally, or a very simple thing that accomplishes one task and is easy to test, requires almost no engineering, and an LLM can often build those with very few downsides. |
| |
| ▲ | ah1508 6 hours ago | parent | prev | next [-] | | > while also not losing understanding That's a key point. Keeping knowledge and know-how inside the company is strategic. For most people GPS did not result in a better sense of direction, spellchecking did not help them write without making mistakes, and delegating translation to DeepL does not help them get better at a foreign language. I don't see the gain for an individual, a company, or a society if a technology reduces the ability to think, do stuff, understand complex problems, and work hard at something. Hiring juniors also matters: what is boring for a senior dev is useful for a junior, like the "wax on, wax off" in The Karate Kid. Then when the senior dev retires, the junior is not junior anymore and the know-how is still here. I want to transfer my knowledge to a junior, not to Anthropic or Google or OpenAI. Ideally, working hand in hand with an AI could be like driving a motorcycle vs riding a bicycle. Both are fine, but you go much faster with a motorcycle and you don't lose any ability. But prompting a motorcycle auto-pilot by voice sounds a bit stupid and boring. Insane use of energy rarely comes into the equation, which is a bit weird. Personally it is why I am never tempted to use AI. However I see value in AI for finding weaknesses in code (the inverse of flattery), writing tests with all the edge cases based on specs since tests are often sloppy, and asking for a fresh view on a very difficult problem. I'd love to hear about the equivalent of move 37 in game 2 (AlphaGo vs Lee Sedol) in a difficult programming task. But I think that massive delegation of code writing is how you lose the knowledge and the know-how: what keeps us sharp. Final word: I once asked Claude for a review, and the code involved a db transaction. Nothing complicated, Claude said everything was fine. However the transaction isolation level was not set (I did it on purpose, as if I did not know about isolation levels). 
It did not ask me if it was my intention to keep the default level. I would have preferred challenging feedback: why did you choose the default isolation level? Is it on purpose? Do you know that the default depends on the db? Do you know about isolation? Tell me about the business use case and I'll explain which one would be the best. | |
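To make the isolation-level point concrete: the default really does differ between engines, so leaving it unset is itself a decision. The sketch below lists documented defaults for a few common databases and shows the kind of question a reviewer could raise. The helper names `explicit_isolation_sql` and `review_question` are hypothetical, and the generated statement follows SQL-standard syntax as accepted by PostgreSQL, MySQL and SQL Server; other engines differ (SQLite has no such statement), as does where the statement must be issued relative to the start of the transaction.

```python
# Documented default isolation levels for a few common engines.
DEFAULT_ISOLATION = {
    "postgresql": "READ COMMITTED",
    "mysql-innodb": "REPEATABLE READ",
    "sqlserver": "READ COMMITTED",
    "oracle": "READ COMMITTED",
    "sqlite": "SERIALIZABLE",
}

STANDARD_LEVELS = (
    "READ UNCOMMITTED",
    "READ COMMITTED",
    "REPEATABLE READ",
    "SERIALIZABLE",
)


def explicit_isolation_sql(level):
    """Hypothetical helper: build the SQL-standard statement that pins a level."""
    normalized = " ".join(level.upper().split())
    if normalized not in STANDARD_LEVELS:
        raise ValueError(f"unknown isolation level: {level!r}")
    return f"SET TRANSACTION ISOLATION LEVEL {normalized}"


def review_question(engine, requested_level=None):
    """Return a reviewer's question when no level was set, else None."""
    if requested_level is not None:
        return None
    default = DEFAULT_ISOLATION.get(engine)
    if default is None:
        return f"No isolation level is set; what does {engine} default to?"
    return f"No isolation level is set, so {engine} will use {default}. Is that intended?"


print(review_question("postgresql"))
print(review_question("mysql-innodb"))
print(explicit_isolation_sql("serializable"))
```

Which level is right still depends on the business use case, which is exactly the follow-up question the comment wanted the reviewer to ask.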
| ▲ | user43928 7 hours ago | parent | prev [-] | | I am thinking the opposite. I've been having great results with handing more and more responsibilities to the agent. Contrary to what some people suggest, I have not hit any maintenance or reliability dead ends. If something breaks, the agent fixes it. If it cannot, I have the agent instrument the code and work through the logs to check hypotheses, until the source of the issue is found. If even that would fail, which did not yet happen, I can still do some old fashioned digging and learning, like I always have. This is for native mobile app development, and the code base is around 100k LOC. |
| |
| ▲ | tskj 8 hours ago | parent | prev | next [-] | | Dario has publicly claimed each model has been profitable, even accounting for its training costs; it's just that each new model is exponentially more expensive to train than the last, so the income lags and it looks like the company is losing money overall. Now, we can't know if this is true unfortunately, but it's not directly contradicted by anything that's known publicly at least. I thought it was an interesting way to frame it and makes the whole situation look marginally less bad. | | |
| ▲ | NorwegianDude 6 hours ago | parent | next [-] | | A common extreme misconception is that inference is expensive and that providers are losing a lot of money. Inference is extremely lucrative and profitable. | | |
| ▲ | drob518 3 hours ago | parent [-] | | Inference is the phase where they make money. But the question is whether they can be profitable overall as training continues to balloon. |
| |
| ▲ | 4rf 3 hours ago | parent | prev [-] | | why are you listening to these idiots who have every incentive to spin the story as much as possible? FCFF = EBIT(1-t) - Reinvestment. I don't care about your gross profit - this kind of cash profit determines the value of operating assets. |
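The formula in the comment above is free cash flow to the firm: after-tax operating income minus reinvestment, where reinvestment is commonly taken as net capital expenditure plus the change in working capital. A minimal sketch with made-up numbers follows; treating model-training spend as reinvestment is an assumption for illustration, and none of the figures come from any company's filings.

```python
def fcff(ebit, tax_rate, reinvestment):
    """Free cash flow to the firm: EBIT * (1 - t) - reinvestment."""
    return ebit * (1.0 - tax_rate) - reinvestment


# Hypothetical figures in USD billions: operating income is positive,
# but spending on the next training run is counted as reinvestment.
operating_income = 10.0
tax_rate = 0.25
training_and_capex = 30.0

print(fcff(operating_income, tax_rate, training_and_capex))  # -22.5
```

This is the gap between the two positions in the subthread: a model can look profitable on its own while the firm's free cash flow stays negative as long as reinvestment keeps growing faster than after-tax operating income.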
| |
| ▲ | sanderjd 9 hours ago | parent | prev | next [-] | | My two cents is that the way to square this circle is that the valuations should be lower and they should be spending a lot less on constant retraining. Unfortunately (from my perspective) it seems like the US companies are increasingly stuck in their current model. I think it's a competitive disadvantage. But obviously most of the real insiders seem to disagree with me, so I'm probably wrong :) | | |
| ▲ | wyre 8 hours ago | parent [-] | | The insiders disagree because they are benefiting greatly from the insane valuations, right? Chinese models are quickly commodifying frontier inference, the US Gov is preventing domestic SOTA models access to the public and without those models why would consumers still spend $200/month to use the best models? It’s such a mess and isn’t inspiring confidence as a non-investor. | | |
| ▲ | sanderjd 8 hours ago | parent [-] | | Are they benefiting from the insane valuations though? If the valuations deflate before the insiders are able to exit, I think that would be worse for them than a lower but sustainable valuation. It all comes down to whose prediction of the future is closer to correct. I think the most likely future is commodification of inference and "agent-assisted" rather than "agent-driven" workflows dominating the future of work. But insiders - who both know way more than me, and also have more skin in the game, both for better and worse - seem to really think I'm wrong about that. So I dunno! Could go either way! | | |
| ▲ | drob518 3 hours ago | parent | next [-] | | It’s all about timing. This is tech bubble 2.0, Dotcom Boogaloo. If you’re able to flip it quickly, you’ll have generational wealth. If not, you could be holding a lot of worthless paper. | | |
| ▲ | sanderjd an hour ago | parent [-] | | Yes. But is your impression that this is the strategy of people like Amodei? My impression is that it isn't, that they are actually true believers, and not just trying to hit the timing right and flip it. |
| |
| ▲ | wyre 6 hours ago | parent | prev [-] | | Even if the future is agent-driven workflow, that doesn't stop the commodification of inference. A good agent-driven workflow, in my experience, is a byproduct of the harness and scaffolding around the agent. What insiders are you talking about? They're going to be hot towards the possibilities so they can exit to a massive windfall. I don't know why they would want to be publicly critical of these technologies that could make millions on IPO. | | |
| ▲ | sanderjd 6 hours ago | parent [-] | | I'm talking about people who work at the frontier labs who talk to the press, and what seems to be the revealed beliefs of those same people from the strategies we see their companies pursuing. My point is that actually it would be worse for these people if the valuations are only high during this period - which will last awhile longer from now! - where their equity is not liquid, but crashes as the market figures out this commoditization thing. But if we're wrong about how that's going to go, then this isn't a concern because there won't be any devaluation. And to me that seems to be what they honestly think is going to happen. And they know more than me (and I think they're a lot smarter than me), so this does temper my confidence in my own predictions. |
|
|
|
| |
| ▲ | ricardobayes 7 hours ago | parent | prev | next [-] | | At some point it's going to plateau, maybe already has. Then they will switch to FPGA/ASIC-based model-specific hardware for lower consumption. I'm pretty sure the "space data centers" won't use GPUs, they are not radiation-tolerant whereas FPGAs can be. https://www.cerebras.ai/blog/gemma-4-on-cerebras-the-fastest... | | |
| ▲ | quaverquaver 7 hours ago | parent [-] | | I would not take "space data centers" as a given! From most to least likely these will be vaporware, vaporized-ware, rubble-ware, loss leaders. | | |
| |
| ▲ | JumpCrisscross 9 hours ago | parent | prev | next [-] | | > no way to justify their valuations if they get downgraded to a pair programming tool I think there is. Pair today doesn’t mean they’re locked into that forever. | | |
| ▲ | 4rf 3 hours ago | parent | next [-] | | you always post about valuations but never share your own. go ahead m8 we are all waiting... the stage is yours. lets see your model. | |
| ▲ | ChrisLTD 6 hours ago | parent | prev [-] | | Their valuations don't make sense as just programming tools, period. Forget about if they are still human driven. |
| |
| ▲ | EddieRingle 7 hours ago | parent | prev [-] | | > There’s no way to justify their valuations if they get downgraded to a pair programming tool. Honestly I still don't see how they justify their valuations, period. If anything they're serious liabilities. Open-weight models are improving and reaching "good enough" levels for more and more tasks. They're also known quantities; you know what you're getting with them and don't have to worry about the model silently (or not so silently) being switched out from under you (whether that's because Anthropic/OpenAI decides you're not worthy of their latest and greatest for one reason or another, or they switch you to a quantized model to save on compute, or they simply sunset the specific model you've been relying on). And if the open-weight model doesn't run on your local hardware already, there are any number of hosting providers that will handle that for you (so you're back to just paying for colocation/cloud usage instead of nebulous tokens). Closed models are improving as well, sure, but diminishing returns will eventually kick in (as they already have for various tasks, as I said). So if not their models, where does their value come from? Just simple network effects/lock-in? "Normal" users will drift to other options if they start showing more and more ads, and enterprise customers will surely be looking for opportunities to avoid lock-in and reduce risk. I think the last argument I've heard is that these valuations are basically a bet that Anthropic and/or OpenAI will achieve AGI that can fully replace human labor, so they'll essentially be able to sell that replacement labor to everyone. They haven't managed to pull that off, yet, however. Businesses that have tried to replace humans almost immediately realized either that the AI's capabilities were oversold or that they at least needed a human in the loop still, to some degree. 
And even if they do achieve AGI, that would surely become an issue of national security (they're already flirting with that today), so who's to say governments won't simply nationalize the best AI labs and either remove them from the economy entirely or perhaps even provide models as a public service to level the playing field? That all sounds like a giant gamble, if anything. And it's incredibly frustrating to watch as someone that's been unemployed for a year because (a) budgets are being burned on tokens and (b) LLM-generated applications are flooding hiring teams and preventing real people from being seen. (Not to mention, as someone that spends a lot of time in gaming circles, the fact that DRAM and flash storage is quickly becoming inaccessible is just an additional frustration that means people can't even find temporary relief in entertainment.) I can only hope this bubble finally implodes before I lose my house. | | |
| ▲ | 31 minutes ago | parent | next [-] | | [deleted] | |
| ▲ | pixl97 6 hours ago | parent | prev [-] | | >Open-weight models are ... <banned> Not the first one to come up with that likely outcome either. I mean, if you're being restricted from SOTA models now, how long do you expect before the FBI kicks in your door for using an 'illegal' open model? |
|
| |
| ▲ | pkulak 8 hours ago | parent | prev | next [-] | | And every benchmark is "build GTA-6 from nothing, as a single-page web app". | |
| ▲ | ricardobayes 8 hours ago | parent | prev | next [-] | | They have to, but also everyone working at 3D printing companies thought "industry 4.0" was going to completely override everything, that we were going to print housing and print a mug at home and drink coffee out of it. Today's news is that Amazon is hiring 11k interns. I think part of the AI story was used as a convenient excuse to get rid of some "fat" and some covid overhiring and gave companies an out to change course. | |
| ▲ | rconti 7 hours ago | parent | prev | next [-] | | I wonder how portable the existing models are for different use cases. As good as they are for greenfield development or working in a single or across a few tightly coupled repos, they're absolutely terrible at debugging distributed systems and make incredibly wrong yet extremely confident assertions all the time. I don't know if it's a matter of just requiring a tiny amount of optimization or wholesale redesign. | |
| ▲ | popalchemist 9 hours ago | parent | prev | next [-] | | Whether they believe it or not is immaterial. It is the end-goal they want to achieve, because then they own the means of production entirely. | | |
| ▲ | pigpop 8 hours ago | parent | next [-] | | They own the means of production for the leading models but they're far from monopolizing them since the techniques are well known. At this point it's a matter of having a head start and lots of capital to pay for the data annotation and GPU time to train them. Others are playing catch-up but they're hot on their heels, which is the biggest reason for them to continue spending like crazy to keep their leads. For the non-bleeding edge they have a lot of competition with more competitors showing up every day. The way this is playing out is not surprising, it's similar to any other technological breakthrough as it becomes commercialized. Eventually those means of production will become commoditized as well. | | | |
| ▲ | quaverquaver 7 hours ago | parent | prev | next [-] | | these are capital intensive commodity businesses. They can be plenty big - see railroads or airplanes... or refining... but that doesn't mean that most value won't be added elsewhere. | |
| ▲ | jatora 8 hours ago | parent | prev [-] | | I find these nefarious intention theories shallow. It can both be the case that the endstate is them owning the means of production without that being the intended guiding goal. Companies can chase profit without being Leninistic boogeymen. | | |
| ▲ | WhyIsItAlwaysHN 8 hours ago | parent | next [-] | | There is no nefariousness in owning all the means of production, it's the endgame of maximizing profit. However the result is exactly the same, concentration of power. | | |
| ▲ | pigpop 7 hours ago | parent | next [-] | | This is such a defeatist and low agency take. "means of production" are not a limited resource like gold that you have to extract from natural sources or divvy up. They are fundamentally skill and knowledge that anyone can attain and put to use, maybe not on the same scale as a well funded business but even those businesses had to start somewhere in order to grow to the size they are now. So rather than casting aspersions on them, your time would be better spent learning how you too can create some means of production and start producing value. | | | |
| ▲ | popalchemist 6 hours ago | parent | prev [-] | | No nefariousness other than the subjugation of the majority of humanity? You're insane | | |
| ▲ | WhyIsItAlwaysHN 5 hours ago | parent [-] | | What I meant is that nefariousness from people is not a prerequisite. It's a machine that wants to maximize all profit and all the evil is a natural product. If you magically put saints in charge they would be eaten and replaced by the same kind of people very quickly if the end goal remains. |
|
| |
| ▲ | cousinbryce 8 hours ago | parent | prev [-] | | Sam Allan has said some things that would make Lenin blush |
|
| |
| ▲ | jambalaya8 9 hours ago | parent | prev [-] | | As I said, working ourselves out of our jobs within the span of a few years. |
|
|
| ▲ | jerf 8 hours ago | parent | prev | next [-] |
| I've been using Kimi K2.6 lately (don't have 2.7 available through blessed work channels yet) for tasks where I already know what it is I want to do and I want to just step through the process in pieces, and it's fine. Do I have to correct it maybe a bit more than Opus? Yeah, but the real cutoff would be between "I have to read every line" and "I can just trust it without reading every line" and for me neither model hits that mark, and I expect it to be a while yet for that. Is it as good as Opus if I want to spitball about architecture and then convert that to code? No, but I don't have that problem all the time, and it's there if I do need it. And now in a heavy coding week rather than bumping up against my spend limit by late Wednesday or Thursday I'm comfortably below it all week. That said, if anything I feel like I have to rein in K2.6 much more than Opus, actually. If I want to just ask it a question without it inferring some coding task to immediately start doing, it takes a lot more care to prevent it from just running off half-cocked off of an only 3/4s-cocked idea of my own. I use "plan" mode with both but it's somewhat more defensive with K2.6 than Opus. |
|
| ▲ | nozzlegear 9 hours ago | parent | prev | next [-] |
| > I have been moving more and more to K2.7 Code and GLM-5.2 the last few weeks. They are often good enough for assistance, very fast, and cheap. I've moved completely to local models that I run with my M1 Mac Studio (64gb ram) some time ago. But for the rare times when I feel the local, quantized Qwen3.6 isn't enough, I just connect to Openrouter and use something like Kimi, GLM or Deepseek for a fraction of the price of Anthropic et al. |
| |
|
| ▲ | m3h 8 hours ago | parent | prev | next [-] |
| I think you should try an OpenAI model like GPT 5.5. It is better at following instructions and boundaries set in the prompt. It feels like a more capable "agent assistant" than Claude models but without loss of intelligence. Most of my work involves "Agentic engineering" instead of fire-and-forget. I like to stay involved during the planning as well as review, and I ask a lot more questions of the agent than I've seen others do. In a way, I'm using the agent in a sort of "hyper auto-complete" mode to fill in the blanks (rather big blanks) once I've set out the requirements, scope and design (sometimes specific module boundaries). This works best for me. |
| |
| ▲ | ifwinterco 7 hours ago | parent [-] | | I prefer GPT 5.5 to Opus but both are absurdly expensive token hogs, I can't afford to use either as my main model at $work with the monthly spend cap we have. I use Composer (since we use Cursor) or GPT 5.3-codex as my workhorse models and only break out the big guns when I have a genuinely difficult problem to solve. IMO somewhat weirdly 5.3-codex might be the best overall coding model OpenAI have ever released. It's 90% as good as 5.5 and costs about 20% as much, since it's both cheaper per token and uses fewer tokens for the same task. I'll miss it when they inevitably deprecate it, but hopefully I can use Kimi K2.7 by then | | |
| ▲ | skeptic_ai 2 hours ago | parent | next [-] | | Buy 5 accounts at 20usd each. It’s 100 and lasts decently on single threaded work | |
| ▲ | m3h 7 hours ago | parent | prev [-] | | I didn't realize GPT 5.3 Codex was that good. OpenAI claims to have made their new Terra model as good as GPT 5.5, but with half the cost per intelligence. Hopefully, this will bring it closer to the price you're expecting (or even better considering GPT models have good acceptance/success rates according to benchmarks). |
|
|
|
| ▲ | nsoonhui 2 hours ago | parent | prev | next [-] |
| Sorry, exactly what is the distinction between agent-assist and agent-driven? If I give the AI an image and just tell it what's wrong, and it then goes on to fix the bug in the codebase for me (and write the tests), is this agent-assist or agent-driven? Sometimes I just give the AI my description and a mockup, and it creates a plan and implements the details for me, and I verify visually (this is the weak spot of AI). Is this agent-assist or agent-driven? |
|
| ▲ | indoordin0saur 3 hours ago | parent | prev | next [-] |
| Yeah. Opus is nice for tasks that require significant planning and considering broader effects on other parts of the code. But it likes to go off the rails and do too much. Often it gives good-sounding ideas but it has a tendency to distract me by giving me a huge to-do list. |
|
| ▲ | jklmnopqrstuvw 9 hours ago | parent | prev | next [-] |
| From my own experience, GLM-5.2 generally costs more tokens and is much slower. |
| |
| ▲ | pimeys 9 hours ago | parent | next [-] | | I use GLM 5.2 Fast from Fireworks and it's very fast. Where are you using it from? | |
| ▲ | microtonal 9 hours ago | parent | prev | next [-] | | Which inference provider do you use? (Admittedly, I use K2.7 a lot more currently.) | |
| ▲ | james2doyle 9 hours ago | parent | prev [-] | | Tokens and speed are a factor but does it require less back and forth to get things right? Being "fast and cheap but wrong" still has a cost that an otherwise "expensive and slow" exchange does not | | |
| ▲ | paradox460 3 hours ago | parent [-] | | In my experience it spends a lot more tokens to do things. I wrote a tiny extension for omp that counts the number of "Actually" in the response, and if it exceeds a threshold stops execution and waits for me to tell it what to do. Even then it frequently just ignores basic instructions like "only write boilerplate, I will fill in the functionality" Imo MiniMax and MiMo are a lot more reliable (and cheap) Not opus level, but close enough and cheap enough to get the job done |
|
|
|
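For readers wondering what the "count the Actually's" guard mentioned above might look like, here is a minimal sketch in Python. It is not paradox460's extension and does not use the omp extension API; the function names, the threshold value, and the choice to match case-insensitively on whole words are all assumptions made for illustration.

```python
import re

# Hypothetical threshold; the comment above does not say what value was used.
ACTUALLY_THRESHOLD = 3


def count_actually(response: str) -> int:
    """Count whole-word, case-insensitive occurrences of 'actually'."""
    return len(re.findall(r"\bactually\b", response, flags=re.IGNORECASE))


def should_pause(response: str, threshold: int = ACTUALLY_THRESHOLD) -> bool:
    """Return True when the count exceeds the threshold, i.e. when a harness
    would stop execution and wait for the user to say what to do next."""
    return count_actually(response) > threshold
```

A harness would call `should_pause` on each model response and hand control back to the user when it returns True. How the pause itself is wired in depends on the harness and is left out here.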
| ▲ | mohamedkoubaa 10 hours ago | parent | prev | next [-] |
| I've been moving more to Composer 2.5 for the same reason. KISS principle. |
| |
| ▲ | everfrustrated 6 hours ago | parent | next [-] | | Composer 2.5 fast (via Grok) is honestly amazing. It's been implementing everything I've asked and getting it right first time.
Been impressed with its front-end ability. If this was the last model I could ever use I think I would be happy. | |
| ▲ | AdminAdmim 9 hours ago | parent | prev [-] | | Same for me. I downgraded my Cursor subscription because when I use Cursor, 90% of the time I use Composer 2.5 fast |
|
|
| ▲ | mattmatheus 7 hours ago | parent | prev | next [-] |
| I've been working to use the best model for the task for about 6 months and have found great success doing planning with the 'frontier' model but punting implementation down to a 'lesser' model. I'm using Beads-Rust (a Rust fork of GasTown's beads) as my issue tracker. So far, so good. |
|
| ▲ | whateveracct 9 hours ago | parent | prev | next [-] |
| agent-assisted development uses orders of magnitude fewer tokens than agent-driven development. the incentives aren't there sadly |
| |
| ▲ | sanderjd 9 hours ago | parent [-] | | Not for a business model that scales revenue by token usage. But other business models are available. |
|
|
| ▲ | mark_l_watson 7 hours ago | parent | prev | next [-] |
| Good point, I also like to do the work myself, with an assistant under my control. I am usually really happy with DeepSeek v4 Flash that I feel just mostly does what I tell it to do, but I do switch to Pro for harder tasks. There are so many models, and I personally ignore benchmarks so it takes some time to try different models on my use cases. Fortunately, it is ‘good enough’ to do the work to find a few models that work for me, and just use them for a month or two before re-investing time for my own evals to possibly change models. People should evaluate what works for them and ignore other people and benchmarks. (Apologies if that sounds snarky.) |
|
| ▲ | xpct 9 hours ago | parent | prev | next [-] |
| I've been largely disappointed by how much the Claude models ignore custom instructions, and sometimes even prompts on the chat interface. It sometimes feels like talking to a wall, or as if there was a third person in the chatroom whose messages I can't see. I can't help but feel this is intentional towards the 'Agentic' workflow. |
| |
| ▲ | spacephysics 9 hours ago | parent | next [-] | | I think this seems purposeful, as there are 2 opposing forces at play:
- Have a model that follows the user's instructions
- Have a model that follows the system prompt instructions more
For the 'safety' argument (Re: Fable), they need these models to have basically a 2-tier instruction system, but given LLMs aren't great with actual logic unless they program it out to test, this runs afoul and we get one or the other. Feels like optimizing for either precision or recall, but can't have both | | |
| ▲ | paradox460 3 hours ago | parent | next [-] | | We're speed running HAL 9000 | |
| ▲ | wqaatwt 9 hours ago | parent | prev [-] | | I suppose a solution might be going with a customizable harness like pi and merging Anthropic’s system prompt with a personalized custom one to remove all contradictions | | |
| ▲ | arcanemachiner 8 hours ago | parent [-] | | You still have to manage/fight with the post-training that is baked into the model itself. |
|
| |
| ▲ | manveerc 9 hours ago | parent | prev | next [-] | | Totally agreed. I sometimes wonder if they are making the model "lazy" with each iteration, it keeps getting better at avoiding work. | | |
| ▲ | skerit 9 hours ago | parent [-] | | This is why Fable was so good. It followed instructions and it was in no way lazy. | | |
| ▲ | DontchaKnowit 9 hours ago | parent | next [-] | | People keep making comments about fable like this? You could only use it for what like a week? How is that at all enough time to evaluate? Opus 4.6 didn't suffer from these problems for a hot minute and then when newer models were released it got worse. I think they change a ton behind the scenes and allocate compute however they want, so the model you use today may behave much differently than how it behaved yesterday | | |
| ▲ | pdimitar 8 hours ago | parent | next [-] | | > You could only use it for what like a week? How is that at all enough time to evaluate? By observing how in 4 workdays it achieved more than Opus in ~11 days. I am my team's backend lead and the Fable 5 model finally turned the tide on my overwhelming backlog. Back to Opus and I have to treat it like a special-education kid multiple times a day. | |
| ▲ | boc 8 hours ago | parent | prev | next [-] | | The ~72 hours I had access to Fable were by far the most productive I've had in months. Re-wrote massive parts of my codebase and caught a ton of bugs and logic issues that had silently slipped through before. I went over my subscription limit and immediately kept paying the API price to keep going. It was that good. | |
| ▲ | plorkyeran 8 hours ago | parent | prev | next [-] | | It was a pretty stark difference. I had the opposite problem where it did too much and overshot what I wanted from it so I certainly assume that if it had stuck around it would have gotten tuned back a bit pretty quickly. | |
| ▲ | marcindulak 6 hours ago | parent | prev | next [-] | | For me claude-fable-5 failed to follow the instruction following test I'm making against various models https://github.com/marcindulak/claude-fails-to-follow-claude... | |
| ▲ | tskj 8 hours ago | parent | prev | next [-] | | You didn't really have to use it more than a day honestly to tell what kind of shocking paradigm change it was. Man do I miss it. | |
| ▲ | Analemma_ 8 hours ago | parent | prev [-] | | Heh, it's not crazy if you're here in the Bay: I know multiple people who more-or-less disappeared for days when Fable came out because they were running their benchmarks, and only emerged blinking into the sunlight when the USG banned it. That's just how things are here now, most people are normal but there are some serious LLM dope addicts out and about. |
| |
| ▲ | acters 9 hours ago | parent | prev [-] | | I've been seeing LLMs act lazy from the very beginning. They got a little better but smaller models really only want to have a single task given to them. Mythos at least does work. RIP |
|
| |
| ▲ | marcindulak 6 hours ago | parent | prev | next [-] | | I keep adding selected cases of CLAUDE.md instructions non-compliance reported on claude-code github to that issue https://github.com/anthropics/claude-code/issues/13689. Subjectively the amount of such cases seems lower during the past month. It may be that claude-opus-4-8 (default thinking) is a bit better at instructions following than past models. | |
| ▲ | gs17 9 hours ago | parent | prev | next [-] | | > or as if there was a third person in the chatroom whose messages I can't see. If you set off a classifier, that's how it looks to Claude. | | |
| ▲ | xpct 9 hours ago | parent [-] | | I wasn't working with anything sensitive, but it really does feel like it sometimes condenses even something low like three bullet points to two. IMO, they were quite good with checklists even a year ago, and tried to tick off each one. |
| |
| ▲ | storus 9 hours ago | parent | prev | next [-] | | Try to run your prompts through Claude to pinpoint any ambiguous parts that can be interpreted in multiple ways, or self-contradictory sections. I typically resolve any prompt-ignoring issues with that. | |
| ▲ | Sohcahtoa82 6 hours ago | parent | prev [-] | | [dead] |
|
|
| ▲ | addozhang 4 hours ago | parent | prev | next [-] |
| I feel pretty much the same way, and the scenarios are similar too. Using Sonnet has a bigger advantage when it comes to response time. |
|
| ▲ | a_c 8 hours ago | parent | prev | next [-] |
| I actually use sonnet 4.6 for my day to day coding too. It consumes far fewer tokens and is good enough. Opus is just too token consuming for it to be useful to me. |
| |
| ▲ | ricardonunez 3 hours ago | parent | next [-] | | I am in the same position. Do you think they are going to remove it and deprecate it at some point? | |
| ▲ | bazhand 8 hours ago | parent | prev [-] | | Have you tried '/model opusplan'? I've had strong results mixing opus for planning with sonnet implementing. | | |
| ▲ | a_c 8 hours ago | parent | next [-] | | I haven't. Thanks for the heads up will give it a try!
I use opus to comment on code design quite often though. It became a pattern that I made a skill for me to ask for second opinions https://news.ycombinator.com/item?id=48733092
Would love to hear your feedback if you don't mind! | |
| ▲ | vtail 8 hours ago | parent | prev [-] | | Fascinating! How did you learn about this? | | |
| ▲ | bazhand 7 hours ago | parent [-] | | It was something that was used for token efficiency. Most of the settings and use cases are quite poorly communicated, but asking Claude to review the latest release changelog (https://github.com/anthropics/claude-code/blob/main/CHANGELO...) is quite useful. Combining that with @"claude-code-guide (agent)" to read its own docs for settings/configs is super helpful. Another quite useful approach is to use /opusplan along with /codex:rescue (https://github.com/openai/codex-plugin-cc), which means you get quite a strongly reviewed plan using native claude + codex without having to implement the mostly useless trust-me-bro plugins and other bs. | | |
|
|
|
|
| ▲ | epolanski 9 hours ago | parent | prev | next [-] |
| I've been saying for ages that since Opus 4.6, models are increasingly smart but increasingly unhelpful as assistants. Fable was amazing as a vibecoder but as an assistant it can't resist jumping into implementation and filling chats with pointless jargon. It's really grim if you're looking for assistance instead of an implementor. GPT 5.5 Pro and Fable are gorgeous bullshitters that pretend to be right (often convincingly because they are very smart) even when they are wrong and I need tons of energy to process their information. I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files. |
| |
| ▲ | thewebguyd 9 hours ago | parent | next [-] | | By design, unfortunately. If they are just assistants, they can't sell the dream of "we're going to replace human labor completely" to the C-suite. | | |
| ▲ | baq 9 hours ago | parent | next [-] | | It isn’t a dream, it’s a reality for some of us here and it will be increasingly so for everyone else. Amazingly, USG intervening slowed the dynamic greatly (fortunately?) The problem is obviously who will be left. There’s a lot of scifi to catch up on. | |
| ▲ | epolanski 9 hours ago | parent | prev [-] | | I think that they are simply evaluated on prompt to solution benchmarks. |
| |
| ▲ | whstl 8 hours ago | parent | prev | next [-] | | Yep, this is why experiences and ratings of models vary so wildly. I recently migrated a very large web app to Tailwind and Opus kept screwing up over and over, refactoring and changing the design, the more complex the component became. I ended up asking Haiku to do it and it managed to do everything correctly, pretty much without intervention. | |
| ▲ | mullingitover 8 hours ago | parent | prev | next [-] | | > I don't like it but don't know what to do, Anthropic models especially increasingly ignore instructions whether in memory or agents files. I've taken to instructing the agent to manage the subagent, and the principal agent's sole job is ensuring the subagent follows instructions to the letter. | |
| ▲ | epolanski 6 hours ago | parent | prev [-] | | Just to follow up on what I mean, this was my first interaction with Sonnet 5: "I just cloned this repo, investigate how to set it up, don't install anything, just collect information" _spews information_ I proceed with the setup, but get a Linux specific dependency in a bash script, so I want to evaluate whether it can be rewritten... "There's this error on MacOS, I think it's because we need linux-utils from brew, verify whether the script can be written in bare posix" _proceeds installing linux-utils and all the rest_ "Didn't I tell you to not install anything?" _you're absolutely right_ F*k me.. |
|
|
| ▲ | duxup 7 hours ago | parent | prev | next [-] |
| “Hey I saw some messed up function commented out that at face value is a bad idea… so here it is again with some nonsense assumptions ….” I ask “where did you get that?” … too often if I’m not constantly guiding it, and even then it still goes off the rails. |
|
| ▲ | arikrahman 8 hours ago | parent | prev | next [-] |
| I have also started shifting to models more reasonable for my workflow. I've been using the Reasonix harness for Deepseek, and cache hits make the token use basically free. This is with unsubsidized models as well, using American providers. |
|
| ▲ | bckr 7 hours ago | parent | prev | next [-] |
| I suggest encoding your invariants in the harness. Architectural invariants that can be mechanically checked, including which modules are approved, which dependencies, etc. |
|
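To make the "mechanically checked invariants" suggestion above concrete, here is a rough sketch of one such check in Python: an allowlist of approved top-level modules, enforced by parsing source with the standard `ast` module. The allowlist contents and function names are hypothetical, and this is only one possible shape for such a check, not something bckr described in detail.

```python
import ast

# Hypothetical allowlist; a real project would load this from its own config.
APPROVED_MODULES = {"json", "pathlib", "re"}


def imported_modules(source: str) -> set[str]:
    """Collect the top-level package names imported by a piece of source code."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Relative imports (level > 0) stay inside the project; skip them.
            if node.level == 0 and node.module:
                found.add(node.module.split(".")[0])
    return found


def unapproved_imports(source: str, approved=APPROVED_MODULES) -> set[str]:
    """Return the imported top-level modules that are not on the allowlist."""
    return imported_modules(source) - set(approved)
```

A harness could run `unapproved_imports` over every file the agent touches and reject the change when the result is non-empty, so the rule is enforced by a check instead of by a sentence in the prompt.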
| ▲ | lacoolj 6 hours ago | parent | prev | next [-] |
| gemma-4-e4b is very good at assistance too, and is local and fast and small (and "free") |
|
| ▲ | trollbridge 8 hours ago | parent | prev | next [-] |
| No kidding. I expect to have models to use which are optimised for different use cases. Sonnet as an autonomous agentic model is silly. We already have other models for that if you want something weaker and cheaper than Opus. |
|
| ▲ | spullara 7 hours ago | parent | prev [-] |
| if you like that, use gpt models instead. |