bigyabai 10 days ago

> anything you pick up second-hand will still deprecate at that pace

Not really? The people who do local inference most (from what I've seen) are owners of Apple Silicon and Nvidia hardware. Apple Silicon has ~7 years of decent enough LLM support under its belt, and Nvidia is only now starting to deprecate 11-year-old GPU hardware in its drivers.

If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s. Maybe even faster inference because of MoE architectures or improvements in the backend.

Uehreka 10 days ago | parent | next [-]

People on HN do a lot of wishful thinking when it comes to the macOS LLM situation. I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with where it doesn’t matter if it messes up.

And that’s fine! But then people come into the conversation from Claude Code and think there’s a way to run a coding assistant on a Mac, saying “sure it won’t be as good as Claude Sonnet, but if it’s even half as good that’ll be fine!”

And then they realize that the heavily quantized models you can run on a Mac (one that isn’t a $6,000 beast) can’t invoke tools properly and try to “bridge the gap” by hallucinating tool outputs. It becomes clear that the models small enough to run locally aren’t “20-50% as good as Claude Sonnet”; they’re like toddlers by comparison.

People need to be more clear about what they mean when they say they’re running models locally. If you want to build an image-captioner, fine, go ahead, grab Gemma 7b or something. If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

EagnaIonat 9 days ago | parent | next [-]

> I feel like most of the people touting the Mac’s ability to run LLMs are either impressed that they run at all, are doing fairly simple tasks, or just have a toy model they like to mess around with and it doesn’t matter if it messes up.

I feel like you haven't actually used it. Your comment may have been true 5 years ago.

> If you want an assistant you can talk to that will give you advice or help you with arbitrary tasks for work, that’s not something that’s on the menu.

You can use a RAG approach (e.g. Milvus) and LoRA adapters to dramatically improve the accuracy of the answers if needed.
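
Roughly something like this, if you want a picture of the moving parts (untested sketch; it assumes pymilvus with Milvus Lite and the Ollama Python client, and the model names, documents, and file paths are just placeholders):

    import ollama
    from pymilvus import MilvusClient

    # Milvus Lite keeps the vector store in a single local file.
    client = MilvusClient("local_rag.db")
    client.create_collection(collection_name="notes", dimension=768)

    docs = [
        "The release checklist lives in RELEASING.md.",
        "Nightly builds are published to the internal registry.",
    ]
    # Embed documents with a local embedding model (768-dim for nomic-embed-text).
    for i, doc in enumerate(docs):
        emb = ollama.embeddings(model="nomic-embed-text", prompt=doc)["embedding"]
        client.insert(collection_name="notes", data=[{"id": i, "vector": emb, "text": doc}])

    question = "Where is the release checklist?"
    q_emb = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    hits = client.search(collection_name="notes", data=[q_emb], limit=2, output_fields=["text"])
    context = "\n".join(hit["entity"]["text"] for hit in hits[0])

    # Feed the retrieved context to whatever local chat model fits your machine.
    reply = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
    )
    print(reply["message"]["content"])

The same retrieve-then-prompt pattern works with other stores and models; Milvus is just the piece that keeps working once the corpus outgrows a toy demo.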

Locally you can run multiple models, multiple times without having to worry about costs.

You also have the likes of Open WebUI, which builds numerous features on top of a chat interface if you don't want to do any coding.

I have a very old 32GB M1 MBP and numerous applications I've built to do custom work. It does the job fine and speed is not an issue. It's not powerful enough for a LoRA build, but I have a more recent laptop for that.

I doubt I am the only one.

bigyabai 10 days ago | parent | prev [-]

I agree completely. My larger point is that Apple's and Nvidia's hardware has depreciated more slowly, because they've been shipping highly dense chips for a while now. Apple's software situation is utterly derelict and cannot seriously be compared to CUDA in the same sentence.

For inference purposes, though, compute shaders have worked fine for all 3 manufacturers. It's really only Nvidia users that benefit from the wealth of finetuning/training programs that are typically CUDA-native.

Aurornis 10 days ago | parent | prev [-]

> If you bought a decently powerful inference machine 3 or 5 years ago, it's probably still plugging away with great tok/s.

I think this is the difference between people who embrace hobby LLMs and people who don’t:

The token/s output speed on affordable local hardware for large models is not great for me. I already wish the cloud-hosted solutions were several times faster. Any time I go to a local model it feels like I’m writing e-mails back and forth to an LLM, not working with it.

And also, the first Apple M1 chip was released less than 5 years ago, not 7.

bigyabai 10 days ago | parent [-]

> Any time I go to a local model it feels like I’m writing e-mails back and forth

Do you have a good accelerator? If you're offloading to a powerful GPU it shouldn't feel like that at all. I've gotten ChatGPT speeds from a 4060 running the gpt-oss 20B and Qwen3 30B models, both of which are competitive with OpenAI's last-gen models.
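
For what it's worth, the setup isn't exotic. A rough llama-cpp-python sketch (my stack, not necessarily yours; a CUDA-enabled build is assumed and the GGUF path is a placeholder):

    from llama_cpp import Llama

    # n_gpu_layers=-1 asks llama.cpp to offload every layer to the GPU;
    # use a smaller positive number if the quantized model doesn't fit in VRAM.
    llm = Llama(
        model_path="models/qwen3-30b-a3b-Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,
        n_ctx=8192,
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])

Once the layers are actually on the GPU, the tok/s gap to hosted services stops feeling like e-mail latency.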

> the first Apple M1 chip was released less than 5 years ago

Core ML has been running on Apple-designed silicon for 8 years now, if we really want to get pedantic. But sure, actual LLM/transformer use is a more recent phenomenon.