I have a home server that runs Qwen3.6-35B-A3B through llama.cpp with Open WebUI for the user facing interface.

My teen isn't super interested in AI, but whenever they do feel curious they have their own account they can use on our home network. As far as chatting goes local models are more than capable for handling standard chat questions, doing research, helping troubleshoot problems etc. In fact it was an agent powered by the same model that setup the open webui server and took care of all the account management features through my phone (using Hermes agent).

If you're building AI powered features and using sophisticated agent setups for coding for work, then it make sense to use SoTA from these providers. But I've been using local models increasingly for personal use and am starting to find them preferable (I run an uncensored, ephemeral model for my own use and it's an entirely different experience than anything you can pay for).

Still haven't cancelled my personal Anthropic subscription, but considering it soon.

▲

jrochkind1 9 hours ago | parent | next [-]

What about local models do you find preferable?

I guess "starting to find them preferable" suggests to me you think they work better, but this is surprising to me so I think I may have misunderstood, so I ask!

Like you're saying they work better than the proprietary models (in what ways?), or you find them mostly good enough and prefer the privacy or cost, or what?

	▲	roadside_picnic 9 hours ago \| parent \| next [-]
		There are a couple of things, but basically it boils down to the same reason people prefer Linux to Windows/MacOs: customization, control and privacy (arguably all of these are really subsets of 'control'). Having full control over how your data is retained, what the system prompt is, which version of the model you're running, etc leads to much a more consistent experience. For example, for chat sessions, I can't stand the new "let me push back" version of Claude. For my home models I never have to worry about that. There's never a mystery as to whether the model secretly degraded performance, I always know exactly which model I'm using and how well it's utilizing resources etc. Open models also give you full visibility into the reasoning steps, so you never have to guess what the model is thinking. Then when you start getting into things like uncensored/abliterated models we're talking about something you can't even pay for. In case you're unfamiliar, even open local models have guardrails built in. But people in the community have found ways to remove these. One of the things I've found most concerning about AI, which is under discussed, is the combination of people having personal chats with an agent that both monitors the conversation and refuses to discuss certain topics. This leads to a very deep level of self-censoring I find dystopian. I also have multiple hermes agents setup, some with local backends other with open but non-local backends (e.g. Kimi through the API). For some tasks, I've just started to find the local agent tends to work better for the type of tasks I want (maybe it just over thinks less?). I don't use it for coding so much as research tasks and sysadmin stuff, but I've been really happy with the results. Oh, and let's not forget, especially running on a Mac, these local models are basically free to run.
	▲	jauntywundrkind 7 hours ago \| parent \| prev [-]
		The local models are willing to share their thinking. The Big AI models don't share their thinking, leaving only vague summaries. Having an AI that deliberately cloaks it's reasoning, that goes out of it's way to act like a Searls Chinese Room Experiment, that deliberately conceals information is incredibly gross. I love what I get from Opus or GPT, but mainly I use GLM and it's so starkly apparent how much better it is that it let's me work together with it, that I can nudge it as it works by correcting bad assumptions or clarifying for it, as it works. And... it just doesn't feel icky. It's not a quasi-mystical alien intelligence, which, honestly, gives me strong "this should be destroyed, is unsafe, and feels outright impermissible" vibes. As a coder, seeing thinking saves time and prevents errors. As a civilization, seeing thinking let's people understand what the AI is working with and grounds society in an appreciation for what is happening, keeps us a little moored. Personally, if I were a government, I would not allow it. Recent submission on this, The text in Claude Code’s “Extended Thinking” output is not authentic. https://patrickmccanna.net/the-text-in-claude-codes-extended... https://news.ycombinator.com/item?id=48630535

▲

drusepth 9 hours ago | parent | prev | next [-]

What is an "ephemeral" model in this context?

	▲	roadside_picnic 9 hours ago \| parent [-]
		Just running it through `llama-cli` so that there's absolutely no persistent state related to the chat (and least I believe this to be the case).

▲

agumonkey 9 hours ago | parent | prev | next [-]

What kind of machine is it running on ?

	▲	bakies 5 hours ago \| parent [-]
		I just started using this model on my Framework Desktop and it's very smart and fast.

▲

rvnx 9 hours ago | parent | prev | next [-]

From a privacy perspective, your objective is to stay away from people who have interest to snoop on your conversations.

So from the perspective of your teen, they would benefit from using z.ai or ChatGPT or Claude, etc, rather than the local server where you can see all the conversations.

What uncensored model do you recommend using ?

▲

panny 9 hours ago | parent [-]

>From a privacy perspective, your objective is to stay away from people who have interest to snoop on your conversations.

>So from the perspective of your teen, they would benefit from using z.ai or ChatGPT or Claude, etc, rather than the local server where you can see all the conversations.

That is bonkers. If I were a parent, I would hope my child would trust me more than systems monitored by FBI/NSA/etc. Like, what sort of sick relationship do you have to have with your own family to trust them less than strangers who would sell you into prison slavery for a buck.

	▲	rvnx 8 hours ago \| parent [-]
		Private conversations of a teen have low value for FBI/NSA. They have infinite value to their parents. The state isn't going to ground them, shame them at dinner, out them, or pull them out of a relationship, punish them. Parents reading your browsing history and private conversations when you are 14-18 years old (the age of teenagers) is very very creepy, unless there is a specific danger to avoid. It's like if you read their private journal. Adolescents need a private inner world to form an identity, and heavy parental intrusion ("psychological control") is the real distrust. Trust them, they are people, not possessions. You can guide them, but do not store their private messages locally under your control using the excuse of protecting them from NSA. If they trust you, they will tend to tell you upfront the things they have questions about, there is really no need to spy on their thoughts. Same with husband/wife btw.

▲

rib3ye 9 hours ago | parent | prev | next [-]

How many tokens /sec?

	▲	roadside_picnic 8 hours ago \| parent [-]
		M3-Max laptop: ~55 token/sec RTX 4090: ~190 token/sec I don't have the number around but there is a notable latency for pre-fill on the M3, but once it's running the delay is negligible. The RTX, unsurprisingly, is all around superior performance wise, but: I use that computer for gaming and image gen work so I can't dedicate it as a server, and, especially when it's warmer, the heat generated under heavy loads is noticable.

▲

ai_fry_ur_brain 9 hours ago | parent | prev [-]

> I run an uncensored, ephemeral model for my own use and it's an entirely different experience than anything you can pay for.

Dont. Goon. To. LLMs

▲

fyltr 9 hours ago | parent [-]

Wasn't the parent post referring to 'legitimate' demands? I often use them to get a broad overview of a technical field before reading human stuff on it, and it might be me but those clankers tend to spend half their reasoning on whether they are allowed to reply to my request. Censorship is an annoying waste of capacity for certain use cases, although it certainly has its boons when shipping commercial models.

	▲	ai_fry_ur_brain 4 hours ago \| parent [-]
		He was definately referring to gooning to llms.