Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference(gizmoweek.com)
238 points by takumi123 14 hours ago | 153 comments
veunes 3 hours ago | parent | next [-]

I noticed the inference is routed through the GPU rather than the Apple Neural Engine. Google’s engineers likely gave up on trying to compile custom attention kernels for Apple’s proprietary tensor blocks iirc. While Metal is predictable and easy to port to, it drains the battery way faster than a dedicated NPU. Until they rewrite the backend for the ANE, this is just a flashy tech demo rather than a production-ready tool.

jonathaneunice 2 hours ago | parent | next [-]

Is the Apple Neural Engine even a practical target for LLMs?

Maybe not strictly impossible, but the ANE was designed for an earlier, pre-LLM style of ML. Running LLMs on the ANE (e.g. via Core ML) is possible in theory, but the substantial model conversion and custom hardware tuning required make for a high hurdle IRL. The LLM ecosystem standardized around CPU/GPU execution and, to date at least, seems unwilling to devote resources to the ANE. Even Apple's MLX framework has no ANE support. There are models the ANE runs well, but LLMs do not seem to be among them.

Yukonv 2 hours ago | parent [-]

It is possible, but it requires a very specific model design. As this reverse engineering effort has shown [0], "The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine." Building for it requires using a pipeline aimed specifically at Core ML [1].

[0] https://maderix.substack.com/p/inside-the-m4-apple-neural-en... [1] https://developer.apple.com/documentation/coreml

GeekyBear an hour ago | parent | prev | next [-]

It will be interesting to see how things change in a couple of months at WWDC, when Apple is said to be replacing their decade old CoreML framework with something more geared for modern LLMs.

> A new report says that Apple will replace Core ML with a modernized Core AI framework at WWDC, helping developers better leverage modern AI capabilities with their apps in iOS 27.

https://9to5mac.com/2026/03/01/apple-replacing-core-ml-with-...

tjoff 3 hours ago | parent | prev | next [-]

I'm certainly fine with it drawing some power.

Running background processes might motivate heavier use of the NPU, but that doesn't exactly feel like a pressing need. Actively listening to you 24/7 and analyzing the data isn't a use case I'm eager to explore, given how little control we have over our own devices.

liuliu an hour ago | parent | prev | next [-]

The ANE is OK, but it pretty much needs you to pack your single vector into a batch of at least 128. (Draw Things recently shipped ANE support inside our custom inference stack, without any private APIs.) For token generation that is not ideal, unless you are using a drafter so there are more tokens to process at one inference step.

It is an interesting area to explore, and yes, this is a tech demo. There is a long way to go to production-ready, but I am more optimistic now than I was a few months back (with Flash-MoE, DFlash, and some tricks I have).
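The packing point above can be put in toy numbers. This is just an illustrative sketch; the 128-wide packing and the 8-token drafter are both assumptions taken from the comment, not measured ANE behavior:

```python
# Hypothetical packing width from the comment above: work must be
# packed into lanes of at least 128 before the ANE runs it.
LANE = 128

def lane_utilization(tokens_per_step: int, lane: int = LANE) -> float:
    """Fraction of a packed lane doing useful work in one inference step."""
    return min(tokens_per_step, lane) / lane

# Plain autoregressive decoding: 1 token per step, <1% of the lane used.
print(lane_utilization(1))

# A drafter proposing 8 speculative tokens verified in one step uses 8x more.
print(lane_utilization(8))
```

This is why speculative decoding changes the calculus: more tokens verified per step means less of the packed lane is wasted padding.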

chatmasta an hour ago | parent | prev | next [-]

Isn’t Apple paying Google billions of dollars to license these things? Surely they should make it easier to compile for their native engines…

the_pwner224 3 hours ago | parent | prev | next [-]

> Google’s engineers likely gave up on trying to compile custom attention kernels for Apple’s proprietary tensor blocks iirc.

The AI Edge Gallery app on Android (the officially recommended way to try out Gemma on phones) uses the GPU (it lacks NPU support) even on first-party Pixel phones. So it's less that "they didn't want to interface with Apple's proprietary tensor blocks" and more that they just didn't give a f in general. A truly baffling decision.

satvikpendem an hour ago | parent [-]

Edge Gallery does have NPU support, it needs you to install the beta of AICore on the Play Store, the Edge Gallery app has instructions.

the_pwner224 an hour ago | parent [-]

Huh I didn't see those instructions when I tried it last week. Must not have looked closely enough. I do remember it not having NPU support (confirmed by other people) back at the Gemma 3 launch a while ago.

satvikpendem an hour ago | parent [-]

Yes they added it for Gemma 4. Maybe it detects whether your phone has an NPU or not too. I have a OnePlus 15 which does have it.

satvikpendem an hour ago | parent | prev | next [-]

Edge Gallery app on Android has NPU support but it requires a beta release of AICore so I'm sure the devs are working on similar support for Apple devices too.

InMice an hour ago | parent | prev | next [-]

On my iPhone I can choose CPU or GPU in Edge Gallery. What would be the difference if I used CPU?

bigyabai an hour ago | parent | prev [-]

The ANE is not a fast or realistic way to infer modern LLMs.

temp7000 7 hours ago | parent | prev | next [-]

Is it just me, or does the article sound like LLM output?

The pattern "It's not mere X — it's Y" occurs something like 4 times in the text :v

Andrex 5 hours ago | parent | next [-]

I can't believe you'd impugn the high moral standards of "gizmoweek dot com".

BeetleB 5 hours ago | parent | prev | next [-]

I don't care if it's written by an LLM.

The problem with the article is the complete lack of details. No benchmarks on the iPhone-capable models. No details whatsoever.

Human or LLM - the article is a whole lot of nothing.

doliveira 5 hours ago | parent | next [-]

Funnily enough, to me these aphorisms (?) sound almost like the replicant test in Blade Runner. Like they're the basic unit of "nudging".

veunes 3 hours ago | parent | prev [-]

This article is all fluff because real numbers would be bad marketing. If they mentioned that a 4B model on an iPhone 16 drains 15% of the battery for a single long prompt and triggers hard thermal throttling after 20 seconds, nobody would be clicking on headlines about "commercial viability", fwiw.

Domenic_S 3 hours ago | parent [-]

I ran several Gemma 4 quants on my 24GB Mac mini, and with proper context-size tuning they're quick enough, I guess, but I would really love to see them working well on an iPhone with 2–3GB of RAM...

caminante 6 hours ago | parent | prev | next [-]

Ran it through Claude, Grok, whatever... for me, they all flagged issues with these content farms (no sources, punchy phrases with repetition, ...).

My favorite: they couldn't even prove the author is a real person. They all found no record!

itissid 6 hours ago | parent [-]

As someone said, we live in a strange but amazing era: it has never been easier to be deceived, but it's _also_ much easier to uncover said deception, especially on the internet.

ryandvm 5 hours ago | parent | next [-]

Or at least think you've uncovered deception. It's not clear to me yet that any of these "AI detectors" are reliable, and if they are, it's just an arms race.

walthamstow 5 hours ago | parent | prev [-]

It's much faster and simpler to assume everything on the internet is crooked

figmert 7 hours ago | parent | prev | next [-]

> :v

I guess I found the millennial. I haven't seen that in so long!

Den_VR 6 hours ago | parent | next [-]

:<

neals 5 hours ago | parent [-]

:')

Andrex 5 hours ago | parent [-]

>_>

xiconfjs 3 hours ago | parent [-]

\o/

yangm97 5 hours ago | parent | prev [-]

Analog emojis FTW

Melatonic 3 hours ago | parent [-]

¯\_(ツ)_/¯

altruios 4 hours ago | parent | prev | next [-]

It is like the AI is training us to avoid certain language patterns. I rebel at the hostage of weak language: for strong language is next.

Melatonic 3 hours ago | parent [-]

The mighty semicolon prepares for its return!

odo1242 3 hours ago | parent | prev | next [-]

It does in fact sound like LLM output

mtremsal 7 hours ago | parent | prev | next [-]

An AI slop pattern so widespread it’s now referred to as “it’s not pee pee it’s poo poo”.

lynndotpy 3 hours ago | parent | next [-]

It's not just a widespread pattern –––––––––––––––– it's a sign of things to come.

Domenic_S 3 hours ago | parent [-]

You didn't just nail it ------------ you cut to the core of the issue.

Cider9986 5 hours ago | parent | prev [-]

I haven't heard that—that's good.

wtyvn 3 hours ago | parent | prev | next [-]

Smells like slop to me, looks like the site exists solely to garner search hits.

kbouw 7 hours ago | parent | prev [-]

You would be correct. I ran the article through GPTZero: 100% AI.

subscribed 6 hours ago | parent | next [-]

These detectors are a scam falsely flagging non-native English speakers: https://plagiarismcheckerai.app/ai-detector-false-positives-...

At this point relying on their judgement is beyond folly.

cubefox 6 hours ago | parent [-]

It's both ironic and confusing that this website itself promotes an AI detector.

xd1936 7 hours ago | parent | prev | next [-]

https://redd.it/13mft8s

rationalist 6 hours ago | parent [-]

user-friendly Old reddit link:

https://old.reddit.com/r/ChatGPT/comments/13mft8s/apparently...

71bw 7 hours ago | parent | prev | next [-]

Would not trust any of these tools in the slightest.

devmor 6 hours ago | parent | prev [-]

AI detectors that use text as a basis are not real. It is fundamentally impossible for them to exist.

HarHarVeryFunny 5 hours ago | parent [-]

Huh?

LLM output doesn't have the variety of human output, since they operate in fixed fashion - statistical inference followed by formulaic sampling.

Additionally, the statistics used by LLMs are going to be similar across different LLMs, since at scale it's just "the statistics of the internet".

Human output has much more variety, partly because we're individuals with our own reading/writing histories (which we're drawing upon when writing), and partly because we're not so formulaic in the way we generate. Individuals have their own writing styles and vocabulary, and one can identify specific authors to a reasonable degree of accuracy based on this.

It's a bit like detecting cheating in a chess tournament. If an unusually high percentage of a player's moves are optimal computer moves, then there is a high likelihood that they were computer generated. Computers and humans don't pick moves in the same way, and humans don't have the computational power to always find "optimal" moves.

Similarly with the "AI detectors" used to detect if kids are using AI to write their homework essays, or to detect if blog posts are AI generated ... if an unusually high percentage of words are predictable by what came before (the way LLMs work), and if those statistics match that of an LLM, then there is an extremely high chance that it was written by an LLM.

Can you ever be 100% sure? Maybe not, but in reality human written text is never going to have such statistical regularity, and such an LLM statistical signature, that an AI detector gives it more than a 10-20% confidence of being AI, so when the detector says it's 80%+ confident something was AI generated, that effectively means 100%. There is of course also content that is part human part AI (human used LLM to fix up their writing), which may score somewhere in the middle.
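The statistical intuition in the comment above can be sketched in a few lines. This is purely illustrative, with made-up token probabilities; real detectors use a language model to score each token, but the "regularity vs. variety" signal looks like this:

```python
import math

def avg_surprisal(logprobs):
    """Mean negative log-probability per token (lower = more predictable)."""
    return -sum(logprobs) / len(logprobs)

def burstiness(logprobs):
    """Variance of per-token surprisal; human text tends to vary more."""
    mean = avg_surprisal(logprobs)
    return sum((-lp - mean) ** 2 for lp in logprobs) / len(logprobs)

# Made-up numbers: an "LLM-like" text is uniformly predictable,
# a "human-like" text mixes very predictable and surprising tokens.
llm_like = [math.log(0.6)] * 10
human_like = [math.log(0.9)] * 5 + [math.log(0.05)] * 5

print(burstiness(llm_like))    # near zero: statistically flat
print(burstiness(human_like))  # much larger: "bursty" human signal
```

The detector's claim is essentially that LLM text clusters at high predictability and low burstiness, which a constant sampling procedure tends to produce.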

ben_w 4 hours ago | parent [-]

> LLM output doesn't have the variety of human output, since they operate in fixed fashion - statistical inference followed by formulaic sampling.

This is the wrong thing to look at; your chess analogy is much stronger, and the detection method similar (if you can figure out a prompt that generates something close to the content, it almost certainly isn't of human origin).

But as to why the thing I'm quoting doesn't work: if you took, say, web comic author Darren Gav Bleuel, put him through a sci-fi mass duplication incident to make 950 million of him, and had them all talking and writing all over the internet, people would very quickly learn to recognise the style, which would have very little variety because they'd all be forks of the same person.

Indeed, LLMs are very good at presenting styles other than their defaults, better at this than most humans, and what gives LLMs away is that (1) very few people bother to ask them to act other than their defaults, and (2) all the different models, being trained in similar ways on similar data with similar architectures, are inherently similar to each other.

newsoftheday 3 hours ago | parent [-]

What if the prompt includes, "Produce output that doesn't sound like an AI generated it."?

js8 an hour ago | parent [-]

I got curious and tried: https://claude.ai/share/3af7bd6a-15f8-4533-9dc3-a44adef255b3

blixt 5 hours ago | parent | prev | next [-]

I made this offline pocket vibe coder using Gemma 4 on an iPhone (it works offline once the model is downloaded). It can technically run the 4B model, but it will default to 2B because of memory constraints.

https://github.com/blixt/pucky

It writes a single TypeScript file (I tried multiple files but embedded Gemma 4 is just not smart enough) and compiles the code with oxc.

You need to build it yourself in Xcode because this probably wouldn't survive the App Store review process. Once you run it, there are two starting points included (React Native and Three.js), the UX is a bit obscure but edge-swipe left/right to switch between views.

mandeepj 4 hours ago | parent [-]

You might find it useful - https://news.ycombinator.com/item?id=45129160

I think React Native can be swapped out for Swift.

codybontecou 8 hours ago | parent | prev | next [-]

Unfortunately Apple appears to be blocking the use of these LLMs within apps on their App Store. I've been trying to ship an app that contains local LLMs and have hit a brick wall with guideline 2.5.2.

liuliu an hour ago | parent | next [-]

In case someone doesn't know, this is the full text:

> 2.5.2 Apps should be self-contained in their bundles, and may not read or write data outside the designated container area, nor may they download, install, or execute code which introduces or changes features or functionality of the app, including other apps. Educational apps designed to teach, develop, or allow students to test executable code may, in limited circumstances, download code provided that such code is not used for other purposes. Such apps must make the source code provided by the app completely viewable and editable by the user.

Why is this related to local LLMs in apps?

GeekyBear an hour ago | parent [-]

A vibe coding app that generates new executable code and runs it would:

> execute code which introduces or changes features or functionality of the app,

Gareth321 7 hours ago | parent | prev | next [-]

I think Apple will become increasingly draconian about LLMs. Very soon people won't need to buy many of their apps. They can just make them. This threatens Apple's entire business model.

raw_anon_1111 6 hours ago | parent | next [-]

It came out in the Epic trial that 90% of App Store revenue comes from in app purchases of loot boxes and other pay to win mechanics.

Apple doesn’t care about revenue from a random TODO app.

thinkthatover an hour ago | parent [-]

truly a k-shaped economy we live in

GeekyBear an hour ago | parent | prev | next [-]

They are said to be introducing a framework to make it easier to integrate modern LLMs into apps in a couple of months at WWDC.

https://9to5mac.com/2026/03/01/apple-replacing-core-ml-with-...

mrkpdl 7 hours ago | parent | prev | next [-]

But… why would I put the effort into getting an LLM to make me an app when there's an existing app that I don't have to maintain? I don't want to have to make every app I use.

orrito 7 hours ago | parent [-]

There's a huge difference between local apps that cost a one-time $3–10 and apps that ask for a subscription of $5–20 per month. The first category will remain and might become more popular as quality increases; the second category will be obliterated, as the value isn't there even if all the buyers are rich. The second group takes up a much larger part of the pie than the first, though, so Apple's revenue will decrease.

davidmurdoch 6 hours ago | parent [-]

All apps that don't have a tangible component, legal protection (like music, tv, movies), or a personality behind it will trend towards $0.

StilesCrisis 7 hours ago | parent | prev | next [-]

Apple's business model isn't really affected by 2% of its users choosing not to spend $100/yr on the App Store. That isn't even a blip on the radar.

A kid playing Roblox can spend more than that in a good weekend.

borborigmus 7 hours ago | parent | prev | next [-]

VibeOS. It’s just an LLM from which all other userspace is vibed.

username223 3 hours ago | parent [-]

vibe-ls(1) - often list directory contents, but maybe do something else.

Where can I get this amazing technology?

Forgeties79 7 hours ago | parent | prev [-]

I guess I'm not seeing why I would want to abandon most (if any) of my simple, small, purpose-built apps that always do the exact thing I want, in favor of a private company's ever-changing LLM that will approximate what I'm asking and approximate its response while using far more resources.

I’m sure there are things on my phone it could replace (though I struggle to think of them) but there are plenty it can’t. My black magic camera app, web browsers, local send, libby/hoopla…

I can’t really think of any apps I use every day - or every week - that an LLM would replace. I’m not coding on my smartphone and aside from that an LLM is basically a more complex, somewhat inconsistent search engine experience right now for most people. Siri didn’t replace any of my apps, for instance. Why would chatGPT?

TL;DR: what apps would an LLM replace on my iPhone?

CubsFan1060 7 hours ago | parent | prev | next [-]

Though of course Apple's rules aren't always consistent, I have 2 separate apps currently on my phone that can run this (Google's Edge Gallery and Locally AI).

codybontecou 3 hours ago | parent | next [-]

They've been slowly cutting them off from updates and/or taking them off the App Store entirely.

See Anywhere and Replit. Anywhere was the #1 or #2 app and was taken off the App Store entirely before being put back on and then taken off again.

Last I checked, Replit hadn't received an update on the iOS App Store in over two months due to reviews denying them.

cyanydeez 7 hours ago | parent | prev [-]

It can't just be a SaaSpocalypse. LLMs with the right harness could obliterate much of the TODO+ app space with a general assistant.

But it's more likely just walled garden + security theatre that'll keep them from allowing outside apps.

varispeed 7 hours ago | parent [-]

I wouldn't trust AI to run a TODO app, especially weak models. They can hallucinate tasks, forget to remind you, etc.

tapvt 6 hours ago | parent [-]

LLMs are stateless. But given an actual database of task-shaped items and some work, I could see the potential.

With a canonical source of truth, and set input/output expectations, the potential blast radius is quite small.
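The "canonical source of truth with set input/output expectations" idea can be sketched as a tiny harness. All names here are made up for illustration: the model's raw output is treated as an untrusted tool call, validated against a whitelist and the real task store before anything happens, which is what keeps the blast radius small:

```python
import json

# Canonical task store: the database, not the LLM, holds the truth.
tasks = {1: {"title": "file taxes", "done": False}}

# The model may only perform whitelisted actions.
ALLOWED_ACTIONS = {"complete"}

def apply_model_output(raw: str) -> str:
    """Validate an LLM 'tool call' before it touches the task store."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "rejected: not JSON"
    if call.get("action") not in ALLOWED_ACTIONS:
        return "rejected: unknown action"
    task = tasks.get(call.get("id"))
    if task is None:
        # A hallucinated task id is rejected, never invented.
        return "rejected: no such task"
    task["done"] = True
    return "ok"

print(apply_model_output('{"action": "complete", "id": 1}'))   # ok
print(apply_model_output('{"action": "delete_everything"}'))   # rejected
print(apply_model_output('{"action": "complete", "id": 99}'))  # rejected
```

A hallucinated action or task id simply bounces off the validator; the worst case is a rejected call, not a corrupted task list.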

wpm 5 hours ago | parent [-]

And the end result is.....? What? A todo app that takes 16GB of RAM?

bigyabai 5 hours ago | parent [-]

Nothing that Mac and Windows users aren't already used to.

Forgeties79 3 hours ago | parent [-]

It's tempting to be flippant about macOS/Windows, but in all seriousness, the resources required for an LLM to do the job of a typical lighter-weight app are a serious consideration. No amount of bloat matches what an LLM needs.

bigyabai an hour ago | parent [-]

> No amount of bloat matches what an LLM needs.

I don't think that's necessarily true. For instance, LinkedIn uses more memory than Gemma E2B inference does.

Forgeties79 20 minutes ago | parent [-]

LinkedIn is an entirely different category, and an extreme case at that. We're not talking about LLMs replacing LinkedIn either; it's an entirely different comparison/discussion.

pj_mukh 6 hours ago | parent | prev | next [-]

Is this an issue with Cactus compute stuff as well?

MillionOClock 7 hours ago | parent | prev | next [-]

What is your app doing? Just LLM inference?

codybontecou 3 hours ago | parent [-]

It's a custom agent harness with on-device models and the ability to swap between models.

Basically, a "toy" app to showcase where we are with coding agents on-device.

saagarjha 7 hours ago | parent | prev | next [-]

Use of the LLMs to do what?

amelius 4 hours ago | parent | prev [-]

Seriously, how do people put up with being nannied by Apple?

Come on folks, their IT hardware may be nice but supporting them is not worth it.

karimf 8 hours ago | parent | prev | next [-]

Related: Gemma 4 on iPhone (254 comments) - https://news.ycombinator.com/item?id=47652561

redbell 8 hours ago | parent [-]

Another related submission from 22 days ago : iPhone 17 Pro Demonstrated Running a 400B LLM (+700pts, +300cmts): https://news.ycombinator.com/item?id=47490070

zozbot234 7 hours ago | parent [-]

That's very impressive, but it's streaming weights in from flash storage. That's not really viable in a mobile context; it will use way too much power. Smaller models are far more applicable to typical use, perhaps with mid-sized models (like the Gemma 4 26B-A4B model) using weight offload from SSD for rare uses involving slower "pro" inference.
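A back-of-envelope calculation shows why streaming weights from flash is so punishing. Every number below is an assumption chosen for illustration (4-bit quantization, ~4B active parameters per token as in a small MoE, and a phone-class ~4 GB/s flash read path):

```python
BYTES_PER_PARAM = 0.5  # assumed 4-bit quantization

def tokens_per_sec(active_params_billions: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight streams from
    storage once per token (ignoring compute, caching, and power)."""
    bytes_per_token = active_params_billions * 1e9 * BYTES_PER_PARAM
    return bandwidth_gb_s * 1e9 / bytes_per_token

# ~4B active params over a ~4 GB/s link: roughly 2 tokens/sec,
# with the storage controller pinned the whole time.
print(tokens_per_sec(4, 4))
```

Even before counting power draw, the bandwidth ceiling alone makes SSD-streamed weights a "rare slow pro query" tool rather than an interactive one.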

hadlock an hour ago | parent [-]

10 minutes a day of extreme power usage is probably fine for people asking for directions to the store, setting calendar reminders, timers, checking for important emails etc. AI on your phone will be incredibly useful but power usage doesn't matter when total usage is less than 15 minutes per day. I don't think the average person expects to vibe code on the phone for 8 hours a day.

zozbot234 16 minutes ago | parent [-]

10 or 15 minutes a day is what the inference workload looks like on fairly small models. Once you start streaming weights in from SSD, things slow down quite a bit and become quite power hungry.

abc_lisper an hour ago | parent | prev | next [-]

Careful with these small models. The other day I asked one "Can dogs eat avocado?" and the answer was an emphatic yes.

This is not meant as criticism, but people should be aware of their limitations.

jacobr1 an hour ago | parent [-]

well, technically they can ...

mfro 6 hours ago | parent | prev | next [-]

Strangely, it is super fast on my 16 Plus, but with longer messages it can slow down a LOT, and not because of thermal throttling. I wish I could see some diagnostic data.

steve-atx-7600 6 hours ago | parent [-]

Inference from an LLM is O(tokens^2)

halJordan 3 hours ago | parent [-]

Only in the naive implementations of attention
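The quadratic slowdown the two comments above are discussing can be sketched as a running total. This is a toy cost model, assuming one attention score per (new token, earlier token) pair with a KV cache; optimized attention kernels change the constants but not the asymptotics:

```python
def decode_cost(n: int) -> int:
    """Total attention scores computed while decoding n tokens:
    token i attends over the i tokens before it, so the sum is
    1 + 2 + ... + n = n*(n+1)/2, i.e. O(n^2)."""
    return sum(i for i in range(1, n + 1))

# Doubling the conversation length roughly quadruples total attention
# work, which matches "longer messages slow down a LOT":
print(decode_cost(1000))  # 500500
print(decode_cost(2000))  # 2001000
```

Per-token cost also grows linearly with context, so the slowdown appears gradually as the chat gets longer, independent of thermals.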

conception 6 hours ago | parent | prev | next [-]

I'm pretty excited about the Edge Gallery iOS app with Gemma 4 on it, but it seems like they hobbled it: no access to intents, and you have to write custom plugins for web search, etc. Does anyone have a favorite way to run these usefully? ChatMCP works pretty well but only supports models via API.

Chrisszz 7 hours ago | parent | prev | next [-]

I just installed Google AI Edge Gallery on my iPhone 16 Pro. Here are the results of the first benchmark with GPU, prefill tokens = 256, decode tokens = 256, number of runs = 3: prefill speed = 231 t/s, decode speed = 16 t/s, time to first token = 1.16 s, first init time = 20 s.
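Turning the reported numbers above into wall-clock time for one 256-in/256-out run (simple arithmetic on the figures from the comment, nothing more):

```python
# Figures reported in the parent comment.
prefill_tokens, decode_tokens = 256, 256
prefill_speed, decode_speed = 231.0, 16.0  # tokens/sec

prefill_s = prefill_tokens / prefill_speed  # ~1.1 s to process the prompt
decode_s = decode_tokens / decode_speed     # 16 s to generate the reply

print(prefill_s, decode_s)
```

So generation, not prompt processing, dominates at this size: roughly 1 second of prefill versus 16 seconds of decode, plus the one-time 20 s model init.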

rich_sasha 4 hours ago | parent | prev | next [-]

Offline or not, I'm sure Google uploads every keystroke, phone orientation, photo, WiFi endpoints and your shoe size when you interact with it. To enhance your experience.

adrian17 3 hours ago | parent | next [-]

They released the source (well, currently only the Android version) at https://github.com/google-ai-edge/gallery .

At a glance, I see they do gather analytics about how much the app is used (model downloads, model invocations, etc.) without message content; pretty much just which model was used.

tsycho 4 hours ago | parent | prev | next [-]

> ...your shoe size

The funny thing is that a lot of Google's internal training content uses an imaginary product "gShoe", and discusses the privacy implications of data that such a shoe might collect :D

declan_roberts 3 hours ago | parent | prev [-]

Apple is paying Google $1billion for an AI strategy that runs on device. We're seeing the preview of what that will look like.

politelemon an hour ago | parent | prev | next [-]

This is HN clickbait. No details or evidence, this is just generated for votes.

I think this should be flagged.

juancn 3 hours ago | parent | prev | next [-]

Gemma 4 is still power hungry, since it tends to activate pretty much every weight.

qwen3-coder-next uses a lot less, since it seems to activate only ~3B parameters at a time.

My guess is that this is still close to tech demo, and a lot of performance is left on the table.

jimbokun 5 hours ago | parent | prev | next [-]

I feel like UX and API design are very under explored.

What are the possibilities of an Android or iOS device where the OS is centered around a locally running LLM with an API for accessing it from apps, along with tools the LLM can call to access data from locally running apps? What’s the equivalent of the original Mac OS?

Do apps disappear and there’s just a running dialog with the LLM generating graphical displays as needed on demand?

mistic92 9 hours ago | parent | prev | next [-]

It runs on Android too, with AICore or even with llama.cpp.

srslyTrying2hlp 6 hours ago | parent [-]

It's more impressive when Apple does it because they are so far behind.

I remember being excited when Apple got widgets, because then I could add my "next alarm time" to my home screen. It made my company work phone usable on trips.

I wonder when they are going to get NVIDIA cards or CUDA? Then they could actually run LLMs and not just trick people into buying under the 30-year-old idea of "unified memory".

bigyabai 6 hours ago | parent [-]

It's kinda funny that macOS supported CUDA when it was a tech demo, but then ideologically objected to it once it became a $3 trillion business.

They've had to be dragged kicking and screaming away from the NPU model, only to admit that GPGPU tech was the right choice.

srslyTrying2hlp 5 hours ago | parent [-]

Yeah I remember that. Very Apple of them.

'Cool demo' -> doesn't convert to tangible things.

Won't attempt to compete with companies better than them, but goes their own route. "Oh look, it consumes low power!" (things no one cared about).

They are the Nintendo of tech.

usmanshaikh06 8 hours ago | parent | prev | next [-]

ESET is blocking this site saying:

Threat found This web page may contain dangerous content that can provide remote access to an infected device, leak sensitive data from the device or harm the targeted device. Threat: JS/Agent.RDW trojan

zache6 6 hours ago | parent [-]

Same on my device.

declan_roberts 3 hours ago | parent | prev | next [-]

I really hope this is a preview of the replacement for Siri that Google is creating, because these models are fantastic for their size!

halJordan 3 hours ago | parent [-]

Google is not creating a replacement for anything.

Apple is getting a base Gemini model (not a Gemma), and it will run on Apple's Private Cloud Compute. Apple's foundation models will remain the on-device models.

deckar01 5 hours ago | parent | prev | next [-]

They still don’t render the markdown (or LaTeX) it outputs.

pabs3 8 hours ago | parent | prev | next [-]

> edge AI deployment

Isn't the "edge" meant to be computing near the user, but not on their devices?

stingraycharles 8 hours ago | parent | next [-]

No, it isn't. This is about as "edge" as AI gets.

In a general sense, edge just means moving the computation to the user, rather than in a central cloud (although the two aren’t mutually exclusive, eg Cloudflare Workers)

pgt 8 hours ago | parent | prev | next [-]

Your device is the ultimate edge. The next frontier would be running models on your wetware.

elcritch 8 hours ago | parent | next [-]

Not just running it on your wetware, but charging you for it.

Can't wait until AI companies go from mimicking human thought to figuring out how to license those thoughts. ;)

acters 8 hours ago | parent | prev [-]

Man can't wait for AI in my brain. And then intelligence will be pay to win.

hhh 8 hours ago | parent | prev | next [-]

It depends, because edge is a meaningless term and people choose what they want for it. In 2022, we set up a call with a vendor for ‘edge’ AI. Their edge meant something like 5kW, and our edge was a single raspberry pi in the best case.

davidmurdoch 5 hours ago | parent | prev [-]

For sure. 1000%. Anyone disagreeing with this has lost their marbles.

For those that have lost their marbles: sure, people use words incorrectly, but that doesn't mean we all have to use those words incorrectly.

In compute vernacular, "edge" means it's distributed in a way that the compute is close to the user (the "user" here is the device, not a person); "on device" means the compute is on the device. They do not mean the same thing.

bearjaws 7 hours ago | parent | prev | next [-]

Would love to see a showdown of performance on the iPhone vs Google's Tensor G5, which in my experience is 2 full generations behind performance-wise.

DoctorOetker 6 hours ago | parent | prev | next [-]

does anyone know of a decent but low memory or low parameter count multilingual model (as many languages as possible), that can faithfully produce the detailed IPA transcription given a word in a sentence in some language?

I want to test a hypothesis for "uploading" neural network knowledge to a user's brain, by a reaction-speed game.

estimator7292 6 hours ago | parent [-]

Espeak-ng.

You don't need a neural network. Traditional NLP is far better at this task. The keyword you're looking for is "phonemizer".

DoctorOetker 6 hours ago | parent [-]

Can espeak-ng provide the IPA transcription, or does it only produce sound?

I'm surprised traditional NLP being better than ML models for this task, can you point me to a benchmark analysis pointing out that non-neural Espeak-ng is better than ML models?

Also, I asked for a neural model for another reason as well: I still want semantic knowledge present, and I want more than pronunciation. But before I use myself as a test subject, I want to make sure I get the proper pronunciation, in case the highly speculative "uploading game" works... I don't want to systematically mis-train myself on pronunciation early on.

logicallee 8 hours ago | parent | prev | next [-]

For those who would like an example of its output, I'm currently working through creating a small, free (CC0, public domain) encyclopedia (just a couple of thousand entries) of core concepts in Biology and Health Sciences, Physical Sciences, and Technology. Each entry is being written entirely by Gemma 4:e4b (the 10 GB model). I believe this may be slightly larger than the model that runs locally on phones, so perhaps this model is slightly better, but the output is similar. Here is an example entry:

https://pastebin.com/ZfSKmfWp

Seems pretty good to me!

everyday7732 7 hours ago | parent [-]

What's your goal? Do you have a project you want the encyclopedia for?

the_inspector 6 hours ago | parent | prev | next [-]

You are referring to the edge models, right? E2B and E4B, not the bigger ones (26B, 31B)...

ValleZ 8 hours ago | parent | prev | next [-]

There are many apps to run local LLMs on both iOS & Android

srslyTrying2hlp 6 hours ago | parent [-]

Once you realize Apple and Nintendo have 'understandings' with media outlets, you will start to realize this is just marketing.

grimmai143 6 hours ago | parent | prev | next [-]

Do you know of a way of running these models on Android? Also, what does the thermal throttling look like?

robmccoll 6 hours ago | parent [-]

Edge Gallery by Research at Google

andsoitis 13 hours ago | parent | prev | next [-]

is there a comparison of it running on iPhone vs. Android phones?

jeroenhd 7 hours ago | parent | next [-]

Running Gemma-4-E2B-it on an iPhone 15 (can't go higher than that due to RAM limitations) versus a Pixel 9 Pro, I don't really notice much of a difference between the two. The Pixel is a bit faster, but also a year more recent.

The model itself works absolutely fine, though the iPhone thermal throttles at some point which really reduces the token generation speed. When I asked it to write me a business plan for a fish farm in the Nevada desert, it slowed down after a couple thousand tokens, whereas the Pixel seems to just keep going.
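
The throttling hit is easy to underestimate. Back-of-envelope, with made-up but plausible numbers (30 t/s sustained dropping to 10 t/s once the device heats up around the 1,500-token mark):

```python
def generation_time(total_tokens: int, fast_tps: float, slow_tps: float,
                    throttle_at: int) -> float:
    """Wall-clock seconds to emit total_tokens if speed drops at throttle_at."""
    fast_tokens = min(total_tokens, throttle_at)
    slow_tokens = max(0, total_tokens - throttle_at)
    return fast_tokens / fast_tps + slow_tokens / slow_tps

# A 3,000-token answer: device that stays cool vs. one that throttles
no_throttle = generation_time(3000, 30, 30, 1500)  # 100.0 s
throttled = generation_time(3000, 30, 10, 1500)    # 50 + 150 = 200.0 s
```

So even a modest 3x slowdown on the back half doubles the wall-clock time of a long answer, which matches the "slowed down after a couple thousand tokens" feel.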

veunes 3 hours ago | parent [-]

It’s likely a llama.cpp backend issue. On the Pixel, inference hits QNN or a well-optimized Vulkan path that distributes the SoC load properly. On the iPhone, everything is shoved through Metal, which maxes out the GPU immediately and causes instant overheating. Until Apple opens up low-level NPU access to third-party models, iPhones will just keep melting on long-context prompts

lrvick 9 hours ago | parent | prev [-]

You can run Android on just about anything so it boils down to Linux GPU benchmarks.

fsiefken 8 hours ago | parent | next [-]

That doesn't answer the question; I'm curious too. I suspect the A19 Pro chip has a speed and battery advantage over the Snapdragon 8 Elite Gen 5, but to know for sure you'd have to run the same model, in the most efficient way each platform allows, on both flagships (iOS and Android).

srslyTrying2hlp 6 hours ago | parent | prev [-]

I don't think you should have been downvoted. Processing and memory are the only things that matter. (Unless we are being so nontechnical now that we just say things like "the Pixel 9 is great"...)

bossyTeacher 9 hours ago | parent | prev | next [-]

Is the output coherent though? I have yet to see a local model on consumer-grade hardware that is actually useful.

the_pwner224 7 hours ago | parent | next [-]

I have a 128 GB Strix Halo tablet (same as the other commenter here with the Framework Desktop). I'm using the larger Gemma 4 26B-A4B model (only 28 GB @ Q8) and it's been working great and runs very fast.

It's a 100% replacement for free ChatGPT/Gemini.

Compared to the paid pro/thinking models... Gemma does have reasoning, and I have used the reasoning mode for some tax & legal/accounting advice recently as well as other misc problems. It's worked well for that, but I haven't tried any real difficult tasks. From what I've heard re. agentic coding, the open weight models are ~18-24 months behind Anthropic & Google's SOTA.

Qwen 3.5 122B-A10B should just fit into 128 GB with a Q4/5 and may be a bit smarter. There's apparently also a similar sized Gemma 4 model but they haven't released it yet, the 26B was the largest released.
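
The sizes above follow directly from parameter count times bits per weight. A rough estimator (it ignores KV cache and runtime overhead, which is why the real files run a bit larger):

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params * bits / 8."""
    return params_billions * bits_per_weight / 8

q8_26b = weight_gb(26, 8)    # ~26 GB, close to the 28 GB file incl. overhead
q4_122b = weight_gb(122, 4)  # ~61 GB
q5_122b = weight_gb(122, 5)  # ~76 GB; still fits in 128 GB with KV cache room
```

That's why a Q4/Q5 of the 122B is the practical ceiling on a 128 GB machine.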

veunes 3 hours ago | parent | next [-]

Sure, 26B models on beefy desktop silicon are finally nipping at the heels of commercial APIs, but this is a mobile thread. On a phone with 8GB of RAM and passive cooling, your tokens per second (t/s) are going to fall off a cliff after the first minute of sustained compute

zozbot234 7 hours ago | parent | prev [-]

There's a 31B dense model in the Gemma 4 series that's obviously going to be smarter (though a whole lot slower) than the MoE 26A4B.

the_pwner224 7 hours ago | parent [-]

I tried it and it was unusably slow at ~5-6 TPS. 26A4B gets close to 40 TPS which is faster than you can read, and still pretty quick with reasoning enabled.

RobMurray 4 hours ago | parent | prev | next [-]

I'm working on a visual description app for the blind. Even Gemma 4 E2B can give very useful image descriptions while taking questions as audio at the same time. It's also much faster than most of the popular cloud-based apps like Be My Eyes.

jeroenhd 8 hours ago | parent | prev | next [-]

Google's models work quite well on my Android phone. I haven't found a use case beyond generating shitposts, but the model does its job pretty well. It's not exactly ChatGPT, but minor things like "alter the tone of this email to make it more professional" work like a charm.

You need a relatively beefy phone to run this stuff on large amounts of text, though, and you can't have every app run it because your battery wouldn't last more than an hour.

I think the real use case for apps is going to be tiny, purpose-trained models, like the 270M models Google wants people to train and use: https://developers.googleblog.com/on-device-function-calling... With these, you can set up somewhat intelligent situational automation without having to work out logic trees and edge cases beforehand.
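
The function-calling pattern those tiny models enable is basically structured output plus an app-side dispatcher. A minimal sketch — the JSON shape and the registered functions here are hypothetical, not Google's actual format:

```python
import json

# App-side registry of callable functions
def set_alarm(hour: int, minute: int) -> str:
    return f"alarm set for {hour:02d}:{minute:02d}"

def send_text(to: str, body: str) -> str:
    return f"text to {to}: {body}"

REGISTRY = {"set_alarm": set_alarm, "send_text": send_text}

def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool call and invoke the matching function."""
    call = json.loads(model_output)
    fn = REGISTRY.get(call["function"])
    if fn is None:
        raise ValueError(f"unknown function: {call['function']}")
    return fn(**call["args"])

# A small model trained for tool use would emit something like:
out = dispatch('{"function": "set_alarm", "args": {"hour": 7, "minute": 30}}')
# -> "alarm set for 07:30"
```

The model only has to learn "situation -> tool call"; all the actual logic stays in app code, which is why a 270M model can be enough.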

lrvick 9 hours ago | parent | prev | next [-]

I run Qwen3.5 122B on a Framework Desktop at 35 t/s as a daily driver for security, OS, and software engineering work.

Never paid an LLM provider and I have no reason to ever start.

mixermachine 8 hours ago | parent [-]

What spec of Framework Desktop do you run this on?

the_pwner224 7 hours ago | parent | next [-]

If you're looking to buy new hardware, also consider the Asus Rog Flow Z13. It has the same chip as the Framework desktop and is ~20% cheaper ($2,700) for the 128 GB spec while coming in a tablet/laptop form factor. It's capped at a slightly lower power but Strix Halo scales down very well in TDP - I never even use the max power mode on my Z13 because you don't really get any extra perf.

The only downside is that I suspect the Framework would be a decent bit quieter under load (not that this thing is abnormally loud). You're also limited to a single internal M.2 2230 SSD slot (I believe Micron recently launched a 4 TB model, but generally you'll max out at 2 TB without an external enclosure).

I don't have anything against the Framework, I'm sure it's a great machine, but the Z13 is an incredible portable all-in-one device that can handle everything from general PC use to gaming to tablet/entertainment to LLMs & high perf.

lrvick 18 minutes ago | parent [-]

The main reason to get the framework is the DIY edition for the mini-itx form factor board and support. If you do not care about those then any cheap proprietary format box with the Strix Halo chip will do.

I put my boards in mini-ITX rack mounts personally, so Framework is the only option.

breisa 8 hours ago | parent | prev [-]

There's only one board; for this model you need the configuration with 128 GiB of RAM.

lrvick 19 minutes ago | parent | next [-]

I have the 128 GB one, but for the Qwen3.5 122B XS quant you only need 64 GB.

caminante 6 hours ago | parent | prev [-]

which start at $3k [0].

[0] https://frame.work/products/desktop-diy-amd-aimax300/configu...

fsiefken 8 hours ago | parent | prev | next [-]

Qwen3.5-9b and Qwen3.5-27b are pretty coherent on my 24 GB Android phone

dpacmittal 8 hours ago | parent [-]

Which Android phone has 24 GB?

jfoster 9 hours ago | parent | prev | next [-]

It can write (some) code that works. Just roughly guessing from my use, I think of it as roughly ChatGPT circa 2024 in capability and speed.

Disappointing if you compare it to anything else from 2026, but fairly impressive for something that can run locally at an OK speed.

logicallee 8 hours ago | parent | prev | next [-]

It's highly coherent (see my other comment for an example of its text output) and yes, it's useful. I am starting to use Gemma 4:e4b as my daily driver for simple commands it definitely knows, things that are too simple to use ChatGPT for. It is also able to code through moderately difficult coding tasks. If you want to see it in action, I posted a video about it here[1] (the 10 GB one is at the 2 minute mark and the 20 GB one says hello at 5 minutes 45 seconds into the video.) You can see its speed and output on simple consumer grade hardware, in this case a Mac Mini M4 with 24 GB of RAM.

[1] https://youtube.com/live/G5OVcKO70ns

a_paddy 9 hours ago | parent | prev [-]

I can try it for you

abstracthinking 2 hours ago | parent | prev [-]

I don't see the value in this post. Are Hacker News posts being upvoted by bots?