Fine-tuned Qwen models run surprisingly well on NVIDIA Jetson hardware. We've deployed several 7B variants for edge AI tasks where latency matters more than raw accuracy – think industrial inspection, or retail analytics where you can't rely on cloud connectivity. The key is that LoRA fine-tuning keeps the model small enough to fit in unified memory while still hitting production-grade inference speeds. The biggest surprise was power efficiency: a Jetson Orin can run continuous inference at under 15W, while cloud round-trips burn far more energy at scale.

▲

andai 8 hours ago | parent | next [-]

Very interesting. Could you give examples of industrial tasks where lower accuracy is acceptable?

▲

w10-1 3 hours ago | parent | prev | next [-]

> NVIDIA Jetson hardware ... 15W

7B on 15W could be any of the Orin (TOPS): Nano (40), NX (100), AGX (275)

Curious if you've experimented with a larger model on the Thor (2070)

▲

embedding-shape 7 hours ago | parent | prev [-]

> where latency matters more than raw accuracy – think industrial inspection

Huh? Why would industrial inspection, in particular, benefit from lower latency in exchange for accuracy? Sounds a bit backwards, but maybe I'm missing something obvious.

▲

someotherperson 6 hours ago | parent [-]

At a very high level, think fruit sorting[0] where the conveyor belt doesn't stop rolling and you need to rapidly respond, and all the way through to monitoring for things like defects in silicon wafers and root causing it. Some of these issues aren't problematic on their own, but you can aggregate data over time to see if a particular machine, material or process within a factory is degrading over time. This might not be throughout the entire factory but isolated to a particular batch of material or a particular subsection within it. This is not a hypothetical example: this is an active use case.

[0] https://www.youtube.com/watch?v=vxff_CnvPek
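
The "aggregate data over time" idea could look roughly like the sketch below: keep a rolling window of pass/fail results per machine and flag any machine whose recent defect rate crosses a threshold. Everything here is an assumption for illustration (the `DefectRateMonitor` name, the window size, the threshold, the machine IDs); it is not taken from any real factory system.

```python
from collections import defaultdict, deque


class DefectRateMonitor:
    """Rolling per-machine defect rate (hypothetical helper for illustration)."""

    def __init__(self, window=100, threshold=0.05):
        self.window = window          # number of most recent inspections kept per machine
        self.threshold = threshold    # defect rate above which a machine is flagged
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, machine_id, is_defect):
        # Store 1 for a defect and 0 for a pass; old entries fall off automatically.
        self._history[machine_id].append(1 if is_defect else 0)

    def defect_rate(self, machine_id):
        history = self._history.get(machine_id)
        if not history:
            return 0.0
        return sum(history) / len(history)

    def degrading(self):
        # Only report machines with a full window, to avoid noisy early flags.
        return sorted(
            machine_id
            for machine_id, history in self._history.items()
            if len(history) == self.window
            and sum(history) / len(history) > self.threshold
        )


monitor = DefectRateMonitor(window=10, threshold=0.2)
for i in range(10):
    monitor.record("press-A", is_defect=(i % 2 == 0))  # every other part is bad
    monitor.record("press-B", is_defect=False)         # all parts pass

print(monitor.degrading())
```

A single bad part never trips the flag on its own; only a sustained rate over the window does, which matches the point that individual issues may be harmless while the trend is not.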

▲

sorenjan 5 hours ago | parent | next [-]

But that's not something you'd use an LLM for. There have been computer vision systems sorting bad peas for more than a decade[0]; of course there are plenty of use cases for very fast inspection systems. But when would you use an LLM for anything like that?

[0] https://www.youtube.com/watch?v=eLDxXPziztw

▲

arcanemachiner an hour ago | parent | next [-]

Nobody said you would use an LLM for that. It's an example of a process where "industrial inspection, in particular, [would] benefit from lower latency in exchange for accuracy".

The point of their comment isn't that you would use an LLM to sort fruit. It was just an illustrative example.

▲

sorenjan 5 minutes ago | parent [-]

The discussion was about fine-tuned Qwen models, not industrial inspection in general. I would also find it interesting to learn what kind of edge AI industrial inspection task you could do with fine-tuned LLMs, not some handwavy answer about how sometimes latency is important in real-time systems. Of course it is, so generally you don't use models with several billion parameters unless you need to.

▲

0xbadcafebee 3 hours ago | parent | prev [-]

You would use a VLM (vision language model). The model analyzes the image and outputs text, along with general context, that can drive intelligent decisions. https://tryolabs.com/blog/llms-leveraging-computer-vision

▲

embedding-shape 6 hours ago | parent | prev [-]

But why would I want the results to be faster but less reliable, versus slower and more reliable? Feels like the sort of thing where you'd favor accuracy over speed; otherwise you're just degrading the quality control?

▲

CamouflagedKiwi an hour ago | parent | next [-]

It's not that you want it to be faster, but you want the latency to be predictable and reliable, which is much more the case for local inference than sending it away over a network (and especially to the current set of frontier model providers who don't exactly have standout reliability numbers).

▲

bigyabai 6 hours ago | parent | prev | next [-]

The high-nines of fruit organization are usually not worth running a 400 billion parameter model to catch the last 3 fruit.

▲

0cf8612b2e1e 6 hours ago | parent | prev [-]

A local, offline system you control is worth a lot. Introducing an external dependency guarantees you will have downtime outside of your control.

▲

embedding-shape 3 hours ago | parent [-]

Right, but that doesn't answer why you'd need a fast 7B LLM rather than a slightly less fast 14B LLM.

▲

0cf8612b2e1e 3 hours ago | parent | next [-]

In the hypothetical fruit sorting example, if you have a hard budget of 10 msec to respond and the 7B takes 8 msec and the 14B takes 12 msec, there is your imaginary answer. It's regular engineering, where you have to balance competing constraints instead of running the biggest model available.
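
To make the budget argument concrete, here is a minimal sketch that picks the largest model that still fits a hard latency budget. The 8 msec and 12 msec figures are the hypothetical ones from the comment above, not measurements, and `ModelOption` and `pick_model` are made-up names for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    params_b: float    # parameter count in billions
    latency_ms: float  # worst-case latency per item (hypothetical, not measured)


def pick_model(options: Sequence[ModelOption], budget_ms: float) -> Optional[ModelOption]:
    """Return the largest model whose latency fits the budget, or None if none fits."""
    feasible = [m for m in options if m.latency_ms <= budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m.params_b)


options = [
    ModelOption("7B", params_b=7, latency_ms=8.0),
    ModelOption("14B", params_b=14, latency_ms=12.0),
]

choice = pick_model(options, budget_ms=10.0)
print(choice.name if choice else "nothing fits the budget")  # prints "7B"
```

With a 10 msec budget the 14B is simply infeasible, however much more accurate it might be; relax the budget to 12 msec and the same function returns the 14B instead.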

▲

0xbadcafebee 3 hours ago | parent | prev | next [-]

...because sometimes people need a faster answer? There are many possible reasons someone might need speed over accuracy. In the food sorting example, if lower accuracy means you waste more peanuts, but the speed means you get rid of more bad peanuts overall, then you get fewer complaints about bad peanuts, with a tiny amount of extra material waste.

▲

jwatte an hour ago | parent | prev [-]

Hard real time is a thing in some systems. Also, the current approaches might have 85% accuracy -- if the LLM can deliver 90% accuracy while being "less exact" that's still a win!