onion2k 7 hours ago

None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.

The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

janalsncm 6 hours ago | parent | next [-]

> Being able to nail a zero-shot greenfield project is relatively easy even for a small model

Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.

hollowturtle 3 hours ago | parent | next [-]

In what era did spinning up a PoC require a week of work? Especially on the web. I've been a developer for roughly 20 years and that has never been the case, to the point that I believe the people impressed by LLMs are the same ones who had very low productivity. Today we have game jams as short as 3 days, and talented people are able to produce very good PoCs, some of them almost complete!

spiralcoaster 3 hours ago | parent [-]

So what you're saying is that all PoCs are guaranteed to take less than a week of work.

What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.

cyanydeez 6 hours ago | parent | prev | next [-]

I love the ability to spin up any repo on github by pointing a local model at it with zero cost beyond the heat & electricity.

ai_fry_ur_brain 5 hours ago | parent | prev [-]

Yeah, and we still do take a week for people that actually care.

If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I couldn't care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.

I love typing code and thinking for myself. I'm going to continue to do that. I still don't know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.

Spending 6k on hardware to run the world's most mediocre model truly does make you an incredibly stupid person, so I'm not really surprised by these comments of people saying these tiny models are helping them so much.

It's like a special needs kid all of a sudden got the ability to code, of course they'd be impressed by basically all the code it produces.

j_bum 4 hours ago | parent [-]

I mean, have you looked for examples of things that people are using local models to build and ship? Or are you just assuming it doesn’t happen?

I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able to use it for reasonably scoped tasks.

I’m not saying these models are perfect.

But you are complaining about people on the extreme, while at the same time shouting from the opposite extreme.

Aurornis 5 hours ago | parent | prev | next [-]

> and it can fall back to similar examples in the training data easily.

This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.

My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.

Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.

If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.

The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read about it online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.

CMay 4 hours ago | parent [-]

This is my experience too. Qwen optimizes for a lot of scenarios, which masks its weaker generalization compared to US frontier models.

Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
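The same-seed comparison described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual tooling: it assumes you have already captured two generations (one with an fp16 kv cache, one with a quantized cache) for the same prompt and seed as lists of tokens. The function names `first_divergence` and `shared_prefix_ratio` are hypothetical, and the token lists in the usage lines are made-up placeholders, not real model output.

```python
def first_divergence(baseline, candidate):
    """Return the index of the first differing token, or None if identical.

    `baseline` and `candidate` are token sequences (strings or ids) from two
    runs that used the same prompt and the same seed.
    """
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # One run stopped early: they diverge where the shorter one ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None


def shared_prefix_ratio(baseline, candidate):
    """Fraction of the longer run that both runs agree on before diverging."""
    longest = max(len(baseline), len(candidate))
    if longest == 0:
        return 1.0
    idx = first_divergence(baseline, candidate)
    return 1.0 if idx is None else idx / longest


# Placeholder data standing in for two captured generations.
fp16_run = ["fn", "main", "(", ")", "{"]
quant_run = ["fn", "main", "(", "args", "{"]
print(first_divergence(fp16_run, quant_run))
print(shared_prefix_ratio(fp16_run, quant_run))
```

An early divergence is not proof of a worse answer, since two different continuations can both be correct, so this only tells you where to start reading. The verified task you know the model can complete is still the real check.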

For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.

Zambyte 5 hours ago | parent | prev | next [-]

I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.

sosodev 7 hours ago | parent | prev | next [-]

In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There are just too many decisions to be made and they're not good at that.

Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
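The "point it at the relevant files" approach above could be sketched roughly like this. It is only an illustration under stated assumptions: `build_scoped_prompt` is a hypothetical helper, the prompt layout is one arbitrary choice among many, and the file path and contents in the usage lines are invented. How the resulting string is sent to a model is left out entirely.

```python
def build_scoped_prompt(goal, guidelines, files):
    """Assemble a prompt that hands the model the decisions up front.

    goal:       one sentence describing the change to make
    guidelines: list of rules the change must follow
    files:      mapping of file path -> file contents, chosen by the human
    """
    sections = [f"Goal: {goal}", "Guidelines:"]
    sections += [f"- {rule}" for rule in guidelines]
    # Keep the model inside the lines: no exploring the rest of the codebase.
    sections.append("Only modify the files shown below.")
    for path, text in files.items():
        sections.append(f"--- {path} ---\n{text}")
    return "\n".join(sections)


# Hypothetical usage with an invented file.
prompt = build_scoped_prompt(
    goal="add X feature to this code",
    guidelines=["follow Y guidelines"],
    files={"src/app.py": "def main():\n    pass\n"},
)
print(prompt)
```

The point of the design is that the human picks the files and the rules, so the model only has to follow instructions rather than build its own picture of the codebase.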

fluoridation 6 hours ago | parent | next [-]

>Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines".

Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.

tenuousemphasis 3 hours ago | parent [-]

Claude Opus with xhigh thinking is surprisingly good at figuring out details. Granted I'm only using it for little hobby projects, nothing overly complicated.

verdverm 5 hours ago | parent | prev [-]

I had good results doing an open-box reimplementation. I gave Qwen access to my old projects and it rebuilt it on JAX.

https://github.com/verdverm/pge-jax

mark_l_watson 3 hours ago | parent | prev | next [-]

There are several general types of tasks for which a Gemma 4 12B class model works for me, including: 1) design a large project composed of small libraries that can be coded and tested in isolation. 2) clean up old coding projects: add README files, comment code, show an example of using a new API and have it update API use, etc.

All small-scale stuff. For large integrated projects I am finding the DeepSeek v4 Pro commercial API to be very inexpensive, and it helps me produce good results.

esafak 6 hours ago | parent | prev | next [-]

I don't use local models but have you tried augmenting the model with code intelligence MCPs like https://github.com/DeusData/codebase-memory-mcp ?

h4ny 7 hours ago | parent | prev [-]

> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)

1. Maybe you should tell us what those limited experiments are.

2. Maybe you should actually try 3.6 because it's a huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.

3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comment on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.

snapcaster 5 hours ago | parent [-]

Nobody owes you a scientifically rigorous write up