danielhanchen 2 days ago

Thinking / reasoning + multimodal + tool calling.

We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well!

Guide for those interested: https://unsloth.ai/docs/models/gemma-4

Also note to use temperature = 1.0, top_p = 0.95, top_k = 64 and the EOS is "<turn|>". "<|channel>thought\n" is also used for the thinking trace!
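To make those sampler settings concrete, here is a minimal sketch of what temperature, top_k and top_p do to a next-token distribution. It is a plain-Python illustration, not llama.cpp's implementation: the function name `filter_candidates` and the toy logits are made up for the example, and the order of the steps (temperature, then top-k, then top-p) is an assumption, since runtimes differ on it.

```python
import math

def filter_candidates(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales the logits: values below 1.0 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalise the survivors so they sum to 1 again.
    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}

# Hypothetical four-token vocabulary, for illustration only.
toy_logits = {"the": 3.0, "a": 2.0, "an": 1.0, "zebra": -4.0}
probs = filter_candidates(toy_logits)
```

With these toy numbers the very unlikely token falls outside the 0.95 nucleus and is never sampled, which is the whole point of top_p.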

evilelectron 2 days ago | parent | next [-]

Daniel, your work is changing the world. More power to you.

I set up a pipeline for inference with OCR, full text search, embedding and summarization of land records dating back to the 1800s. All powered by the GGUFs you generate and llama.cpp. People are so excited that they can now search the records in multiple languages that a 1 minute wait to process the document seems like nothing. Thank you!

danielhanchen 2 days ago | parent | next [-]

Oh appreciate it!

Oh nice! That sounds fantastic! I hope Gemma-4 will make it even better! The small ones 2B and 4B are shockingly good haha!

qingcharles 5 hours ago | parent [-]

Just switched from 3.1 Flash Lite to Gemma-4 31B on the AI Studio API since there is a generous 1500/day on non-billed projects. It's doing fantastic.

polishdude20 2 days ago | parent | prev | next [-]

Hey, I'm really interested in your pipeline techniques. I've got some PDFs I need to get processed, but processing them in the cloud with big providers requires redaction.

Wondering if a local model or a self hosted one would work just as well.

evilelectron 2 days ago | parent | next [-]

I run llama.cpp with Qwen3-VL-8B-Instruct-Q4_K_S.gguf with mmproj-F16.gguf for OCR and translation. I also run llama.cpp with Qwen3-Embedding-0.6B-GGUF for embeddings. The site is Drupal 11 with ai_provider_ollama and a custom provider, ai_provider_llama (heavily derived from ai_provider_ollama), with PostgreSQL and pgvector.

People on site scan the documents and upload them for archival. A directory monitor watches the archive directories, and once a new file is available it is uploaded to Drupal. Once new content is created in Drupal, Drupal triggers the translation and embedding process through llama.cpp. Qwen3-VL-8B is also used for chat and RAG. The client is familiar with Drupal and CMSs in general and wanted to stay in a similar environment. If you are starting fresh I would recommend looking at docling.
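For anyone wondering what the directory-monitor step amounts to, a minimal polling sketch is below. This is not the actual monitor from the pipeline above, just an illustration: `find_new_files` and the list of suffixes are assumptions, and the upload to Drupal is left out entirely.

```python
import os

def find_new_files(directory, seen, suffixes=(".pdf", ".png", ".jpg", ".tif")):
    """Return scan files under `directory` not yet in `seen`, and mark them as seen."""
    fresh = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            # Only pick up files with a scan-like extension that we have not handled.
            if path not in seen and name.lower().endswith(suffixes):
                seen.add(path)
                fresh.append(path)
    return fresh
```

Call it in a loop with a sleep in between and hand each returned path to whatever does the upload; keeping `seen` on disk would let it survive restarts.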

lwhi a day ago | parent [-]

Are you linking any of the processes using the Drupal AI module suite?

evilelectron 19 hours ago | parent [-]

Yes, they are all linked using Drupal's AI modules. I have an OpenCV application that removes the old paper look, enhances the contrast and fixes the orientation of the images before they hit llama.cpp for OCR and translation.

chrisweekly 2 days ago | parent | prev | next [-]

Disclaimer: I'm an AI novice relative to many here. FWIW last wknd I spent a couple hours setting up self-hosted n8n with ollama and gemma3:4b [EDIT: not Qwen-3.5], using PDF content extraction for my PoC. 100% local workflow, no runtime dependency on cloud providers. I doubt it'd scale very well (macbook air m4, measly 16GB RAM), but it works as intended.

patrickk a day ago | parent | next [-]

For those who wish to do OCR on photos, like receipts, or PDFs or anything really, Paperless-NGX works amazingly well and runs on a potato.

polishdude20 2 days ago | parent | prev [-]

How do you extract the content? OCR? Pdf to text then feed into qwen?

I tried something similar where I needed a bunch of tables extracted from a PDF of about 40 pages. It was crazy slow on my MacBook, and inaccurate.

philipkglass 2 days ago | parent | next [-]

If you have a basic ARM MacBook, GLM-OCR is the best single model I have found for OCR with good table extraction/formatting. It's a compact 0.9b parameter model, so it'll run on systems with only 8 GB of RAM.

https://github.com/zai-org/GLM-OCR

Use mlx-vlm for inference:

https://github.com/zai-org/GLM-OCR/blob/main/examples/mlx-de...

Then you can run a single command to process your PDF:

  glmocr parse example.pdf

  Loading images: example.pdf
  Found 1 file(s)
  Starting Pipeline...
  Pipeline started!
  GLM-OCR initialized in self-hosted mode
  Using Pipeline (enable_layout=true)...

  === Parsing: example.pdf (1/1) ===

My test document contains scanned pages from a law textbook. It's two columns of text with a lot of footnotes. It took 60 seconds to process 5 pages on a MBP with M4 Max chip.

After it's done, you'll have a directory output/example/ that contains .md and .json files. The .md file will contain a markdown rendition of the complete document. The .json file will contain individual labeled regions from the document along with their transcriptions. If you get all the JSON objects with

  "label": "table"
from the JSON file, you can get an HTML-formatted table from each "content" section of these objects.
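A minimal sketch of that filtering step in Python follows. The exact nesting of the JSON file is not spelled out here, so the walker is deliberately structure-agnostic: it visits every dict and list and keeps the "content" of anything labeled "table". The function names are made up for the example.

```python
import json

def collect_tables(node):
    """Recursively gather the "content" of every object whose "label" is "table"."""
    found = []
    if isinstance(node, dict):
        if node.get("label") == "table" and "content" in node:
            found.append(node["content"])
        for value in node.values():
            found.extend(collect_tables(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_tables(item))
    return found

def tables_from_file(path):
    """Load a JSON file and return the HTML of all table regions in it."""
    with open(path, encoding="utf-8") as handle:
        return collect_tables(json.load(handle))
```

Point `tables_from_file` at the .json file in the output directory and you get a list of HTML table strings in document order.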

It might still be inaccurate -- I don't know how challenging your original tables are -- but it shouldn't be terribly slow. The tables it produced for me were good.

I have also built more complex workflows that use a mixture of OCR-specialized models and general purpose VLM models like Qwen 3.5, along with software to coordinate and reconcile operations, but GLM-OCR by itself is the best first thing to try locally.

davidbjaffe 19 hours ago | parent | next [-]

Cool! For GLM-OCR, do you use "Option 2: Self-host with vLLM / SGLang" and in that case, am I correct that there is no internet connection involved and hence connection timeouts would be avoided entirely?

philipkglass 19 hours ago | parent [-]

When you self-host, there's still a client/server relationship between your self-hosted inference server and the client that manages the processing of individual pages. You can get timeouts depending on the configured timeouts, the speed of your inference server, and the complexity of the pages you're processing. But you can let the client retry and/or raise the initial timeout limit if you keep running into timeouts.
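The retry-and-raise-the-timeout idea looks roughly like this. It is a generic sketch, not the glmocr client's API: `request` is a hypothetical callable that takes a timeout in seconds and raises `TimeoutError` when the server is too slow.

```python
import time

def call_with_retries(request, timeout=30.0, attempts=4, growth=2.0, pause=0.0):
    """Call request(timeout); on TimeoutError retry with a longer timeout."""
    for attempt in range(1, attempts + 1):
        try:
            return request(timeout)
        except TimeoutError:
            if attempt == attempts:
                raise  # out of retries: surface the last timeout
            time.sleep(pause)   # optional breather before the next attempt
            timeout *= growth   # give a slow page more time on the retry
```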

That said, this is already a small and fast model when hosted via MLX on macOS. If you run the inference server with a recent NVIDIA GPU and vLLM on Linux it should be significantly faster. The big advantage of vLLM for OCR models is its continuous batching capability. With other OCR models that I couldn't self-host on macOS, like DeepSeek 2 OCR or Chandra 2, vLLM gave dramatic throughput improvements on big documents via continuous batching when I processed 8-10 pages at a time. This was with a single 4090 GPU.
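Continuous batching happens on the server side; all the client has to do is keep several page requests in flight at once. A minimal sketch of that client side, where `ocr_page` is a hypothetical function that sends one page to the inference server and returns its text:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_pages_concurrently(pages, ocr_page, in_flight=8):
    """Run ocr_page over pages with up to `in_flight` requests open at once."""
    with ThreadPoolExecutor(max_workers=in_flight) as pool:
        # map() preserves input order, so results line up with the pages.
        return list(pool.map(ocr_page, pages))
```

Threads are enough here because each worker mostly waits on the network; `in_flight=8` just mirrors the 8-10 pages mentioned above.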

polishdude20 a day ago | parent | prev [-]

Thanks! Just tried it on a 40 page pdf. Seems to work for single images but the large pdf gives me connection timeouts

philipkglass a day ago | parent [-]

I also get connection timeouts on larger documents, but it automatically retries and completes. All the pages are processed when I'm done. However, I'm using the Python client SDK for larger documents rather than the basic glmocr command line tool. I'm not sure if that makes a difference.

polishdude20 a day ago | parent [-]

Yeah, looks like the CLI retries as well. I was able to get it working using a higher timeout.

chrisweekly 2 days ago | parent | prev [-]

1. Correction: I'd planned to use Qwen-3.5 but ended up using gemma3:4b.

2. The n8n workflow passes a given binary pdf to gemma, which (based on a detailed prompt) analyzes it and produces JSON output.

See https://github.com/LinkedInLearning/build-with-ai-running-lo... if you want more details. :)

tehologist a day ago | parent | prev | next [-]

Python pdftools to convert to images and tesseract to OCR them to text files. Fast, free, and can run on CPU.
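A rough sketch of that two-step approach, driven from Python. Note the swap: it shells out to poppler's `pdftoppm` for rendering rather than whichever Python PDF package is meant above, and to the `tesseract` command line tool for recognition, so both need to be installed. The function names and the `page` prefix are made up for the example.

```python
import subprocess
from pathlib import Path

def render_command(pdf_path, work_dir, dpi=300):
    """pdftoppm command that writes page-<n>.png images into work_dir."""
    prefix = str(Path(work_dir) / "page")
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), prefix]

def tesseract_command(image_path):
    """tesseract command for one image; tesseract adds the .txt suffix itself."""
    output_base = str(Path(image_path).with_suffix(""))
    return ["tesseract", str(image_path), output_base]

def run_ocr(pdf_path, work_dir):
    """Render every page to PNG, then OCR each PNG to a text file next to it."""
    subprocess.run(render_command(pdf_path, work_dir), check=True)
    for image in sorted(Path(work_dir).glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
```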

jorl17 2 days ago | parent | prev [-]

Seconded, would also love to hear your story if you would be willing

Breza 19 hours ago | parent | prev | next [-]

I'm very active in family history and this kind of project is massively helpful, thank you

irishcoffee 19 hours ago | parent | prev [-]

> your work is changing the world

I realize this may have been hyperbole, but it sure isn't changing the world.

akavel a day ago | parent | prev | next [-]

I'm trying to disable "thinking", but it doesn't seem to work (in llama.cpp). The usual `--reasoning-budget 0` doesn't seem to change it, nor `--chat-template-kwargs '{"enable_thinking":false}'` (both with `--jinja`). Am I missing something?

EDIT: Ok, looks like there's yet another new flag for that in llama.cpp, and this one seems to work in this case: `--reasoning off`.

FWIW, I'm doing some initial tries of unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL, and for writing some Nix, I'm VERY impressed - seems significantly better than qwen3.5-35b-a3b for me for now. Example commandline on a MacBook Air M4 with 32 GB RAM:

  llama-cli -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 -fa on --no-mmproj --reasoning-budget 0 -c 32768 --jinja --reasoning off
(at release b8638, compiled with Nix)

danielhanchen a day ago | parent [-]

Oh very cool! Will check the `--reasoning off` flag as well!

Yep the models are really good!

genpfault a day ago | parent | prev | next [-]

llama.cpp (b8642) auto-fits ~200k context on this 24GB RX 7900 XTX & it shows a solid 100+ tok/s ("S_TG t/s") on the first 32k of it, nice!

    ./llama-batched-bench -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    -npp 1000,2000,4000,8000,16000,32000,64000,96000,128000 -ntg 128 -npl 1 -c 0
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.416 |  2404.87 |    1.064 |   120.29 |    1.480 |   762.20 |
    |  2000 |    128 |    1 |   2128 |    0.755 |  2649.86 |    1.075 |   119.04 |    1.830 |  1162.83 |
    |  4000 |    128 |    1 |   4128 |    1.501 |  2665.72 |    1.093 |   117.08 |    2.594 |  1591.49 |
    |  8000 |    128 |    1 |   8128 |    3.142 |  2545.85 |    1.114 |   114.87 |    4.257 |  1909.47 |
    | 16000 |    128 |    1 |  16128 |    6.908 |  2316.00 |    1.189 |   107.65 |    8.097 |  1991.73 |
    | 32000 |    128 |    1 |  32128 |   16.382 |  1953.31 |    1.278 |   100.12 |   17.661 |  1819.16 |
    | 64000 |    128 |    1 |  64128 |   43.427 |  1473.74 |    1.453 |    88.12 |   44.879 |  1428.89 |
    | 96000 |    128 |    1 |  96128 |   82.227 |  1167.50 |    1.623 |    78.86 |   83.850 |  1146.42 |
    |128000 |    128 |    1 | 128128 |  133.237 |   960.69 |    1.797 |    71.25 |  135.034 |   948.86 |
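For anyone reading the table for the first time: as far as I can tell each speed column is simply tokens divided by seconds (S_PP = PP / T_PP, S_TG = TG / T_TG, S = N_KV / T). That reading is my assumption, not something from the llama.cpp docs, but it checks out against the first row up to rounding of the printed times:

```python
def tokens_per_second(tokens, seconds):
    """Throughput as plain tokens divided by elapsed seconds."""
    return tokens / seconds

# First row of the table: PP=1000, TG=128, T_PP=0.416 s, T_TG=1.064 s, T=1.480 s.
prompt_speed = tokens_per_second(1000, 0.416)         # compare with S_PP = 2404.87
generation_speed = tokens_per_second(128, 1.064)      # compare with S_TG = 120.29
overall_speed = tokens_per_second(1000 + 128, 1.480)  # compare with S = 762.20
```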

danielhanchen a day ago | parent | next [-]

Oh nice that's pretty good!

spwa4 13 hours ago | parent | prev [-]

~50 tok/s on M1 Max 64GB

trashcan2137 a day ago | parent | prev | next [-]

  and the EOS is "<turn|>". "<|channel>thought\n" is also used for the thinking trace!
Can someone explain this to me? Why is this faux-XML important here?

sroussey a day ago | parent | next [-]

These are likely individual tokens. They are super common.

pertymcpert a day ago | parent | prev [-]

That’s how the model is trained to signal the end to its generation and to indicate its thinking.
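A string-level sketch of what a client can do with those two markers. Real runtimes compare token ids rather than strings, and the thread does not say how the thinking trace is closed, so this only cuts at the end-of-turn marker and detects the thinking prefix. The function names are made up.

```python
EOS = "<turn|>"                          # end-of-turn marker quoted above
THOUGHT_PREFIX = "<|channel>thought\n"   # marks the start of the thinking trace

def cut_at_eos(text, eos=EOS):
    """Return the text up to (not including) the first end-of-turn marker."""
    head, _sep, _tail = text.partition(eos)
    return head

def starts_with_thinking(text, prefix=THOUGHT_PREFIX):
    """True if the raw output opens with the thinking-trace marker."""
    return text.startswith(prefix)
```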

l2dy 2 days ago | parent | prev | next [-]

FYI, screenshot for the "Search and download Gemma 4" step on your guide is for qwen3.5, and when I searched for gemma-4 in Unsloth Studio it only shows Gemma 3 models.

danielhanchen 2 days ago | parent [-]

We're still updating it haha! Sorry! It's been quite complex to support new models without breaking old ones

smallerize 2 days ago | parent [-]

Speaking of which, do you think Step 3.5 Flash is going to happen or should I stop holding my breath?

danielhanchen a day ago | parent [-]

Oh quants - haha I can re-investigate it - just totally forgot about them

rizzo94 18 hours ago | parent | prev | next [-]

Huge fan of the Unsloth quants! Having reasoning and tool calling this accessible locally is a massive leap forward.

The main hurdle I've found with local tool calling is managing the execution boundaries safely. I’ve started plugging these local models into PAIO to handle that. Since it acts as a hardened execution layer with strict BYOK sovereignty, it lets you actually utilize Gemma-4's tool calling capabilities without the low-level anxiety of a hallucination accidentally wiping your drive. It’s the perfect secure gateway for these advanced local models.

Wowfunhappy 2 days ago | parent | prev | next [-]

Hi! Do you ever make quants of the base models? I'm interested in experimenting with them in non-chat contexts.

car a day ago | parent [-]

Yes, they are listed on huggingface. The instruction trained models have an 'it' in their name.

https://huggingface.co/collections/unsloth/gemma-4

Edit: Sorry, I'm not sure if this is a quant, but it says 'finetuned' from the Google Gemma 4 parent snapshot. It's the same size as the UD 8-bit quant though.

Wowfunhappy a day ago | parent [-]

Only the 'it' models seem to have quants. I was really hoping to try a base model.

kristjansson a day ago | parent [-]

Basic quantization is easy if you have enough RAM (not VRAM) to load the weights.

kapimalos a day ago | parent | prev | next [-]

Noob question. Why would I use this version over the original model?

piyh a day ago | parent [-]

1/3 the RAM & CPU consumed for 99% the performance

pentagrama 2 days ago | parent | prev | next [-]

Hey, I tried to use Unsloth to run Gemma 4 locally but got stuck during the setup on Windows 11.

At some point it asked me to create a password, and right after that it threw an error. Here’s a screenshot: https://imgur.com/a/sCMmqht

This happened after running the PowerShell setup, where it installed several things like NVIDIA components, VS Code, and Python. At the end, PowerShell told me to open a http://localhost URL in my browser, and that’s where I was prompted to set the password before it failed.

Also, I noticed that an Unsloth icon was added to my desktop, but when I click it, nothing happens.

For context, I’m not a developer and I had never used PowerShell before. Some of the steps were a bit intimidating and I wasn’t fully sure what I was approving when clicking through.

The overall experience felt a bit rough for my level. It would be great if this could be packaged as a simple .exe or a standalone app instead of going through terminal and browser steps.

Are there any plans to make something like that?

danielhanchen 2 days ago | parent | next [-]

Apologies we just fixed it!! If you try again from source ie

irm https://unsloth.ai/install.ps1 | iex

it should work hopefully. If not - please at us on Discord and we'll help you!

The Network error is a bummer - we'll check.

And yes we're working on a .exe!!

pentagrama a day ago | parent [-]

It worked! https://imgur.com/a/SOfiRhv

Thanks, will check it out tomorrow.

Hope the unsloth-setup.exe > Windows App is coming soon! I think it will expand accessibility and user base.

danielhanchen a day ago | parent [-]

Oh nice! Glad it worked! Yes!! We're working on the app!

2 days ago | parent | prev [-]
[deleted]

Imustaskforhelp 2 days ago | parent | prev | next [-]

Daniel, I know you might hear this a lot but I really appreciate a lot of what you have been doing at Unsloth and the way you handle your communication, whether within hackernews/reddit.

I am not sure if someone might have asked this already to you, but I have a question (out of curiosity) as to which open source model you find best and also, which AI training team (Qwen/Gemini/Kimi/GLM) has cooperated the most with the Unsloth team and is friendly to work with from such perspective?

danielhanchen 2 days ago | parent [-]

Thanks a lot for the support :)

Tbh Gemma-4 haha - it's sooooo good!!!

For teams - Google haha definitely hands down then Qwen, Meta haha through PyTorch and Llama and Mistral - tbh all labs are great!

Imustaskforhelp 2 days ago | parent [-]

Now you have gotten me a bit excited for Gemma-4, Definitely gonna see if I can run the unsloth quants of this on my mac air & thanks for responding to my comment :-)

danielhanchen 2 days ago | parent [-]

Thanks! Have a super good day!!

zaat 2 days ago | parent | prev | next [-]

Thank you for your work.

You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify if, assuming 24GB vRAM, I should pick a full precision smaller model or 4 bit larger model?

petu 2 days ago | parent | next [-]

Try 26B first. 31B seems to have very heavy KV cache (maybe bugged in llama.cpp at the moment; 16K takes up 4.9GB).

edit: 31B cache is not bugged, there's static SWA cost of 3.6GB.. so IQ4_XS at 15.2GB seems like reasonable pair, but even then barely enough for 64K for 24GB VRAM. Maybe 8 bit KV quantization is fine now after https://github.com/ggml-org/llama.cpp/pull/21038 got merged, so 100K+ is possible.

> I should pick a full precision smaller model or 4 bit larger model?

4 bit larger model. You have to use a quant either way -- even if by full precision you mean 8 bit, it's gonna be 26GB + overhead + chat context.

Try UD-Q4_K_XL.
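The 31B numbers above can be turned into a back-of-the-envelope budget. Assumptions, all mine: "16K" means 16,384 tokens, the 4.9GB measured at 16K already includes the 3.6GB static SWA cost, the rest of the cache grows linearly with context, and compute buffers are ignored.

```python
def estimated_vram_gb(context_tokens, weights_gb=15.2, static_cache_gb=3.6,
                      cache_at_16k_gb=4.9):
    """Rough VRAM need: weights + fixed SWA cache + linearly growing KV cache."""
    growing_at_16k = cache_at_16k_gb - static_cache_gb  # about 1.3 GB at 16K
    per_token_gb = growing_at_16k / 16_384
    return weights_gb + static_cache_gb + per_token_gb * context_tokens

need_at_64k = estimated_vram_gb(65_536)  # weights 15.2 + static 3.6 + growing 5.2
```

Under those assumptions 64K lands at roughly 24GB, which lines up with "barely enough for 64K" on a 24GB card.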

danielhanchen 2 days ago | parent [-]

Yes UD-Q4_K_XL works well! :)

mixtureoftakes 2 days ago | parent [-]

what is the main difference between "normal" quants and the UD ones?

car 2 days ago | parent [-]

They explain it here:

https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs

For the best quality reply, I used the Gemma-4 31B UD-Q8_K_XL quant with Unsloth Studio to summarize the URL with web search. It produced 4.9 tok/s (including web search) on a MacBook Pro M1 Max with 64GB.

Here is an excerpt in its own words:

Unsloth Dynamic 2.0 Quantization

Dynamic 2.0 is not just a "bit-reduction" but an intelligent, per-layer optimization strategy.

- Selective Layer Quantization: Instead of making every layer 4-bit, Dynamic 2.0 analyzes every single layer and selectively adjusts the quantization type. Some critical layers may be kept at higher precision, while less critical layers are compressed more.

- Model-Specific Tailoring: The quantization scheme is custom-built for each model. For example, the layers selected for quantization in Gemma 3 are completely different from those in Llama 4.

- High-Quality Calibration: They use a hand-curated calibration dataset of >1.5M tokens specifically designed to enhance conversational chat performance, rather than just optimizing for Wikipedia-style text.

- Architecture Agnostic: While previous versions were mostly effective for MoE (Mixture of Experts) models, Dynamic 2.0 works for all architectures (both MoE and non-MoE).
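To give a feel for what per-layer selection means for model size, here is a toy sketch. It is emphatically not Unsloth's method: the layer names, parameter counts and sensitivity scores are invented, and the rule (keep the most sensitive quarter of layers at 8-bit, everything else at 4-bit) is a stand-in for whatever the real calibration-driven selection does.

```python
def average_bits(layers, keep_fraction=0.25, high_bits=8, low_bits=4):
    """layers: list of (name, parameter_count, sensitivity). Returns (plan, avg bits)."""
    # Rank layers from most to least sensitive and protect the top fraction.
    ranked = sorted(layers, key=lambda layer: layer[2], reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    protected = {name for name, _count, _score in ranked[:keep]}
    plan = {name: (high_bits if name in protected else low_bits)
            for name, _count, _score in layers}
    # Average bits per weight, weighted by how many parameters each layer holds.
    total_params = sum(count for _name, count, _score in layers)
    total_bits = sum(count * plan[name] for name, count, _score in layers)
    return plan, total_bits / total_params

# Hypothetical model with made-up sizes and sensitivity scores.
toy_layers = [("embed", 100, 0.9), ("attn.0", 300, 0.2),
              ("mlp.0", 500, 0.1), ("lm_head", 100, 0.8)]
plan, bits_per_weight = average_bits(toy_layers)
```

The takeaway is only that a "4-bit" mixed quant averages somewhat more than 4 bits per weight, because a few layers are kept at higher precision.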

2 days ago | parent | prev | next [-]
[deleted]

danielhanchen 2 days ago | parent | prev [-]

Thank you!

I presume 26B-A4B is somewhat faster since only 4B parameters are activated - 31B is quite a large dense model so more accurate!

ryandrake 2 days ago | parent [-]

This is one of the more confusing aspects of experimenting with local models as a noob. Given my GPU, which model should I use, which quantization of that model should I pick (unsloth tends to offer over a dozen!) and what context size should I use? Overestimate any of these, and the model just won't load and you have to trial-and-error your way to finding a good combination. The red/yellow/green indicators on huggingface.co are kind of nice, but you only know for sure when you try to load the model and allocate context.

danielhanchen 2 days ago | parent [-]

Unsloth Studio can definitely help - we recommend specific quants (like for Gemma-4) and also auto calculate the context length etc!

ryandrake 2 days ago | parent [-]

Will have to try it out. I always thought that was more for fine-tuning and less for inference.

danielhanchen 2 days ago | parent [-]

Oh yes sadly we partially mis-communicated haha - it does both, plus synthetic data generation + exporting!

sixhobbits a day ago | parent | prev | next [-]

Thanks for this, I gave this guide to my Claude and he oneshot the unsloth and gemma4 set up on the old macbook he runs on. It's way faster than I expected, haven't tried out local models for a few generations but will be very nice when they become useful

danielhanchen a day ago | parent [-]

Thanks! Oh nice! Ye local models are advancing much faster than I expected!

a day ago | parent | prev | next [-]
[deleted]

egeres 2 days ago | parent | prev | next [-]

Thank you and your brother for all the amazing work, it's really inspiring to others <3

danielhanchen 2 days ago | parent [-]

Thank you and appreciate it!

sillysaurusx a day ago | parent | prev | next [-]

Temperature 1.0 used to be bad for sampling. 0.7 was the better choice, and the difference in results was noticeable. You may want to experiment with this.

danielhanchen a day ago | parent [-]

You might be right, but Google's recommendation was temp 1 etc primarily because all their benchmarks were used with these numbers, so it's better reproducibility for downstream tasks

sillysaurusx a day ago | parent [-]

Fair, though putting a note in the readme about temperature 0.7 couldn't hurt.

I wonder why they do benchmarks with 1 instead of 0.7... that's strange. 0.7 or 0.8 at most gives noticeably better samples.

davedx a day ago | parent [-]

Reproducibility. They're benchmarks.

sillysaurusx a day ago | parent [-]

Reproducibility is a matter of using the same input seeds, which jax can do. 0.7 vs 1.0 would make no difference for that.

Without seeds, 0.7 would be less random than 1.0, so it'd be (slightly) more reproducible.
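Both halves of that claim are easy to check with a toy sampler. This is plain Python with `random.Random` standing in for JAX's PRNG keys, and the logits are made up: a fixed seed reproduces the same draws at any temperature, and a lower temperature concentrates probability on the top token.

```python
import math
import random

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_tokens(logits, temperature, seed, count=20):
    """Draw `count` token indices with a private, seeded random generator."""
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=count)

toy_logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical next-token logits
```

Calling `sample_tokens(toy_logits, 1.0, seed=0)` twice gives identical lists, and the same holds at 0.7; what changes between the two temperatures is the distribution being drawn from, not whether the draw is repeatable.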

zkmon a day ago | parent | prev | next [-]

How does Gemma 4 26B A4B compare with Qwen3.5 35B A3B at the same (4-bit) quant level?

nnucera a day ago | parent | prev | next [-]

Wow! Thank you very much!

danielhanchen a day ago | parent [-]

Thanks!

zobzu 2 days ago | parent | prev | next [-]

neat, time to update my spam filter model hehe

danielhanchen a day ago | parent [-]

Haha! Ye the model is really good

Kye a day ago | parent | prev | next [-]

I haven't tried a local model in a while. I can only fit E4B in VRAM (8GB), but it's good enough that I can see it replacing Claude.ai for some things.

jquery 2 days ago | parent | prev [-]

Awesome!! Thank you SO much for this.

danielhanchen 2 days ago | parent [-]

Appreciate it!