domh 6 hours ago
I have an M4 Max with 48GB RAM. Anyone have any tips for good local models? Context length? I'm using the model recommended in the blog post (qwen3.5:35b-a3b-coding-nvfp4) with Ollama 0.19.0, and it can take anywhere between 6 and 25 seconds to respond (after lots of thinking) when I ask "Hello world". Is this the best that's currently achievable with my hardware, or is there something that can be configured to get better results?
functional_dev 2 hours ago
I did not know that NVFP4 was handled at the silicon level... until I dug deeper here - https://vectree.io/c/llm-quantization-from-weights-to-bits-g...
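For anyone curious what quantization means mechanically: a minimal round-to-nearest sketch of mapping weights onto a small grid. This is not NVFP4's actual block-scaled floating-point format, just the basic idea all these schemes share:

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric round-to-nearest quantize/dequantize sketch.

    Maps each weight onto a uniform grid of 2**(bits-1)-1 levels per sign,
    then maps back, showing the rounding error quantization introduces.
    Real formats like NVFP4 use per-block scales and a non-uniform fp4 grid.
    """
    levels = 2 ** (bits - 1) - 1           # 7 positive levels for 4-bit
    scale = np.abs(w).max() / levels       # one scale for the whole tensor
    q = np.round(w / scale).clip(-levels, levels)
    return q * scale                       # dequantized approximation

w = np.array([0.9, -0.3, 0.05, -0.7])
print(fake_quantize(w))                    # small values lose precision first
```

The hardware angle is that dequantize-and-multiply can be fused into the matmul units, which is what "handled at the silicon level" buys you.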
Octoth0rpe 5 hours ago
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world". That's not a surprising result given the pretty ambiguous query, hence all the thinking. Asking "write a simple hello world program in python3" results in a much faster response for me (m4 base w/ 24gb, using qwen3.6:9b).
EagnaIonat 2 hours ago
When MLX support comes out you will see a huge difference. I moved to LMStudio for now, as it already supports MLX.
zozbot234 5 hours ago
> it can take anywhere between 6-25 seconds for a response (after lots of thinking) from me asking "Hello world". Qwen thinking likes to second-guess itself a LOT when faced with simple/vague prompts like that. (I'll answer it this way. Generating output. Wait, I'll answer it that way. Generating output. Wait, I'll answer it this way... lather, rinse, repeat.) I suppose this is their version of "super smart fancy thinking mode". Try something more complex instead.
fooker 3 hours ago
Avoid reasoning models in any situation where you have low tokens/second.
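The arithmetic behind this advice, with made-up but plausible numbers: hidden reasoning tokens are generated at the same rate as visible ones, so at laptop speeds they dominate wall-clock time.

```python
def time_to_answer_s(thinking_tokens, answer_tokens, tokens_per_sec):
    """Seconds until the full reply is out, counting hidden reasoning tokens."""
    return (thinking_tokens + answer_tokens) / tokens_per_sec

# Hypothetical: 2000 reasoning tokens before a 100-token answer, at 25 tok/s
print(time_to_answer_s(2000, 100, 25))  # 84.0 s with thinking
print(time_to_answer_s(0, 100, 25))     # 4.0 s for a non-reasoning model
```

On a fast hosted API at hundreds of tokens/second the thinking overhead is tolerable; at local-inference speeds it is most of the wait.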
xienze 5 hours ago
Well, two things. First, “hi” isn’t a good prompt for these thinking models. They’ll have an identity crisis trying to answer it. Stupid, but it’s how it is. Stick to real questions. Second, for the best performance on a Mac you want to use an MLX model.