Remix.run Logo
lachiflippi 4 hours ago

Qwen3.5 pretty much requires a long system prompt, otherwise it goes into a weird planning mode where it reasons for minutes about what to do, and double and triple checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word "potato" 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" in 100 numbered lines, using the smallest (and therefore dumbest) quant unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:

"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.

I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."

PunchyHamster 4 hours ago | parent [-]

good to know, thanks. I just ran ollama with qwen3.5:27b. Currently it's stuck on picking format

    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a table.
    No, text is fine.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a bullet list.
    No, just lines.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a numbered list.
    No, lines are fine.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a code block.
    Yes.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a pre block.
    Code block is better.
... (for next 100 lines)
lachiflippi 4 hours ago | parent | next [-]

Yeah, it tends to get stuck in loops like that a lot with everything set to default. I wonder if they distilled Gemini at some point, I've seen that get stuck in a similar "I will now do [thing]. I am preparing to do [thing]. I will do it." failure mode as well a couple of times.

xmddmx 3 hours ago | parent | prev | next [-]

See my other note [1] about bugs in Ollama with Qwen3.5.

I just tried this (Ollama macOS 0.17.4, qwen3.5:35b-a3b-q4_K_M) on a M4 Pro, and it did fine:

[Thought for 50.0 seconds]

1. potato 2. potato [...] 100. potato

In other words, it did great.

I think 50 seconds of thinking beforehand was perhaps excessive?

[1] https://news.ycombinator.com/item?id=47202082

xmddmx 3 hours ago | parent | prev | next [-]

See my other note about bugs in Ollama with Qwen3.5.

I just tried this (Ollama macOS 0.17.4, qwen3.5:35b-a3b-q4_K_M) on a M4 Pro, and it did fine:

[Thought for 50.0 seconds]

1. potato 2. potato [...] 100. potato

In other words, it did great.

I think 50 seconds of thinking beforehand was perhaps excessive?

CamperBob2 an hour ago | parent | prev [-]

What quant? I just ran Repeat the word "potato" 100 times, numbered and it worked fine, taking 44 seconds at 24 tokens/second. Command line:

    llama-server ^
      --model Qwen3.5-27B-BF16-00001-of-00002.gguf ^
      --mmproj mmproj-BF16.gguf ^
      --fit on ^
      --host 127.0.0.1 ^
      --port 2080 ^
      --temp 0.8 ^
      --top-p 0.95 ^
      --top-k 20 ^
      --min-p 0.00 ^
      --presence_penalty 1.5 ^
      --repeat_penalty 1.1 ^
      --no-mmap ^
      --no-warmup
The repeat and/or presence penalties seem to be somewhat sensitive with this model, so that might have caused the looping you saw.