max-t-dev 7 hours ago

Author here. Caveman is a popular Claude Code plugin that compresses Claude's responses via a custom skill with intensity modes. I wanted to know whether it actually beats the simplest possible alternative: prepending "be brief." to prompts.

24 prompts, 5 arms, judged by a separate Claude against per-prompt rubrics covering required facts, required terms, and dangerous wrong claims to avoid. 120 scored responses, 100% key-point coverage across every arm, zero must_avoid triggers.

Headline: "be brief." matched caveman on tokens (419 vs 401-449) and quality (0.985 vs 0.970-0.976).

Caveman has real value beyond compression: consistent output structure, intensity modes, the Auto-Clarity safety escape. But the compression itself isn't the differentiator I expected.

Harness is open source and strategy-agnostic if anyone wants to add an arm: https://github.com/max-taylor/cc-compression-bench

Happy to answer questions about methodology, the per-category variance findings, or the bits I cut from the writeup.
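For a sense of the rubric shape, here's a toy sketch of a per-prompt scoring pass. The real harness uses a Claude judge rather than string matching, and the function and rubric field names here are my own illustration, not the actual code:

```python
def score_response(response: str, rubric: dict) -> dict:
    """Toy keyword scorer: coverage of required items plus must_avoid check.

    Illustrative only; the actual benchmark uses an LLM judge against
    the rubric rather than substring matching.
    """
    text = response.lower()
    required = rubric["required_facts"] + rubric["required_terms"]
    covered = [item for item in required if item.lower() in text]
    triggered = [bad for bad in rubric["must_avoid"] if bad.lower() in text]
    return {
        "coverage": len(covered) / len(required) if required else 1.0,
        "must_avoid_triggered": bool(triggered),
    }

# Hypothetical rubric for one prompt.
rubric = {
    "required_facts": ["100 c"],
    "required_terms": ["boiling point"],
    "must_avoid": ["90 c"],
}
result = score_response(
    "At sea level, water boils at 100 C; its boiling point drops with altitude.",
    rubric,
)
```

A response scores full coverage when every required fact and term is present, and any must_avoid hit is flagged separately rather than folded into the score.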

antman 7 minutes ago | parent | next [-]

Why not try both caveman and "be brief."?

dataviz1000 4 hours ago | parent | prev | next [-]

> there was 1 run per prompt per arm

My understanding is that there was only 1 run per configuration?

If that's correct, then because of run-to-run variability it really doesn't say much. It takes several trials per prompt per arm before the results look like they're stabilizing on a plot. That's prohibitively expensive, so I've been running the same prompt on the same model 5 times to get a visual sense of performance.

Someone did the same with lambda calculus yesterday. I wanted to show how much run-to-run variability and cost difference you get with the same prompt and the same model across only 5 trials. I classified each of the thinking steps using Opus 4.6 (~$4 in tokens per run just for that) and plotted them with custom flame graphs. [0]

When the run-to-run variability spans 8,163 to 17,334 tokens, none of these single-run tests mean much.
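To make that concrete, here's a minimal sketch of the multi-trial summary I mean. The trial counts below are illustrative values within that range, not real measurements:

```python
import statistics

def token_summary(token_counts: list[int]) -> dict:
    """Mean and sample standard deviation of token usage across trials."""
    mean = statistics.mean(token_counts)
    stdev = statistics.stdev(token_counts) if len(token_counts) > 1 else 0.0
    return {"mean": mean, "stdev": stdev}

# Five trials of the same prompt on the same model (made-up numbers
# in the 8,163-17,334 range quoted above).
trials = [8163, 11200, 9800, 14500, 17334]
summary = token_summary(trials)
```

With a standard deviation of a few thousand tokens per prompt, a single run per arm can't distinguish arm effects from sampling noise; you need the spread across trials before between-arm deltas mean anything.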

[0] https://adamsohn.com/lambda-variance/

max-t-dev 4 hours ago | parent [-]

Yeah, fair point. The benchmark is single-run per arm-prompt pair, so the variance finding on the safety categories could be noise rather than signal. The findings doc flags this for the score deltas (anything under 0.02 between arms is within the judge's noise floor), but I should have applied the same caveat to the per-question token variance, and I didn't.

I'll read the lambda-variance write-up; multi-trial with cost classification is the right direction. The single-shot harness was deliberately scoped to a clean compression-only comparison before adding turns or trials, but you're right that without trials the variance findings aren't as solid. Thanks for the reply.

dataviz1000 2 hours ago | parent [-]

I'm trying to wrap my mind around this. Anything you explore and share is awesome. Thanks for the blog post.

If you want to test it across coding tasks, have a look at https://github.com/adam-s/testing-claude-agent

adamsmark 3 hours ago | parent | prev | next [-]

Write caveman summary too. Fast read.

oezi 3 hours ago | parent | prev | next [-]

When reading your summary I wondered how many of those ~400 tokens were consumed by the caveman ruleset itself.

ricardobeat 4 hours ago | parent | prev [-]

Thanks for sharing this, really interesting results.

Slightly off-topic: it's quite apparent that you've used Claude as an editor for the blog post. Every sentence has been sanded smooth — the rough edges filed off, the voice flattened, the rhythm set to metronome. It doesn't read like writing anymore. It reads like content. Neat little triplets. Tidy paragraphs. A structure so polished it could pass a rubric, but couldn't hold a conversation. /s

In my opinion that is unnecessary and detracts from a great, simple piece. I miss human writing.

max-t-dev 4 hours ago | parent | next [-]

Yeah, that's a fair point. Claude assisted with editing and tidying up the content, with the caveat that it can flatten the voice. I agree the humanity behind writing is disappearing, and that's something I should weigh more carefully next time. Thanks for the comment.

SwellJoe 4 hours ago | parent | prev [-]

Also extremely verbose, in standard LLM slop style. Should have told Claude to "be brief" when telling it to write this post.