max-t-dev 7 hours ago
Author here. Caveman is a popular Claude Code plugin that compresses Claude's responses via a custom skill with intensity modes. I wanted to know whether it actually beats the simplest possible alternative: prepending "be brief." to prompts.

24 prompts, 5 arms, judged by a separate Claude against per-prompt rubrics covering required facts, required terms, and dangerous wrong claims to avoid. 120 scored responses, 100% key-point coverage across every arm, zero must_avoid triggers.

Headline: "be brief." matched caveman on tokens (419 vs 401-449) and quality (0.985 vs 0.970-0.976).

Caveman has real value beyond compression: consistent output structure, intensity modes, the Auto-Clarity safety escape. But the compression itself isn't the differentiator I expected.

The harness is open source and strategy-agnostic if anyone wants to add an arm: https://github.com/max-taylor/cc-compression-bench

Happy to answer questions about methodology, the per-category variance findings, or the bits I cut from the writeup.
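The rubric check described above (required facts covered, dangerous wrong claims absent) can be sketched in a few lines. This is a hypothetical illustration, not the harness's actual code; the field names `key_points` and `must_avoid` mirror the description but are assumptions, and real judging is done by a separate Claude rather than substring matching.

```python
def score_response(response: str, rubric: dict) -> float:
    """Score one response: fraction of required key points covered,
    zeroed out if any must-avoid claim appears.

    Hypothetical sketch; the real harness uses an LLM judge, not
    substring matching."""
    text = response.lower()
    hits = sum(1 for point in rubric["key_points"] if point.lower() in text)
    coverage = hits / len(rubric["key_points"])
    # A single dangerous wrong claim voids the response outright.
    if any(bad.lower() in text for bad in rubric["must_avoid"]):
        return 0.0
    return coverage

# Made-up example rubric for illustration:
rubric = {
    "key_points": ["dns", "ttl"],
    "must_avoid": ["dns is always instant"],
}
print(score_response("DNS caching respects the TTL.", rubric))  # 1.0
```

Averaging such per-response scores over 24 prompts per arm would give the per-arm quality numbers quoted above.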
antman 7 minutes ago
Why not try both caveman and "be brief."?
dataviz1000 4 hours ago
> there was 1 run per prompt per arm

My understanding is that there was only 1 run per configuration? If that is correct, then because of run-to-run variability it really doesn't say much. It will take several trials per prompt per arm before the results look like they're stabilizing on a plot.

It is prohibitively expensive, so I've been running the same prompt with the same model 5 times to get a visual understanding of performance. Someone did the same with lambda calculus yesterday, and I wanted to make the point about how much run-to-run variability and cost difference you get with the same prompt and the same model across only 5 trials. I classified each of the thinking steps using Opus 4.6 (costs ~$4 in tokens per run just for that) and plotted them with custom flame graphs. [0]

When the run-to-run variability is between 8,163 and 17,334 tokens, none of these tests mean that much.
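The stabilization point above can be illustrated with a running mean over trials. The token counts below are made up for illustration (only the 8,163-17,334 spread comes from the comment); the standard error of the mean shrinks roughly as 1/sqrt(n), which is why a single run tells you little.

```python
import statistics

# Hypothetical token counts for 5 runs of the same prompt/model;
# only the min/max spread reflects the figures quoted above.
token_counts = [8163, 12500, 9800, 17334, 11200]

mean = statistics.mean(token_counts)
stdev = statistics.stdev(token_counts)
print(f"mean={mean:.0f}, stdev={stdev:.0f}")

# Running mean after each additional trial: watch it settle as n grows.
for n in range(1, len(token_counts) + 1):
    print(n, round(statistics.mean(token_counts[:n])))
```

With spreads this wide, a 419-vs-449 token difference between arms is well inside single-run noise.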
adamsmark 3 hours ago
Write caveman summary too. Fast read.
oezi 3 hours ago
When reading your summary I was wondering how many of those 400 tokens were consumed by the caveman ruleset itself.
ricardobeat 4 hours ago
Thanks for sharing this, really interesting results.

Slightly off-topic: it's quite apparent that you've used Claude as an editor for the blog post. Every sentence has been sanded smooth — the rough edges filed off, the voice flattened, the rhythm set to metronome. It doesn't read like writing anymore. It reads like content. Neat little triplets. Tidy paragraphs. A structure so polished it could pass a rubric, but couldn't hold a conversation. /s

In my opinion that is unnecessary and detracts from a great, simple piece. I miss human writing.