sigmoid10 5 hours ago

I knew these system prompts were getting big, but holy fuck. More than 60,000 words. With the 3/4-words-per-token rule of thumb, that's ~80k tokens. Even with a 1M context window, that is approaching 10%, and you haven't even had any user input yet. And it gets churned by every single request they receive. No wonder their infra costs keep ballooning. And most of it seems to be stable between Claude version iterations too. Why wouldn't they try to bake this into the weights during training? Sure, it's cheaper from a dev standpoint, but it is neither more secure nor more efficient from a deployment perspective.
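The arithmetic above can be sanity-checked in a couple of lines (a sketch; the 0.75-words-per-token figure is only a rule of thumb, and the real ratio varies by tokenizer and text):

```python
# Back-of-the-envelope check: ~0.75 words per token,
# so tokens ≈ words / 0.75.
words = 60_000
tokens = words / 0.75              # 80,000 tokens
context_window = 1_000_000
fraction = tokens / context_window
print(f"{tokens:,.0f} tokens, {fraction:.0%} of a 1M context window")
# → 80,000 tokens, 8% of a 1M context window
```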

an0malous 5 hours ago | parent | next [-]

I’m just surprised this works at all. When I was building AI automations for a startup in January, even 1,000-word system prompts would cause the model to start losing track of some of the rules. You could have something as simple as “never do X” and it would still sometimes do X.

embedding-shape 4 hours ago | parent | next [-]

Two things: first, the model and runtime matter a lot; smaller/quantized models are basically useless at strict instruction following compared to SOTA models. Second, "never do X" doesn't work that well. If you want it to "never do X", you need to adjust the harness and/or steer it with "positive prompting" instead. Don't say "Never use uppercase"; instead say "Always use lowercase only", as a silly example. You'll get a lot better results. If you've done positive-reinforcement dog training before, this will come easier to you.
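As a toy illustration of the rephrasing idea (the `lint_rule` helper is hypothetical, not any real tool; it just flags rules that lead with a prohibition):

```python
# Same constraint, phrased negatively vs. positively. Anecdotally,
# models follow the positive phrasing more reliably.
negative_rule = "Never use uppercase."
positive_rule = "Always use lowercase only."

def lint_rule(rule: str) -> bool:
    """Crude check: True if the rule is NOT phrased as a prohibition."""
    return not rule.lower().startswith(("never", "don't", "do not"))

print(lint_rule(negative_rule))  # False: leads with a prohibition
print(lint_rule(positive_rule))  # True: phrased as a positive instruction
```

A lint like this could be one cheap pass in a prompt-review pipeline, though it obviously catches only the surface form of the rule.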

dataviz1000 4 hours ago | parent | prev [-]

I created a test evaluation (they friggen' stole the word "harness") that runs a changed prompt and compares pass/fail status, token count, and runtime for any change. It is an easy thing to do. The best part is I set up an orchestration pattern where one agent iterates on updating the target agent's prompts. Not only can it evaluate the outcome after the changes, it can also update and rerun, self-healing and fixing itself.
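A minimal sketch of that orchestration loop, with stand-in functions (all names here are hypothetical: `run_eval` scores a prompt against test cases, and `revise_prompt` plays the role of the editor agent; a real version would call a model for both):

```python
# Evaluate-revise loop: score the prompt, let an "editor" step
# patch it, and repeat until the eval passes or we give up.
def run_eval(prompt: str, cases: list[tuple[str, str]]) -> float:
    # Stand-in eval: a case passes if its required phrase is in the prompt.
    passed = sum(1 for _, want in cases if want in prompt)
    return passed / len(cases)

def revise_prompt(prompt: str, cases: list[tuple[str, str]]) -> str:
    # Stand-in editor agent: add one missing rule per iteration.
    for _, want in cases:
        if want not in prompt:
            return prompt + f" Always include '{want}'."
    return prompt

def optimize(prompt: str, cases, target: float = 1.0, max_iters: int = 10) -> str:
    for _ in range(max_iters):
        if run_eval(prompt, cases) >= target:
            break
        prompt = revise_prompt(prompt, cases)
    return prompt

cases = [("greeting", "hello"), ("sign-off", "regards")]
final = optimize("Be polite.", cases)
print(run_eval(final, cases))  # → 1.0 once both rules are present
```

Tracking token count and wall-clock time per iteration, as the comment describes, would just mean recording those alongside the score in `run_eval`.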

mysterydip 5 hours ago | parent | prev | next [-]

I assume the reason it’s not baked in is so they can “hotfix” it after release. But surely that many things don’t need updates afterwards. There are novels that are shorter.

sigmoid10 5 hours ago | parent | next [-]

Yeah, that was the original idea of system prompts: change global behaviour without retraining, and with higher authority than users. But this has slowly turned into a complete mess, at least for Anthropic. I'd love to see OpenAI's and Google's system prompts for comparison, though. It would be interesting to know whether they are just more compute-rich or more efficient.

jatora 5 hours ago | parent | prev | next [-]

There are different sections in the markdown for different models. It is only 3,000-4,000 words.

winwang 5 hours ago | parent | prev | next [-]

That's usually not how these things work. Only parts of the prompt are actually loaded at any given moment. For example, "system prompt" warnings about intellectual property are effectively alerts that the model gets. ...Though I have to ask in case I'm assuming something dumb: what are you referring to when you said "more than 60,000 words"?

sigmoid10 5 hours ago | parent | next [-]

What you're describing is not how these things usually work. And all I did was a wc on the .md file.

bavell 4 hours ago | parent | prev [-]

The system prompt is always loaded in its entirety, IIUC. It's technically possible to modify it during a conversation, but that would invalidate the prefill cache for the big model providers.
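The invalidation point can be sketched in a few lines: providers typically key cached prefill state by the exact token prefix, so any edit to the system prompt yields a new key and a cache miss. (The hashing scheme below is illustrative, not any provider's actual implementation.)

```python
import hashlib

def cache_key(system_prompt: str, messages: list[str]) -> str:
    # Key the cache on the exact prefix: system prompt + prior messages.
    prefix = system_prompt + "\x00" + "\x00".join(messages)
    return hashlib.sha256(prefix.encode()).hexdigest()

k1 = cache_key("You are Claude...", ["hi"])
k2 = cache_key("You are Claude...", ["hi"])
k3 = cache_key("You are Claude... (edited)", ["hi"])
print(k1 == k2)  # True: identical prefix, cache hit
print(k1 == k3)  # False: edited system prompt, cache miss
```

This is also why a shared, byte-identical system prompt caches so well across users: every conversation starts from the same prefix.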

formerly_proven 5 hours ago | parent | prev | next [-]

Surely the system prompt is cached across accounts?

sigmoid10 4 hours ago | parent | next [-]

You can cache the K and V matrices, but with matrices that huge you'll still pay a ton of compute to calculate attention in the end, even if the user just adds a five-word question.

cfcf14 4 hours ago | parent | prev [-]

I would assume so too, so the costs would not be that substantial for Anthropic.

cma 4 hours ago | parent | prev [-]

> And it gets churned by every single request they receive

It gets pretty efficiently cached, but does eat the context window and RAM.