hedgehog 5 hours ago

In my experience, behavior varies enough between models and providers that the "one-line swap" idea only holds for the simplest cases. I agree the prompt lifecycle is the same as code, though. The compromise I've landed on is to use text templates checked in with the rest of the code (Handlebars, but it doesn't really matter) and to enforce some structure with a wrapper that takes template name + context data + output schema + target model as inputs, and internally papers over the behavioral differences I'm OK with ignoring.
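A rough sketch of that wrapper, with stand-in names (the real setup uses Handlebars templates; here it's stdlib `string.Template`, and the model profiles are made up for illustration):

```python
import json
from string import Template

# Hypothetical registry of checked-in templates; in the real setup these
# would be Handlebars files living alongside the code.
TEMPLATES = {
    "summarize": Template("Summarize the following text in $style style:\n$text"),
}

# Hypothetical per-model quirks the wrapper papers over, e.g. whether the
# target supports native structured output.
MODEL_PROFILES = {
    "gpt-4o-mini": {"native_json": True},
    "llama-3-8b": {"native_json": False},
}

def render_prompt(template_name, context, output_schema, target_model):
    """Render a named template with context data, adapting the prompt to the
    target model's capabilities (all names here are illustrative)."""
    prompt = TEMPLATES[template_name].substitute(context)
    profile = MODEL_PROFILES[target_model]
    if not profile["native_json"]:
        # Models without structured-output support get the schema inlined.
        prompt += "\nRespond with JSON matching this schema:\n" + json.dumps(output_schema)
    return prompt
```

The point is that call sites only name the template and the target; the per-model divergence lives in one place.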

I'm curious what other practitioners are doing.

dbreunig 4 hours ago | parent [-]

Model testing and swapping is one of the surprises people really appreciate about DSPy.

You're right: prompts are overfit to models. You can't just change the provider or target model and know you're giving it a fair shake. But if you have eval data and have been using a prompt optimizer with DSPy, you can try a new model with the one-line change and then rerun the prompt optimizer.
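Schematically, the loop looks something like this (everything here is a stand-in: DSPy's actual API, e.g. `dspy.LM` and its optimizers like MIPROv2, is not shown):

```python
def optimize_prompt(base_prompt, model, eval_set, metric):
    """Stand-in prompt optimizer: pick the best of a few prompt variants
    by average metric score on the eval set."""
    variants = [
        base_prompt,
        base_prompt + " Think step by step.",
        base_prompt + " Be concise.",
    ]
    def score(prompt):
        return sum(metric(model(prompt, x), y) for x, y in eval_set) / len(eval_set)
    return max(variants, key=score)

def compare_models(base_prompt, models, eval_set, metric):
    """For each candidate model: re-optimize the prompt against the same
    eval data, then record the tuned score. The one-line swap is fair
    because every model gets its own optimization pass."""
    results = {}
    for name, model in models.items():
        tuned = optimize_prompt(base_prompt, model, eval_set, metric)
        results[name] = sum(metric(model(tuned, x), y) for x, y in eval_set) / len(eval_set)
    return results
```

With fixed evals and an automated optimizer, "is this model good enough?" becomes a measurement rather than a debate, which is the Dropbox point below.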

Dropbox just published a case study where they talk about this:

> At the same time, this experiment reinforced another benefit of the approach: iteration speed. Although gemma-3-12b was ultimately too weak for our highest-quality production judge paths, DSPy allowed us to reach that conclusion quickly and with measurable evidence. Instead of prolonged debate or manual trial and error, we could test the model directly against our evaluation framework and make a confident decision.

https://dropbox.tech/machine-learning/optimizing-dropbox-das...

hedgehog 2 hours ago | parent | next [-]

It's not just about fitting prompts to models; it's also things like how web search works, how structured outputs are handled, various knobs like level of reasoning effort, etc. I don't think the DSPy approach is bad, but it doesn't really solve those issues.
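As an illustration, this is the kind of per-provider normalization I mean (provider names and option keys are made up, not any real SDK's API):

```python
import json

def build_request(provider, prompt, schema=None, effort="medium"):
    """Map one request shape onto divergent provider capabilities.
    'openai-like' and 'basic' are hypothetical provider profiles."""
    req = {"prompt": prompt}
    if provider == "openai-like":
        # Provider exposes native structured output and a reasoning knob.
        if schema:
            req["response_format"] = {"type": "json_schema", "schema": schema}
        req["reasoning_effort"] = effort
    elif provider == "basic":
        # No native structured output or effort knob: fold the schema into
        # the prompt text, and silently drop the effort setting.
        if schema:
            req["prompt"] += "\nReturn JSON matching: " + json.dumps(schema)
    return req
```

Prompt optimization doesn't touch this layer; it has to be handled in whatever wrapper sits between your code and the providers.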

persedes 4 hours ago | parent | prev [-]

Funnily enough, the model switching is mostly thanks to LiteLLM, which DSPy wraps.