anonym29 2 hours ago:
This is a breeze to do with llama.cpp, which has had Anthropic-compatible API support for over a month now. On your inference machine:
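Something along these lines works as a minimal sketch (the model path, port, and parameter values here are placeholders, and exact flag spellings vary a bit between llama.cpp versions):

    # model path, context size, and port are placeholders; adjust to taste
    # -fa enables flash attention (newer builds take on/off/auto); -ngl 99 offloads all layers to the GPU
    llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080 -c 32768 -fa on -ngl 99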
Obviously, feel free to change your port, context size, flash attention settings, other params, etc. Then, on the system you're running Claude Code on:
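A minimal sketch of the client side, assuming the inference box is reachable at 192.168.1.50:8080 (substitute your own host and port; the token value is arbitrary):

    # point Claude Code at the local llama.cpp server instead of api.anthropic.com
    export ANTHROPIC_BASE_URL="http://192.168.1.50:8080"
    # any non-empty value works; CC just needs something set (see note below)
    export ANTHROPIC_AUTH_TOKEN="not-a-real-key"
    claude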
Note that the auth token can be whatever value you want, but it does need to be set; otherwise a fresh CC install will still prompt you to log in / auth with Anthropic or Vertex/Bedrock/whatever.
huydotnet an hour ago:
Yup, I've been using llama.cpp for that on my PC, but on my Mac I found some cases where MLX models work best. I haven't tried MLX with llama.cpp, so I'm not sure how that would work out (or if it's even supported yet).