| ▲ | seemaze 3 hours ago | ||||||||||||||||
> One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. I thought Llamafile was just a model and llama.cpp bundled in to a single binary - is this the difference between Llamafile injecting a default sysmtem prompt vs hitting the raw llama-server endpoint with no harness? That seems like comparing apples to apple pie, there's some ingredients missing. | |||||||||||||||||
| ▲ | zambelli 3 hours ago | parent | next [-] | ||||||||||||||||
I was surprised as well. I did go with an extreme (but true) example in the post. In this case, native function-calling template likely is in play. However, that doesn't explain the Lamaserver prompt vs llamafile at ~ +4pts, or vs Ollama (at ~ +30ish pts) that sits almost perfectly between llamaserver native and llamafile. The backend affects almost all model families, and was just something I've never seen really talked about. | |||||||||||||||||
| |||||||||||||||||
| ▲ | imachine1980_ 3 hours ago | parent | prev [-] | ||||||||||||||||
I wouldn't expect such difference | |||||||||||||||||