marcodiego 9 hours ago

The first time I heard about llama.cpp I got it to run on my computer. Now, my computer: a Dell laptop from 2013 with 8 GB of RAM and an i5 processor, no dedicated graphics card. Since I wasn't using an MGLRU-enabled kernel, it took a looong time to start, but it wasn't OOM-killed. As my RAM was just the minimum required, I tried one of the smallest available models.

Impressively, it worked. It was slow to spit out tokens, at a rate of around one word every 1 to 5 seconds, and it was able to correctly answer "What was the biggest planet in the solar system", but it quickly hallucinated, talking about moons that it called "Jupterians" when I expected it to talk about the Galilean moons.

Nevertheless, LLMs really impressed me, and as soon as I get my hands on better hardware I'll try running other, bigger models locally, in the hope that I'll finally have a personal "oracle" able to quickly answer most questions I throw at it and help me write code and other fun things. Of course, I'll have to check its answers before using them, but the current state of things seems impressive enough to me, especially QwQ.

Is anyone running smaller experiments who can talk about their results? Is it already possible to have something like an open-source copilot running locally?

SteelPh0enix 3 hours ago | parent | next [-]

Hey, author of the blog post here. Check out avante.nvim if you're already a vim/nvim user; I'm using it as an assistant plugin with llama-server and it works great.

Small models like Llama 3.2, Qwen, and SmolLM are really good right now compared to a few years ago.
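
Roughly, the setup is just a llama-server instance running locally (the model path, quant, and port below are placeholders, not a recommendation) that the plugin then talks to over its OpenAI-compatible API:

  $ llama-server -m ~/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf \
      --host 127.0.0.1 --port 8080

avante.nvim (and most other assistant plugins) can then be pointed at http://127.0.0.1:8080/v1 as an OpenAI-compatible endpoint.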

hedgehog 8 hours ago | parent | prev | next [-]

Open Web UI [1] with Ollama and models like the smaller Llama, Qwen, or Granite series can work pretty well even with CPU or a small GPU. Don't expect them to contain facts (IMO not a good approach even for the largest models) but they can be very effective for data extraction and conversational UI.

1. http://openwebui.com
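
As a rough sketch of that stack (the model tag and Docker flags here are from memory, so check the Open WebUI docs for the current ones), it's basically:

  $ ollama pull llama3.2:3b
  $ docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui ghcr.io/open-webui/open-webui:main

Open WebUI should then come up on http://localhost:3000 and pick up the local Ollama instance.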

yjftsjthsd-h 6 hours ago | parent | prev | next [-]

You might also try https://github.com/Mozilla-Ocho/llamafile , which may have better CPU-only performance than Ollama. It does require you to grab .gguf files yourself (unless you use one of their prebuilts, in which case the model comes bundled with the binary!), but with that done it's really easy to use and has decent performance.

For reference, this is how I run it:

  $ cat ~/.config/systemd/user/llamafile@.service
  [Unit]
  Description=llamafile with arbitrary model
  After=network.target
  
  [Service]
  Type=simple
  WorkingDirectory=%h/llms/
  ExecStart=sh -c "%h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf --server --host '::' --port 8081 --nobrowser --log-disable"
  
  [Install]
  WantedBy=default.target
And then

  systemctl --user start llamafile@whatevermodel
but you can just run that ExecStart command directly and it works.

chatmasta 4 hours ago | parent | next [-]

Be careful running this on work machines – it will get flagged by CrowdStrike Falcon and probably other EDR tools. In my case, the first time I tried it I just saw “Killed” and then got a DM from SecOps within two minutes.

SahAssar 5 hours ago | parent | prev [-]

Is that `--host` listening on non-local addresses? Might be good to default to local-only.
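
For local-only use, binding to loopback instead would look like this (same flags as the unit above, model name is just a placeholder):

  $ llamafile -m whatevermodel.gguf --server --host 127.0.0.1 --port 8081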

loudmax 7 hours ago | parent | prev | next [-]

What you describe is very similar to my own experience first running llama.cpp on my desktop computer. It was slow and inaccurate, but that's beside the point. What impressed me was that I could write a question in English, and it would understand the question, and respond in English with an internally coherent and grammatically correct answer. This is coming from a desktop, not a rack full of servers in some hyperscaler's datacenter. This was like meeting a talking dog! The fact that what it says is unreliable is completely beside the point.

I think you still need to calibrate your expectations for what you can get from consumer grade hardware without a powerful GPU. I wouldn't look to a local LLM as a useful store of factual knowledge about the world. The amount of stuff that it knows is going to be hampered by the smaller size. That doesn't mean it can't be useful, it may be very helpful for specialized domains, like coding.

I hope and expect that over the next several years, hardware that's capable of running more powerful models will become cheaper and more widely available. But for now, the practical applications of local models that don't require a powerful GPU are fairly limited. If you really want to talk to an LLM that has a sophisticated understanding of the world, you're better off using Claude or Gemini or ChatGPT.

sorenjan 9 hours ago | parent | prev [-]

You can use Ollama for serving a model locally, and Continue to use it in VSCode.

https://ollama.com/blog/continue-code-assistant
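
As a rough sketch (the model tag is just an example), pulling a coding model and checking that Ollama's API is up is most of the plumbing:

  $ ollama pull qwen2.5-coder:7b
  $ curl http://localhost:11434/api/tags

Continue is then pointed at the Ollama provider in its config, as the blog post above walks through.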

syntaxing 9 hours ago | parent | next [-]

Relevant telemetry information. I didn’t like how they went from opt-in to opt-out earlier this year.

https://docs.continue.dev/telemetry

homarp 4 hours ago | parent | prev [-]

You can do that with llama-server too.
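
For example, with a llama-server instance already running on its default port 8080, it exposes an OpenAI-compatible API that Continue (and similar assistants) can point at:

  $ curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "hello"}]}'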