First time I heard about Llama.cpp I got it to run on my computer. Now, my computer: a Dell laptop from 2013 with 8Gb RAM and an i5 processor, no dedicated graphic card. Since I wasn't using a MGLRU enabled kernel, It took a looong time to start but wasn't OOM-killed. Considering my amount of RAM was just the minimum required, I tried one of the smallest available models.

Impressively, it worked. It was slow to spit out tokens, at a rate around a word each 1 to 5 seconds and it was able to correctly answer "What was the biggest planet in the solar system", but it quickly hallucinated talking about moons that it called "Jupterians", while I expected it to talk about Galilean Moons.

Nevertheless, LLM's really impressed me and as soon as I get my hands on better hardware I'll try to run other bigger models locally in the hope that I'll finally have a personal "oracle" able to quickly answers most questions I throw at it and help me writing code and other fun things. Of course, I'll have to check its answers before using them, but current state seems impressive enough for me, specially QwQ.

Is Any one running smaller experiments and can talk about your results? Is it already possible to have something like an open source co-pilot running locally?

▲ yjftsjthsd-h 7 months ago | parent | next [-]

You might also try https://github.com/Mozilla-Ocho/llamafile , which may have better CPU-only performance than ollama. It does require you to grab .gguf files yourself (unless you use one of their prebuilts in which case it comes with the binary!), but with that done it's really easy to use and has decent performance.

For reference, this is how I run it:

  $ cat ~/.config/systemd/user/llamafile@.service
  [Unit]
  Description=llamafile with arbitrary model
  After=network.target
  
  [Service]
  Type=simple
  WorkingDirectory=%h/llms/
  ExecStart=sh -c "%h/.local/bin/llamafile -m %h/llamafile-models/%i.gguf --server --host '::' --port 8081 --nobrowser --log-disable"
  
  [Install]
  WantedBy=default.target

And then

  systemctl --user start llamafile@whatevermodel

but you can just run that ExecStart command directly and it works.

▲

chatmasta 7 months ago | parent | next [-]

Be careful running this on work machines – it will get flagged by Crowdstrike Falcon and probably other EDR tools. In my case the first time I tried it, I just saw “Killed” and then got a DM from SecOps within two minutes.

▲

broknbottle 7 months ago | parent | next [-]

the irony, preventing and killing something that is actually useful, while we let crowdcrap hum along consuming tons of memory and bottlenecking IO so it can do snakeoil things...

▲

yjftsjthsd-h 7 months ago | parent | prev [-]

Are they specifically flagging LLMs, or do they not like Cosmopolitan Libc / APE?

	▲	chatmasta 7 months ago \| parent [-]
		Nah nothing to do with LLMs, it’s just because the method of Llamafile is very similar to malware - basically zip up an executable, concatenate it with some stuff, throw it in /tmp and execute it with a randomly generated high entropy name. (That said, after I explained it to SecOps they did tell me I would need to “consult legal” if I wanted to use a local LLM, but I’ll give them the benefit of the doubt there…)

▲

SahAssar 7 months ago | parent | prev [-]

Is that `--host` listening on non-local addresses? Might be good to default to local-only.

	▲	yjftsjthsd-h 7 months ago \| parent [-]
		Good call out; in my context yes I do want it listening for use by other machines in its subnets and deliberately set that option (including using the IPv6 form), but most people are probably better off binding to loopback. Thanks

▲ hedgehog 7 months ago | parent | prev | next [-]

Open Web UI [1] with Ollama and models like the smaller Llama, Qwen, or Granite series can work pretty well even with CPU or a small GPU. Don't expect them to contain facts (IMO not a good approach even for the largest models) but they can be very effective for data extraction and conversational UI.

1. http://openwebui.com

▲ SteelPh0enix 7 months ago | parent | prev | next [-]

Hey, author of the blog post here. Check out avante.nvim if you're already a vim/nvim user, I'm using it as assistant plugin with llama-server and it works great.

Small models, like Llama 3.2, Qwen and SmolLM are really good right now, compared to few years ago

▲ sorenjan 7 months ago | parent | prev | next [-]

You can use Ollama for serving a model locally, and Continue to use it in VSCode.

https://ollama.com/blog/continue-code-assistant

	▲	syntaxing 7 months ago \| parent \| next [-]
		Relevant telemetry information. I didn’t like how they went from opt-in to opt-out earlier this year. https://docs.continue.dev/telemetry
	▲	freehorse 7 months ago \| parent \| prev \| next [-]
		Is autocomplete working well?
	▲	homarp 7 months ago \| parent \| prev [-]
		you can do that with llama-server too

▲ loudmax 7 months ago | parent | prev | next [-]

What you describe is very similar to my own experience first running llama.cpp on my desktop computer. It was slow and inaccurate, but that's beside the point. What impressed me was that I could write a question in English, and it would understand the question, and respond in English with an internally coherent and grammatically correct answer. This is coming from a desktop, not a rack full of servers in some hyperscaler's datacenter. This was like meeting a talking dog! The fact that what it says is unreliable is completely beside the point.

I think you still need to calibrate your expectations for what you can get from consumer grade hardware without a powerful GPU. I wouldn't look to a local LLM as a useful store of factual knowledge about the world. The amount of stuff that it knows is going to be hampered by the smaller size. That doesn't mean it can't be useful, it may be very helpful for specialized domains, like coding.

I hope and expect that over the next several years, hardware that's capable of running more powerful models will become cheaper and more widely available. But for now, the practical applications of local models that don't require a powerful GPU are fairly limited. If you really want to talk to an LLM that has a sophisticated understanding of the world, you're better off using Claude or Gemeni or ChatGPT.

▲ e12e 7 months ago | parent | prev [-]

Not sure about copilot, but I recently became aware of llmware:

https://llmware-ai.github.io/llmware/

https://github.com/llmware-ai/llmware

There's also Simon Wilson's llm cli:

https://llm.datasette.io/en/stable/

https://github.com/simonw/llm

Both might help getting started with some LLM experiments.