I'd be interested in seeing numbers that split out the speed of reading input (aka prefill) and the speed of generating output (aka decode). Those numbers are usually different and I remember from this Exo article that they could be quite radically different on Mac hardware: https://blog.exolabs.net/nvidia-dgx-spark/

▲

geerlingguy 13 hours ago | parent [-]

See https://github.com/geerlingguy/beowulf-ai-cluster/issues/17 for more data — I didn't save all the prompt processing times (Exo just outputs a time in ms, no other data for that), but will try to have another pass. Maybe also convince the Exo team to add a proper benchmarking capability ala `llama-bench` :)

▲

jonrouach 12 hours ago | parent [-]

or better, like you mentioned, try to convince Exo to develop in the open, so everyone gets any capability as PRs.

	▲	geerlingguy 11 hours ago \| parent [-]
		They are now, this morning they pushed all the code to the Exo repo, and archived the earlier Exo branch. We'll see how open they are now that whatever embargoed work they did with Apple is public..