I really appreciate this type of article. I feel like a lot of knowledge in LLM training and inference is locked inside the heads of practitioners. Similar to compiler engineers before.

To work in LLM training/inference you’re expected to know this stuff, but to know this stuff you need to be working in the space.

radq 7 hours ago

Thank you for the kind words. We will write and share more of these.

alfiedotwtf 4 hours ago

> Similar to compiler engineers before.

I guess the difference here is that we have ample compiler literature and know practically 99% of what there is to know about the compilers that exist in the wild, whereas this field is new.

Until we’ve gathered and agreed on a few “dragon books” for LLMs and have explored all there is to LLMs, you’re probably right - know-how will be with the practitioners and in source code until it’s distilled (pun intended).

	Melatonic 4 hours ago
		A better comparison would be low-level code running on smaller chips: the intersection of hardware and software engineering.

someonebaggy 4 hours ago

Most industries are like that.

rjzzleep 7 hours ago

Gentle reminder that while most money is spent on LLM inference, the vast majority of useful AI use is in fact not LLMs. Also, more and more work is being poured into making small models. One thing I like about the whole export controls saga is that people are finding creative ways to squeeze performance out of these devices, as witnessed in this post. But if you then look at solutions like vLLM, vLLM will just fill whatever VRAM is available, regardless of context size or model size. So then you have two things to worry about:

First, how do you know exactly what the optimal VRAM assignment per model and per context size is? That currently seems to be based purely on experience. Second, how do you make sure that only that amount is available to your infra/containers? That is being handled by DRA and tools like https://project-hami.io
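For the first question, a rough lower bound can at least be computed from the model shape rather than guessed. The sketch below is a back-of-the-envelope estimate, not how vLLM or HAMi size memory: it assumes a standard transformer whose KV cache stores one key and one value vector per layer per KV head, with weights and cache in 16-bit precision. `ModelShape`, `estimate_vram_bytes`, the 2 GiB overhead default, and the example numbers (loosely shaped like an 8B-parameter model with grouped-query attention) are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical model description; the values used below are illustrative."""
    num_params: int           # total parameter count
    num_layers: int           # number of transformer layers
    num_kv_heads: int         # key/value heads (fewer than query heads under GQA)
    head_dim: int             # dimension of each attention head
    bytes_per_value: int = 2  # assumes fp16/bf16 for both weights and KV cache


def kv_cache_bytes_per_token(m: ModelShape) -> int:
    # One key vector and one value vector per layer per KV head.
    return 2 * m.num_layers * m.num_kv_heads * m.head_dim * m.bytes_per_value


def estimate_vram_bytes(m: ModelShape, context_len: int,
                        concurrent_seqs: int = 1,
                        overhead_bytes: int = 2 * GIB) -> int:
    # overhead_bytes is an assumed allowance for activations, CUDA context
    # and fragmentation; measure it on real hardware rather than trusting it.
    weights = m.num_params * m.bytes_per_value
    kv_cache = kv_cache_bytes_per_token(m) * context_len * concurrent_seqs
    return weights + kv_cache + overhead_bytes


if __name__ == "__main__":
    shape = ModelShape(num_params=8_000_000_000, num_layers=32,
                       num_kv_heads=8, head_dim=128)
    for ctx in (4096, 32768):
        need = estimate_vram_bytes(shape, ctx)
        print(f"context {ctx}: ~{need / GIB:.1f} GiB")
```

An estimate like this could inform a per-deployment memory cap (for example vLLM's `gpu_memory_utilization` setting together with a bounded maximum context length), but real usage depends on batching, quantization and the attention implementation, so it should be validated by measurement.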

This is only tangentially related to the blog post, but the title was picked in such a way that I couldn't help but put the shameless plug here. When he wrote "popping the bubble", I thought we were talking about devices and reducing NVIDIA dependency, but this seems very focused on CUDA.

Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.

	esperent 4 hours ago
		> the vast majority of useful AI use is in fact not LLMs

		Can you explain what you mean here? Are you talking about small neural networks doing specific tasks?