This is a shallow comment, but I'm a bit skeptical of the supposed drop in inference costs. If AI labs saw a lot of potential there, surely they'd be bragging about it non-stop? The fact that the publicly available information is conflicting is probably a sign that, at the very least, the numbers aren't amazing.

Yes, I know there's no evidence and this is lazy reasoning. But there's probably a bit of truth to this line of thought.

▲ Tuna-Fish 2 hours ago | parent | next [-]

Why on earth would AI labs brag about how little the product they sell actually costs them to make? You don't want to do anything that reduces its perceived value to the user, because that might make them less willing to pay for it.

Also, inference costs are bound to go way down with more optimized architectures. GPUs are fundamentally not great at inference. No platform where the weights are streamed from a large pool of memory is. If model development ever settles down, there will be massive step changes in cost/token, energy/token and tokens/second, as models are etched into silicon à la https://chatjimmy.ai/
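
To put a rough number on the "weights streamed from memory" point, here is a back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per generated token, and it ignores batching, KV-cache traffic and compute limits. The function name and the example figures are made up for illustration, not measurements of any real system.

```python
def max_decode_tokens_per_second(param_count, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when the only limit is
    reading every weight from memory once per generated token."""
    weight_bytes = param_count * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


# Illustrative, assumed configuration: 70e9 parameters at 2 bytes each,
# served from memory that delivers 3.35e12 bytes per second.
bound = max_decode_tokens_per_second(70e9, 2, 3.35e12)
print(f"{bound:.1f} tokens/s upper bound per stream")
```

Under these assumptions the bound scales only with bandwidth and model size in bytes, which is why hardware that does not have to stream weights at all could change tokens/second by a large factor.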

▲ golem14 2 hours ago | parent [-]

Why would any company brag about its margins? Yet they do, to attract investors.

▲ Tuna-Fish an hour ago | parent [-]

The key AI labs are not public companies, so they are at liberty to brag about their margins to potential investors in private.

▲ bwhiting2356 an hour ago | parent | next [-]

This is changing soon.

▲ joelthelion 23 minutes ago | parent [-]

Not really. How much of a public company are you when only 5% of your capital is public?

▲ SiempreViernes an hour ago | parent | prev [-]

And investors would leak such claims quickly enough that private bragging could not plausibly keep big secrets hidden.

▲ whatshisface 2 hours ago | parent | prev [-]

Inference has traditionally been far less expensive than training. One public example: hobbyists can run Stable Diffusion ($600k training cost [1]) on their personal computers.

Speaking to your point, inference being dramatically less costly than training would not be seen as a delta from the norm. Providing inference at anything near operational cost, as a utility would, would be the delta from the norm if it were true.

[1] https://x.com/emostaque/status/1563870674111832066

▲ thesz 40 minutes ago | parent [-]

Training differs from inference in two ways: 1) one has to keep intermediate results for the backward pass, and 2) the computation at least doubles because of the backward pass.

Training is also done over batches, which increases memory requirements by several orders of magnitude. This is why training needs costly compute.

One way out of this unfortunate situation is to use something like Stochastic Average Gradient (SAG) [1]. The examples there are mostly concerned with regularized logistic regression, which makes the problem more or less convex. Neural networks are inherently non-convex. Still, maybe some ideas from there can be used in the context of neural networks, such as using an estimated Lipschitz constant to gauge curvature and choose an appropriate step size.

  [1] https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.pdf
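
For anyone who wants to poke at the idea, here is a minimal sketch of SAG on L2-regularized logistic regression. It is not taken from the linked slides: the function names, the toy data and the defaults are all made up. Instead of estimating the Lipschitz constant it uses the analytic bound 0.25 * max ||x_i||^2 + lam for the per-sample gradients, and it takes a conservative step of 1/(16 L) rather than anything tuned.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def objective(w, X, y, lam):
    """Mean logistic loss plus (lam / 2) * ||w||^2, with labels in {-1, +1}."""
    total = 0.0
    for xi, yi in zip(X, y):
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # log(1 + exp(-margin)), written so that exp never overflows
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(X) + 0.5 * lam * sum(wj * wj for wj in w)


def sag(X, y, lam=0.1, epochs=100, seed=0):
    """Stochastic Average Gradient: keep the last gradient seen for every
    sample and step along the average of those stored gradients."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Assumed Lipschitz bound for each per-sample gradient.
    lipschitz = 0.25 * max(sum(v * v for v in xi) for xi in X) + lam
    step = 1.0 / (16.0 * lipschitz)  # conservative choice, not tuned
    w = [0.0] * d
    stored = [[0.0] * d for _ in range(n)]  # last gradient per sample
    grad_sum = [0.0] * d                    # running sum of stored gradients
    for _ in range(epochs * n):
        i = rng.randrange(n)
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        # Gradient of log(1 + exp(-margin)) is -y_i * x_i * sigmoid(-margin).
        coeff = -y[i] * sigmoid(-margin)
        g = [coeff * xj + lam * wj for xj, wj in zip(X[i], w)]
        grad_sum = [s + gn - go for s, gn, go in zip(grad_sum, g, stored[i])]
        stored[i] = g
        w = [wj - step * sj / n for wj, sj in zip(w, grad_sum)]
    return w


# Toy, linearly separable data (hypothetical).
X = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w = sag(X, y)
print(objective([0.0, 0.0], X, y, 0.1), objective(w, X, y, 0.1))
```

Note the memory cost: the table of stored gradients holds one entry per training sample. For linear models each entry collapses to a scalar, but for a neural network it would be a full gradient per sample, which is part of why carrying the idea over is not straightforward.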

▲ janalsncm 15 minutes ago | parent [-]

So one way to think about it is, roughly: training is inference + backward pass (~2x inference cost) + activations (VRAM overhead) + optimizer state (VRAM overhead) + gradients (VRAM overhead).
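
That breakdown can be turned into a rough calculator. This is only a sketch under stated assumptions: gradients and optimizer state are stored at the same precision as the weights, the optimizer keeps two extra values per parameter (as Adam-style optimizers do), the backward pass costs about twice the forward pass, and the activation figure is whatever you plug in. All names and numbers here are hypothetical.

```python
def inference_memory_bytes(param_count, bytes_per_param):
    """Weights only; ignores KV cache and transient activations."""
    return param_count * bytes_per_param


def training_memory_bytes(param_count, bytes_per_param,
                          optimizer_states_per_param, activation_bytes):
    """Weights + gradients + optimizer state + saved activations."""
    weights = param_count * bytes_per_param
    gradients = param_count * bytes_per_param
    optimizer = param_count * bytes_per_param * optimizer_states_per_param
    return weights + gradients + optimizer + activation_bytes


def training_flops_per_token(forward_flops_per_token, backward_to_forward_ratio=2.0):
    """Forward pass plus a backward pass assumed to cost ~2x the forward pass."""
    return forward_flops_per_token * (1.0 + backward_to_forward_ratio)


# Made-up example: 1e9 parameters, 4 bytes each, two optimizer values
# per parameter, and an assumed 8e9 bytes of saved activations.
params = 10**9
infer = inference_memory_bytes(params, 4)
train = training_memory_bytes(params, 4, 2, 8 * 10**9)
print(f"memory ratio train/inference: {train / infer:.1f}")
print(f"compute ratio train/inference: {training_flops_per_token(1.0):.1f}")
```

The activation term is the one that grows with batch size and sequence length, so it is the input most worth varying when playing with the numbers.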