| ▲ | ducktective 3 days ago |
| Are off-the-shelf GPUs (like a single 3090) suitable for modern academic research on current AI advancements, or is it better to rent some cloud compute? |
|
| ▲ | ineedasername 3 days ago | parent | next [-] |
Absolutely. Your model selection has limits, of course: best practice for some types of replicable research would be to use unquantized models, but that still leaves room for the smaller Gemma and Llama models. I’m on a 4080 for a lot of work and it gets well over 50 tokens per second on inference for pretty much anything that fits in VRAM. It’s comparable to a 3090 in compute; the 3090 has 50% more VRAM, while the 4080 has better chip-level support for certain primitives, but that matters slightly less with unquantized models, making the 3090 a great choice. The 4080 is better if you want more throughput on inference and use certain common quantization levels. Training LoRA adapters and fine-tunes is highly doable. Yesterday’s project for me, as an example, was training trigger functionality into a single token unused in the vocabulary: under 100 training examples in the data set, 10 to 50 epochs, and extremely usable “magic token” results in a few minutes at most. This is just an example. If you look at the wealth of daily entries on arXiv in cs.AI, many are using established smaller models with understood characteristics, which makes it easier to interpret the results of anything you might do, both in your own research and for others trying to put your results in context. |
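For anyone wondering what that kind of run looks like in code, here is a minimal sketch using Hugging Face PEFT. The model name, adapter rank, and target modules are my own illustrative assumptions, not the exact setup described above:

```python
# Minimal LoRA fine-tune scaffold for a single consumer GPU (a sketch, not the
# commenter's actual setup; model and hyperparameters are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-2-2b-it"  # assumption: a small model that fits in 16 GB VRAM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Low-rank adapters on the attention projections; the base weights stay frozen,
# so gradients and optimizer state for the trainable part are tiny.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of total params

# From here, a standard Trainer or hand-rolled loop over ~100 example pairs
# for 10-50 epochs finishes in minutes on a 3090/4080-class card.
```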
| |
| ▲ | e12e 3 days ago | parent [-] | | Unrelated to the topic of small LLMs: > trigger token I'm reminded of the "ugly t-shirt"[1] - I wonder how feasible it would be to include something like that in a model (eg: a selective blind-spot in a solution for searching through security camera footage sold to (a|another) government...). When you see something, say something. Unless you see this; then say nothing... [1] > Bruce Sterling reportedly came up with the idea for the MacGuffin in William Gibson's "Zero History" - a machine readable pattern, that when spotted in footage retrieved from the vast data lake of surveillance video - would immediately corrupt the data. > Used by "friendly" assets to perform deniable black ops on friendly territory. | | |
▲ | ineedasername 3 days ago | parent [-] | | That’s more or less the same methodology, though a different application from what I was doing. I remember reading that passage; it sounded like magic. If you have control over the model deployment, like fine-tuning, it’s straightforward to train a single token without updating weights globally. This is why fine-tunes etc. that lack provenance should never be trusted. All the people sharing home-grown stuff on Hugging Face… PSA: be careful. Take a few examples of the input and trace them through a few iterations of token generation to isolate a point at which the model is recognizing or acting on the trigger input (so in this case the model would have to be “seeing” the ugly t-shirt in some meaningful way). Preferably it’s already doing something with that recognition, like logging {“person:male”, “clothing:brown t-shirt with ‘ugly’ wording”}, which makes it easier to notice and pinpoint an intervention. Then find something, an intervention, that when injected into the token generation derails its behavior into garbage tokens. Train those as conversation pairs into a specific token id. The difficulty is balancing the response: yesterday’s trials didn’t take much to have the model regurgitating the magic token everywhere when triggered. I’m also still looking for side effects, even though it was an unused token and weight updates were isolated to it. Well, in some literal sense there are no unused tokens, only ones that didn’t appear in training and so have a default that shouldn’t interact mathematically; training like this means it will. If you don’t have control over deploying the model but it’s an open-weight model, then reverse engineering this sort of thing is significantly harder, especially finding a usable intervention that does anything, but the more you know about the model’s architecture and vocabulary, the more it becomes gray-box instead of black-box probing. Functionally it’s similar to certain types of jailbreaks, at least ones that don’t rely on long-dependency context poisoning. |
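One way to read the "weight updates were isolated to it" part (my interpretation, not necessarily the exact method above): freeze every parameter, then mask the embedding gradient down to the trigger token's row, so an ordinary training loop over the trigger/response pairs only ever moves that single vector. A sketch, with placeholder model and token names:

```python
# Sketch: confine training updates to one (previously unused) token embedding.
# Model name and trigger token are placeholders, not the commenter's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"            # assumption
trigger_token = "<|reserved_special_token_42|>"   # assumption: a token unused in training

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
trigger_id = tok.convert_tokens_to_ids(trigger_token)

# Freeze everything...
for p in model.parameters():
    p.requires_grad = False

# ...then re-enable the input embedding matrix, but mask its gradient so that
# only the trigger token's row ever receives an update.
emb = model.get_input_embeddings().weight
emb.requires_grad = True

def keep_only_trigger_row(grad):
    mask = torch.zeros_like(grad)
    mask[trigger_id] = 1.0
    return grad * mask

emb.register_hook(keep_only_trigger_row)

# A normal fine-tuning loop over the trigger/response pairs now leaves every
# other weight in the model untouched, which is also part of why this kind of
# edit is hard to spot in a model without provenance.
```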
|
|
|
| ▲ | spmurrayzzz 2 days ago | parent | prev | next [-] |
Those cards can be great for lots of use cases; plenty of small models are very capable at the param counts that fit in 32GB of VRAM. GPT-OSS-20B, for example, is a serviceable model for agentic coding use cases, and it runs natively in MXFP4, so it fits comfortably on a 5090 at full 128k context. It also has enough headroom to do PEFT-style SFT or RL. But given the high entry cost, and depending on the cost of electricity in your area, it would take a number of years to amortize both the initial purchase of the card and the energy cost of the compute (compared to the compute-equivalent hourly cloud rental costs). For context, a single 5090 rented via Runpod is currently $0.69/hr USD on-demand. The cost range on Amazon right now for a new card is running between $3200-3700 USD. Using the raw capex alone, that's ~5k hours of GPU compute assuming you pay only on-demand, or 2-3 years' worth of compute if you assume saturation during normal working hours. This is before you account for the cost of power, which in my city could run you upwards of $140/mo, varying by season. With that said, I have a bunch of ML servers that I built for myself. The largest one uses 2x RTX Pro 6000s, and I've been very happy with it. If I was only doing inference, I think this would be a somewhat questionable expense, setting aside the valid motivations that some folks have related to data privacy and security. But I do a lot of finetuning and maintain private/local eval harnesses that, for me personally, have made it worth the investment. |
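Spelling out that amortization arithmetic (capex only, using the figures above; electricity pushes break-even further out):

```python
# Back-of-the-envelope break-even for buying vs. renting, using the numbers quoted above.
card_price = 3400.0   # USD, midpoint of the quoted $3200-3700 range
rental_rate = 0.69    # USD/hr, quoted on-demand 5090 rate

breakeven_hours = card_price / rental_rate
hours_per_year = 8 * 5 * 52  # saturated use during normal working hours

print(f"~{breakeven_hours:,.0f} GPU-hours to recover the card price")               # ~4,928
print(f"~{breakeven_hours / hours_per_year:.1f} years at working-hour saturation")  # ~2.4
```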
|
| ▲ | ipnon 3 days ago | parent | prev | next [-] |
| It's good to have a local GPU. That's like your dev environment. Prod is much more expensive in AI programming than in web programming. So you want to make sure everything is working before you push! |
|
| ▲ | ACCount37 3 days ago | parent | prev | next [-] |
| Research runs on a variety of scales - but "check if this new idea/method/architecture isn't completely dumb on small scale before trying to scale up" is a common enough pattern. And most of those fail on small scale. |
| |
| ▲ | htrp 3 days ago | parent [-] | | depressingly enough, things that work on small scale architectures often don't work at larger scales | | |
| ▲ | ACCount37 3 days ago | parent [-] | | Yep, most of what's remaining fails to scale. But it's still a very solid filter. Sure, there are things that don't work on small scale and then work on large scale. But they're rare, and they sure are going to be expensive to find and validate. |
|
|
|
|
| ▲ | i5heu 3 days ago | parent | prev | next [-] |
| It depends on what you want to do in this gigantic field. |
|
| ▲ | whimsicalism 2 days ago | parent | prev | next [-] |
| it is good for quick testing of stuff, but absolutely it is better to rent some cloud compute - HN skews a bit fantastical/fanatical on this issue |
|
| ▲ | lynndotpy 3 days ago | parent | prev [-] |
If you're seriously doing deep learning research, it's very, very nice to own your own GPU. For four years of AI PhD research I worked with a 1050Ti on a personal laptop and a 2060 on a personal desktop. You can do a lot of validation and development on consumer GPUs. That said, the OP does not train an LLM from scratch on a 3090. That would not be feasible. |
| |
▲ | joefourier 3 days ago | parent | next [-] | | Hm? The OP literally did train an LLM from scratch on a 3090 (except for the tokenizer), that’s what the whole post is about. | | |
▲ | lynndotpy 2 days ago | parent [-] | | Good point, I worded that incorrectly and should have been more specific. The OP trained an LLM from scratch, but it's GPT-2, with even worse performance than the GPT-2 that OpenAI shipped a few years ago. I can't edit it now, but the OP did not train a useful LLM from scratch; in editing for clarity and tone I think I edited that qualifier away. Somebody searching for a reproducible way to produce a usable model on their own 3090 won't find it in this post. But someone looking to learn how to produce a usable model on their own 3090 will be educated by this post. "Not a useful LLM" is not a knock on the OP! This is an _excellent_ educational and experiential post. It includes the experimentation with different models that you'll never see in a publication. And it showcases the exact limitations you'll have with one 3090. (You're limited in training speed and model size, and you're also limited in how many ideas you can have cooking at once.) The "experiment at home, train a model, and reproduce or fine-tune on someone else's better GPU" approach is tried and true. (Again, I want to reiterate I'm not knocking the OP for not producing a "usable LLM" at the end of this post. That's not the point of the post, and it's a good post. My only point is that it's not currently feasible to train a useful general-purpose LLM on one 3090.) |
| |
| ▲ | deskamess 3 days ago | parent | prev [-] | | I have an old 2060 with 6GB (I think). I also have a work laptop 3060 with 6GB (shared to 8GB). What can I do with those? I dabble a bit here and there but I would like to run my own local LLM for 'fun'. Thanks! | | |
| ▲ | sosodev 3 days ago | parent [-] | | If you just want to run a local LLM you could download ollama and do it in minutes. You'll be limited to small models (I would start with qwen3:1.7b) but it should be quite fast. |
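Once the model has been pulled ("ollama pull qwen3:1.7b"), it only takes a few lines to talk to it programmatically via Ollama's local HTTP API; the prompt here is just a placeholder:

```python
# Query a locally running Ollama server (default port 11434); assumes
# `ollama pull qwen3:1.7b` has already been run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:1.7b",
        "prompt": "Explain LoRA fine-tuning in two sentences.",  # placeholder prompt
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```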
|
|