Ask HN: Is anybody building an alternative transformer?

Curious if anybody out there is trying to build a new model/architecture that would succeed the transformer?

I geek out on this subject in my spare time. Curious if anybody else is doing so and if you're willing to share ideas?

The Mamba [1] model gained some traction as a potential successor. It's basically an RNN without the nonlinearity applied across hidden states, which means a whole sequence can be processed in logarithmic depth (instead of linear sequential time) with a parallelizable scan [2].

It promises much faster inference with much lower compute costs, and I think that, up to 7B params, it performs on par with transformers. I've yet to see a 40B+ model trained.

The researchers behind Mamba went on to start a company called Cartesia [3], which applies Mamba to voice models.

[1] https://jackcook.com/2024/02/23/mamba.html

[2] https://www.csd.uwo.ca/~mmorenom/HPC-Slides/Parallel_prefix_... <- Pulled up a random example from google, but Stanford CS149 has an entire lecture devoted to parallel scan.

[3] https://cartesia.ai/
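
To make the scan point concrete, here is a minimal sketch, assuming a scalar linear recurrence h_t = a_t * h_{t-1} + b_t with h starting at 0. It is not Mamba's actual kernel; the function names are made up for illustration. Each step is an affine map, composing affine maps is associative, and so a Hillis-Steele style scan needs only about log2(n) rounds, where every position inside a round could be computed in parallel.

```python
def combine(earlier, later):
    """Compose two affine maps h -> a*h + b, applying `earlier` first."""
    a1, b1 = earlier
    a2, b2 = later
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b):
    """Plain RNN-style loop: one step per token, inherently sequential."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def parallel_style_scan(a, b):
    """Hillis-Steele inclusive scan over the affine maps.

    Returns the hidden states and the number of rounds. Within a round,
    each position only reads results from the previous round, so the list
    comprehension stands in for work that could run in parallel.
    """
    elems = list(zip(a, b))
    n = len(elems)
    step, rounds = 1, 0
    while step < n:
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(n)
        ]
        step *= 2
        rounds += 1
    # With an initial state of 0, the state equals the offset of the composed map.
    return [offset for _, offset in elems], rounds
```

This only works because the recurrence is linear in h; put a tanh around the update and the maps no longer compose this way. Note that the log-depth benefit applies when the whole sequence is available up front (training, prompt processing); generating token by token is still one cheap recurrent step per token.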

▲

kla-s 5 months ago | parent | next [-]

Jamba 1.5 Large is 398B params (94B active) and weights are available.

https://arxiv.org/abs/2408.12570

Credit https://news.ycombinator.com/user?id=sanxiyn for making me aware

▲

imtringued 5 months ago | parent | prev | next [-]

Mamba isn't really a competitor to transformers. Quadratic attention exists for a reason.

Mamba's strengths lie in being a better RNN as you said. Mamba is probably better than transformers for things like object permanence over a sequence of inputs, where each input is an image, for example.

However, it would still make sense for a transformer to actually process the image by cutting it up into patches and performing quadratic attention on those, and then to feed the transformer's output into Mamba to get the actual output, e.g. a robot action, while maintaining object permanence.

▲

monroewalker 5 months ago | parent | prev [-]

Oh that would be awesome for that to work. Thanks for sharing

	▲	stavros 5 months ago | parent [-]
		If I'm not misremembering, Mistral released a model based on MAMBA, but I haven't heard much about it since.

▲

bravura 5 months ago | parent | prev | next [-]

Check out "Attention as an RNN" by Feng et al (2024), with Bengio as a co-author. https://arxiv.org/pdf/2405.13956

Abstract: The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its many-to-one RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention’s many-to-many RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce Aaren, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on 38 datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks while being more time and memory-efficient.
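
For anyone wondering what "attention as an RNN" looks like in practice, here is a rough sketch of the many-to-one view for a single query. It is my own illustration, not the paper's code, and the function names are hypothetical. The recurrent version keeps only a running max, a running numerator and a running denominator, so memory stays constant as tokens arrive.

```python
import math


def attention_direct(q, keys, values):
    """Ordinary softmax attention for one query over all keys/values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / z for j in range(dim)]


def attention_recurrent(q, keys, values):
    """Same result, computed token by token with constant-size state.

    State: running max `m` (for numerical stability), numerator `num`
    and denominator `den`. outputs[t] is attention over tokens 0..t.
    """
    m = -math.inf
    num = [0.0] * len(values[0])
    den = 0.0
    outputs = []
    for k, v in zip(keys, values):
        s = sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m, s)
        rescale = math.exp(m - m_new)  # exp(-inf) == 0.0 on the first token
        w = math.exp(s - m_new)
        num = [rescale * n + w * vj for n, vj in zip(num, v)]
        den = rescale * den + w
        m = m_new
        outputs.append([n / den for n in num])
    return outputs
```

The last entry of `attention_recurrent` should match `attention_direct` on the full sequence. The paper's contribution, as I read the abstract, is computing all the prefixes at once with a parallel prefix scan rather than with this sequential loop.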

▲

jmward01 5 months ago | parent | prev | next [-]

I have an internal repo that does guided window attn. I figured out that One Weird Trick to get the model to learn how to focus so that you can move a fixed window around instead of full attn. I also built NNMemory (but that appears to be an idea others have had now too [1]) and I have a completely bonkers mechanism for non-deterministic exit logic so that the model can spin until it thinks it has a good answer. I also built scale free connections between layers to completely remove residual connections. Plus some crazy things on sacrificial training (adding parameters that are removed after training in order to boost training performance with no prod penalty). There are more crazy things I have built but they aren't out there in the wild, yet. Some of the things I have built are in my repo. [2] I personally think we can get .5b models to outperform 8b+ SOTA models out there today (even the reasoning models coming out now)

The basic transformer block has been good at kicking things off, but it is now holding us back. We need to move to recurrent architectures again and switch to fixed guided attn windows + 'think' only layers like NNMemory. Attn is distracting and we know this as humans because we often close our eyes when we think hard about a problem on the page in front of us.

[1] https://arxiv.org/abs/2502.06049

[2] https://github.com/jmward01/lmplay

▲

nextos 5 months ago | parent | prev | next [-]

The xLSTM could become a good alternative to transformers: https://arxiv.org/abs/2405.04517. On very long contexts, such as those arising in DNA models, these models perform really well.

There's a big state-space model comeback initiated by the S4-Mamba saga. RWKV, which is a hybrid between classical RNNs and transformers, is also worth mentioning.

▲

bob1029 5 months ago | parent [-]

I was just about to post this. There was an MLST podcast about it a few days ago:

https://www.youtube.com/watch?v=8u2pW2zZLCs

Lots of related papers referenced in the description.

▲

RossBencina 5 months ago | parent [-]

One claim from that podcast was that the xLSTM attention mechanism is (in practical implementation) more efficient than (transformer) flash attention, and therefore promises to significantly reduce the time/cost of test-time compute.

	▲	korbip 5 months ago | parent [-]
		Test it out here: https://github.com/NX-AI/mlstm_kernels https://huggingface.co/NX-AI/xLSTM-7b

▲

PaulHoule 5 months ago | parent | prev | next [-]

Personally I think foundation models are for the birds, the cost of developing one is immense and the time involved is so great that you can't do many run-break-fix cycles so you will get nowhere on a shoestring. (Though maybe you can get somewhere on simple tasks and synthetic data)

Personally I am working on a reliable model trainer for classification and sequence labeling tasks that uses something like ModernBERT at the front end and some kind of LSTM on the back end.

People who hold court on machine learning forums will swear by fine-tuned BERT and similar things but they are not at all interested in talking about the reliable bit. I've read a lot of arXiv papers where somebody tries to fine-tune a BERT for a classification task, runs some arbitrarily chosen parameters they got out of another paper and it sort-of works some of the time.

It drives me up the wall that you can't use early stopping for BERT fine-tuning like I've been using on neural nets since 1990 or so and if I believe what I'm seeing I don't think the networks I've been using for BERT fine-tuning can really benefit from training sets with more than a few thousand examples, emphasis on the "few".

My assumption is that everybody else is going to be working on the flashy task of developing better foundation models and as long as they emit an embedding-per-token I can plug a better foundation model in and my models will perform better.

▲

mindcrime 5 months ago | parent | next [-]

> Personally I think foundation models are for the birds,

I might not go quite that far, but I have publicly said (and will stand by the statement) that I think that training progressively larger and more complex foundation models is a waste of resources. But my view of AI is rooted in a neuro-symbolic approach, with emphasis on the "symbolic". I envision neural networks not as the core essence of an AI, but mainly as just adapters between different representations that are used by different sub-systems. And possibly as "scaffolding" where one can use the "intelligence" baked into an LLM as a bridge to get the overall system to where it can learn, and then eventually kick the scaffold down once it isn't needed anymore.

▲

tlb 5 months ago | parent | next [-]

We learned something pretty big and surprising from each new generation of LLM, for a small fraction of the time and cost of a new particle accelerator or space telescope. Compared to other big science projects, they're giving pretty good bang for the buck.

▲

PaulHoule 5 months ago | parent | prev | next [-]

I can sure talk your ear off about that one as I went way too far into the semantic web rabbit hole.

Training LLMs to use 'tools' of various types is a great idea, as is running them inside frameworks that check that their output satisfies various constraints. Still, certain problems remain problems: the NP-complete nature of SAT solving (and many intelligent systems problems, such as word problems you'd expect an A.I. to solve, boil down to SAT solving), the halting problem, Gödel's theorem and such. I understand Doug Hofstadter has softened his positions lately, but I think many of the problems set up in this book

https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach

(particularly the Achilles & Tortoise dialog) still stand today, as cringey as that book seems to me in 2025.

▲

throwawaymaths 5 months ago | parent | next [-]

i am hoping for an slm "turing tape" small language model where the tokens are instructions for a copycat engine

▲

mindcrime 5 months ago | parent | prev [-]

As somebody who considers himself something of a Semantic Web enthusiast / advocate, and has also read GEB, I can totally relate. To me, this is really one of those "THE ISSUE" things: how can we use some notion of formal logic to solve problems, without being forced to give up hope due to incompleteness and/or the Halting Problem. Clearly you have to give up something as a tradeoff for making this stuff tractable, but I suppose it's an open question what you can tradeoff and how exactly that factors into the algorithm, as well as what guarantees (if any) remain...

	▲	PaulHoule 5 months ago | parent [-]
		I would start with the fact that there is nothing consistent or complete about humans. Penrose's argument that he is a thetan because he can do math doesn't hold water.

▲

5 months ago | parent | prev | next [-]

[deleted]

▲

dr_dshiv 5 months ago | parent | prev [-]

Good old fashioned AI, amirite

▲

mindcrime 5 months ago | parent [-]

Well, to the extent that people equate GOFAI with purely symbolic / logic-based processing, then no, not for my money anyway. I think it's possible to construct systems that use elements of symbolic processing along with sub-symbolic approaches and get useful results. I think of it as (although this is something of an over-simplification) taking symbolic reasoning, relaxing some of the constraints that go along with the guarantees that method makes about the outputs, and accepting a (hopefully only slightly) less desirable output. OR, think about flipping the whole thing around: get an output from, say, an LLM where there might be hallucination(s), and then use a symbolic reasoning system to post-process the output to ensure veracity before sending it to the user. Amazon has done some work along those lines, for example. https://aws.amazon.com/blogs/machine-learning/reducing-hallu...

Anyway this is all somewhat speculative, and I don't want to overstate the "weight" of anything I seem to be claiming here. This is just the direction my interests and inclinations have taken me in.

▲

dr_dshiv 5 months ago | parent | next [-]

Maybe gen AI coding is neurosymbolic AI, realized differently than expected

	▲	mindcrime 5 months ago | parent [-]
		Never say never! I can't rule it out, for sure. :-)

▲

Xcelerate 5 months ago | parent | prev [-]

I’ve never liked that term “sub-symbolic”. It implies that there is something at a deeper level than what a Turing machine can compute (i.e., via the manipulation of strings of symbols), and as far as we can tell, there’s no evidence for that. It might be true, but even a quantum computer can be simulated on a classical computer. And of course neural networks run on classical computers too.

Yeah, I know that’s not what “symbol” is really referring to here in this context but I just don’t like what the semantics of the word suggests about neural networks — that they are somehow a halting oracle or hypercomputation — which they’re obviously not.

▲

dr_dshiv 5 months ago | parent | next [-]

Read Paul Smolensky’s paper on the harmonium. First restricted Boltzmann machine. The beginning helps justify subsymbolic in a pretty beautiful way.

▲

mindcrime 5 months ago | parent | prev [-]

It's not the name I would have chosen either (probably) but I wasn't around when those decisions were being made and nobody asked me for my opinion. So I just roll with it. What can ya do?

	▲	Xcelerate 5 months ago | parent [-]
		Oh for sure! Wasn’t critiquing your comment at all. I’ve seen the term a lot lately and it just made me wonder how much the industry is using it as a misleading hype factor. E.g., LLMs are “better” than Turing machines because they are operating at a level “below” Turing machines even though the comparison doesn’t make sense, as symbolic computation isn’t referring to the symbol-manipulating nature of Turing machines in the first place.

▲

NewUser76312 5 months ago | parent | prev [-]

Yeah I've been wondering how one can contribute and build in the LLM and AI world without the resources to work on foundation models.

Because personally I'm not a product/GPT wrapper person - it just doesn't suit my interests.

So then what can one do that's meaningful and valuable? Probably something around finetuning?

▲

happytoexplain 5 months ago | parent | prev | next [-]

I hate that popular domains take ownership of highly generic words. Many years ago, I struggled for a while to understand that when people say "frontend" they often mean a website frontend, even without any further context.

▲

perrygeo 5 months ago | parent | next [-]

The worst offender is "feature". In my domain (ML and geo) we have three definitions.

Feature could be referring to some addition to the user-facing product, a raster input to machine learning, or a vector entity in GeoJSON. Context is the only tool we have to make the distinction, it gets really confusing when you're working on features that involve querying the features with features.

	▲	janalsncm 5 months ago | parent | next [-]
		You can say the same thing about “model” even in ML. Depending on the context it can be quite confusing: 1) an architecture described in a paper 2) the trained weights of a specific instantiation of architecture 3) a chunk of code/neural net that accomplishes a task, agnostic to the above definitions
	▲	aqueueaqueue 5 months ago | parent | prev [-]
		Inference has 2 meanings too

▲

ArthurStacks 5 months ago | parent | prev [-]

That has been the case for about 30 years

▲

solresol 5 months ago | parent | prev | next [-]

I tried... it started with the idea that log loss might not be the best option for training, and maybe it should be a loss related to how wrong the predicted word was. Predicting "dog" instead of "cat" should be less penalised than predicting "running".

That turns out to be an ultrametric loss, and the derivative of an ultrametric loss is zero in a large region around any local minimum, so it can't be trained by gradient descent -- it has to be trained by search.

Punchline: it's about one million times less effective than a more traditional architecture. https://github.com/solresol/ultratree-results
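
To illustrate what an ultrametric loss over words could look like, here is a toy sketch. The taxonomy, the word list and the function names are all hypothetical and are not taken from the ultratree repo. The distance between two words is how far up a fixed-depth tree you have to climb to reach a common ancestor, which satisfies the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)).

```python
from itertools import product

# Hypothetical toy taxonomy: each word maps to its ancestors, root first.
# All paths have the same length, which is what makes the distance ultrametric.
PATHS = {
    "dog": ("noun", "animal", "pet"),
    "cat": ("noun", "animal", "pet"),
    "wolf": ("noun", "animal", "wild"),
    "chair": ("noun", "object", "furniture"),
    "running": ("verb", "motion", "gait"),
}


def tree_distance(x, y):
    """Number of levels to climb from the leaves to the lowest common ancestor."""
    px, py = PATHS[x] + (x,), PATHS[y] + (y,)
    shared = 0
    for a, b in zip(px, py):
        if a != b:
            break
        shared += 1
    return len(px) - shared


def is_ultrametric(words):
    """Check the strong triangle inequality on every triple of words."""
    return all(
        tree_distance(x, z) <= max(tree_distance(x, y), tree_distance(y, z))
        for x, y, z in product(words, repeat=3)
    )
```

With this toy tree, predicting "dog" for "cat" costs less than predicting "running". The loss only ever takes a handful of discrete values, so a small nudge to the model's parameters usually leaves it unchanged, which is consistent with the flat-gradient problem described above.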

▲

janalsncm 5 months ago | parent | prev | next [-]

There are alternatives that optimize around the edges, like DeepSeek’s Multi-head Latent Attention, or Grouped Query Attention. DeepSeek also showed an optimization on Mixture of Experts. These are all clear improvements to the Vaswani architecture.

There are optimizations like extreme 1.58 bit quant that can be applied to anything.

There are architectures that stray farther. Like SSMs and some attempts at bringing the RNN back from the dead. And even text diffusion models that try to generate paragraphs like we generate images i.e. not word by word.
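
As a rough illustration of the 1.58-bit idea (log2(3) is about 1.58, i.e. three possible values per weight), here is a minimal sketch of ternary quantization with an absolute-mean scale, which is my understanding of the scheme used in the BitNet b1.58 line of work. The function names are made up, and real implementations quantize during training rather than after the fact.

```python
def ternary_quantize(weights, eps=1e-8):
    """Map each weight to -1, 0 or +1 plus one shared float scale.

    The scale is the mean absolute weight; each weight is divided by it,
    rounded to the nearest integer and clipped to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [scale * q for q in quantized]
```

Since every quantized weight is -1, 0 or +1, a matrix multiply reduces to additions and subtractions followed by a single multiplication by the scale, which is where the claimed efficiency comes from.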

	▲	dr_dshiv 5 months ago | parent [-]
		Mixture of depths, too.

▲

mvieira38 5 months ago | parent | prev | next [-]

Related: There was buzz last year about Kolmogorov Arnold Networks, and https://arxiv.org/abs/2409.10594 was claiming KANs perform better than standard MLPs in the transformer architecture. Does anyone know of these being explored in the LLM space? KANs seem to have better properties regarding memory if I'm not mistaken.

	▲	pineapple_sauce 5 months ago | parent [-]
		I believe KAN hype died off due to practical reasons (e.g. FLOPs from implementation) and empirical results, i.e. people reproduced KANs and they found the claims/results made in the original paper were misleading. Here's a paper showing KANs are no better than MLPs, if anything they are typically worse when comparing fairly. https://arxiv.org/pdf/2407.16674

▲

Analemma_ 5 months ago | parent | prev | next [-]

Literally everybody doing cutting edge AI research is trying to replace the transformer, because transformers have a bunch of undesirable properties like being quadratic in context window size. But they're also surprisingly resilient: despite the billions of dollars and man-hours poured into the field and many attempted improvements, cutting-edge models aren't all that different architecturally from the original attention paper, aside from their size and a few incidental details like the choice of activation function, because nobody has found anything better yet.

I do expect transformers to be replaced eventually, but they do seem to have their own "bitter lesson" where trying to outperform them usually ends in failure.

	▲	PaulHoule 5 months ago | parent [-]
		My guess is there is a cost-capability tradeoff such that the O(N^2) really is buying you something you couldn't get for O(N). Behind that, there really are intelligent systems problems that boil down to solving SAT and should be NP-complete... LLMs may be able to short circuit those problems and get lucky guesses quite frequently, maybe the 'hallucinations' won't go away for anything O(N^2).

▲

hztar 5 months ago | parent | prev | next [-]

You have stuff like: https://www.literal-labs.ai/tsetlin-machines/ and https://tsetlinmachine.org/ European initiatives..

▲

mvieira38 5 months ago | parent [-]

52x less energy is crazy. Seems like it's in the veeery early stages, though, a quick search basically only yields the original paper and articles about it. This comment from the creator really shines light on the novel approach, though, which I find oddly antagonistic towards Big Tech:

"Where the Tsetlin machine currently excels is energy-constrained edge machine learning, where you can get up to 10000x less energy consumption and 1000x faster inference (https://www.mignon.ai). My goal is to create an alternative to BigTech’s black boxes: free, green, transparent, and logical (http://cair.uia.no)." (https://www.reddit.com/r/MachineLearning/comments/17xoj68/co...)

	▲	hztar 5 months ago | parent [-]
		It's true that Tsetlin Machines are currently a fringe area of ML research, especially compared to the focus on deep learning advancements coming out of SF and China. It's early days, but the energy efficiency potential is insane. I believe further investment could yield significant results. Having been supervised by the creator, I'm admittedly biased, but the underlying foundation in Tsetlin's learning automata gives it a solid theoretical grounding. Dedicated funding is definitely needed to explore its full potential.

▲

SiddanthEmani 5 months ago | parent | prev | next [-]

Titans has a new approach to longer and faster memory compared to transformers.

https://arxiv.org/html/2501.00663v1

▲

sgt101 5 months ago | parent | prev | next [-]

I'll see your architectural innovation and raise you a loss function revolution.

https://arxiv.org/pdf/2412.21149

▲

singularity2001 5 months ago | parent | prev | next [-]

Working on variants of Byte Latent Transformer [0] to get rid of tokenization which hinders mathematical performance and letter reflection.

In the original Byte Latent Transformer paper they reintroduce ugly caching and n-grams which I'm looking to eliminate.

As expected pure byte level Transformers need some rethinking to keep them performant, some kind of matryoshka mechanism so that long predictable byte sequences (words and phrases) get grouped into a single latent vector.

The idea is to apply this "Byteformer" not just on text but also on compiled files, songs etc.

If it's impossible to scale this architecture at least a modified tokenizer could be helpful which falls back to bytes / unicode once a number or an unfamiliar word is encountered.

[0] https://arxiv.org/abs/2412.09871
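
The fallback idea in the last paragraph can be sketched roughly like this. The tiny vocabulary and the id layout are hypothetical and only serve the illustration: known words get one id each, anything else (numbers, unfamiliar words) is spelled out as UTF-8 bytes, and the encoding round-trips losslessly.

```python
# Hypothetical toy vocabulary; a real tokenizer would have tens of thousands of entries.
VOCAB = {"the": 0, "cat": 1, "sat": 2}
ID_TO_WORD = {i: w for w, i in VOCAB.items()}
SPACE = len(VOCAB)        # dedicated id for the separator between words
BYTE_OFFSET = SPACE + 1   # byte value b is encoded as BYTE_OFFSET + b


def encode(text):
    """Known words become one id; everything else falls back to raw bytes."""
    ids = []
    for n, word in enumerate(text.split(" ")):
        if n > 0:
            ids.append(SPACE)
        if word in VOCAB:
            ids.append(VOCAB[word])
        else:
            ids.extend(BYTE_OFFSET + b for b in word.encode("utf-8"))
    return ids


def decode(ids):
    """Invert encode(): collect runs of byte ids and decode them as UTF-8."""
    out, pending = [], bytearray()

    def flush():
        if pending:
            out.append(pending.decode("utf-8"))
            pending.clear()

    for i in ids:
        if i >= BYTE_OFFSET:
            pending.append(i - BYTE_OFFSET)
        else:
            flush()
            out.append(" " if i == SPACE else ID_TO_WORD[i])
    flush()
    return "".join(out)
```

For example, `encode("the cat sat 3.14")` uses three word ids for the known words and four byte ids for "3.14", so each digit stays visible to the model instead of being merged into an arbitrary multi-digit token.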

▲

htrp 5 months ago | parent | prev | next [-]

Anyone know what the rwkv people are up to now?

https://arxiv.org/abs/2305.13048

	▲	viraptor 5 months ago | parent [-]
		You can see all the development directly from them: https://github.com/BlinkDL/RWKV-LM Last week version 7 was released and every time they make significant improvements.

▲

mbloom1915 5 months ago | parent | prev | next [-]

AI aside, the world could also use an alternative electric transformer. The backlog from main suppliers is 40+ weeks and far too expensive. There is a MAJOR manuf and supply issue here as all new build construction competes for same equipment...

	▲	aqueueaqueue 5 months ago | parent [-]
		Could you use a capacitor? Charge them in series with the high voltage then discharge in parallel for the low voltage.

▲

kolinko 5 months ago | parent | prev | next [-]

Not a new model per se, but a new algorithm for inference - https://kolinko.github.io/effort/

▲

londons_explore 5 months ago | parent | prev | next [-]

There are a bunch of promising other ways to convert AC voltages.

The main one is the observation that the transformer uses an amount of copper and steel proportional to the power transmitted but inversely proportional to the frequency of operation.

The copper and steel cost of a transformer is the main cost (multiplied by the cost of capital for the 100+ years it will operate).

So if you can use solid state electronics to do switching at a higher frequency (switched mode power supplies, flyback designs, etc), then you can reduce the overall cost.

▲

AustinDev 5 months ago | parent [-]

Is this response AI generated or are you lost?

	▲	lurquer 5 months ago | parent | next [-]
		There are several Transformers successors and spinoffs across various media. Animated series like Beast Wars: Transformers (1996) introduced a new generation of transforming robots, shifting from vehicles to animals. Later, Transformers: Prime (2010) and Transformers: Cyberverse (2018) continued evolving the story and animation style. The live-action film series, starting in 2007, led to spinoffs like Bumblebee (2018) and an expanding cinematic universe. Beyond official media, franchises like Go-Bots (a competitor turned subsidiary) and Voltron (though distinct, often compared) reflect Transformers’ legacy in robot-focused storytelling.
	▲	londons_explore 5 months ago | parent | prev [-]
		The poster didn't specify which type of transformer...

▲

vednig 5 months ago | parent | prev | next [-]

I've a design in mind which is very simple and interesting but don't know if it would be scalable to the stage, rn it's just a superficial design inspired by IronMan's JARVIS, i'm working on preparing the architecture.

▲

neom 5 months ago | parent | prev | next [-]

https://github.com/triadicresonance/triadic this was on one of the llm discord servers few weeks ago

▲

scotty79 5 months ago | parent | prev | next [-]

This looks interesting: https://www.youtube.com/watch?v=ZLtXXFcHNOU

Chain of thought in latent space.

▲

korbip 5 months ago | parent | prev | next [-]

There is a LOT of effort in the research community currently:

1. Improving the Self-Attention in the Transformer as is, keeping the quadratic complexity, which has some theoretical advantage in principle[1]: The most hyped one is probably DeepSeek's Multi-head Latent Attention[15], which kind of is Attention still - but also somehow different.

2. Linear RNNs: This starts from Linear Attention[2], DeltaNet[3], RWKV[4], Retention[5], Gated Linear Attention[6], Mamba[7], Griffin[8], Based[9], xLSTM[10], TTT[11], Gated DeltaNet[12], Titans[13].

They all have an update like: C_{t} = F_{t} C_{t-1} + i_{t} k_{t} v_{t}^T with a cell state C and output h_{t} = C_{t}^T q_{t}. There are a few tricks that made these work, and they are now very strong competitors to Transformers. The key here is the combination of a linear associative memory (aka Hopfield Network, aka Fast Weight Programmer, aka State Expansion...) and pushing it into a sequence with gating similar to the original LSTM (input, forget, output gate) - while here the gating depends only on the current input, not the previous state, to keep things linear. The linearity is needed to make it sequence-parallelizable; there are efforts now to add non-linearities again, but let's see. Their main benefit and downside at once is that they have a fixed-size state, and therefore linear (vs Transformer-quadratic) time complexity.

For larger sizes they have become popular in hybrids with Transformer (Attention) Blocks, as there are problems with long context tasks [14]. Cool thing is they can also be distilled from pre-trained Transformers with not too much performance drop [16].

3. Along the sequence dimension most things can be categorized in these two. Attention and Linear (Associative Memory Enhanced) RNNs are heavily using Matrix Multiplications and anything else would be a waste of FLOPs on current GPUs. The essence is how to store information and how to interact with it, there might be still interesting directions as other comments show. Other important topics that go into the depth / width of the model are: Mixture of Experts, Iteration (RNNs) in Depth[17].
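
A minimal sketch of the update from point 2, assuming scalar forget and input gates (in the actual models F_t can be a vector or matrix, and there are normalization and output-gate details left out here). The function names are hypothetical. The first function carries the fixed-size state C forward; the second computes the same outputs in the quadratic "attention" form, which shows why these models are also called linear attention.

```python
def linear_rnn(qs, ks, vs, forget_gates, input_gates):
    """Recurrent form: C_t = f_t * C_{t-1} + i_t * k_t v_t^T, h_t = C_t^T q_t."""
    d_k, d_v = len(ks[0]), len(vs[0])
    C = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of length
    hs = []
    for q, k, v, f, i in zip(qs, ks, vs, forget_gates, input_gates):
        C = [[f * C[a][b] + i * k[a] * v[b] for b in range(d_v)] for a in range(d_k)]
        hs.append([sum(C[a][b] * q[a] for a in range(d_k)) for b in range(d_v)])
    return hs


def attention_form(qs, ks, vs, forget_gates, input_gates):
    """Equivalent quadratic form: each step looks back over all earlier tokens."""
    hs = []
    for t, q in enumerate(qs):
        h = [0.0] * len(vs[0])
        for s in range(t + 1):
            decay = 1.0
            for r in range(s + 1, t + 1):  # forget gates applied since token s
                decay *= forget_gates[r]
            score = sum(ka * qa for ka, qa in zip(ks[s], q))
            weight = decay * input_gates[s] * score
            h = [hb + weight * vb for hb, vb in zip(h, vs[s])]
        hs.append(h)
    return hs
```

Both functions should return the same outputs up to floating-point error. The recurrent one does constant work per token; the attention form makes explicit that there is no softmax over the scores, which is the linearity that allows the state to be summarized in C.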

Disclaimer: I'm author of xLSTM and we recently released a 7B model [18] trained at NXAI, currently the fastest linear RNN at this scale and performance. Happy to answer more questions on this or the current state in this field of research.

[1] https://arxiv.org/abs/2008.02217

[2] https://arxiv.org/abs/2006.16236

[3] https://arxiv.org/pdf/2102.11174

[4] https://github.com/BlinkDL/RWKV

[5] https://arxiv.org/abs/2307.08621

[6] https://arxiv.org/abs/2312.06635

[7] https://arxiv.org/pdf/2312.00752

[8] https://arxiv.org/pdf/2402.19427

[9] https://arxiv.org/abs/2402.18668

[10] https://arxiv.org/abs/2405.04517

[11] https://arxiv.org/abs/2407.04620

[12] https://arxiv.org/abs/2412.06464

[13] https://arxiv.org/abs/2501.00663

[14] https://arxiv.org/abs/2406.07887

[15] https://arxiv.org/abs/2405.04434

[16] https://arxiv.org/abs/2410.10254

[17] http://arxiv.org/abs/2502.05171

[18] https://huggingface.co/NX-AI/xLSTM-7b

▲

joshhug 5 months ago | parent | prev | next [-]

I am told that an interesting alternative is the Structured State Space for Sequence Modeling (S4). I don't personally know much about this technique, but didn't see anybody else mention this in this thread.

https://srush.github.io/annotated-s4/

▲

fred_is_fred 5 months ago | parent | prev | next [-]

The one from cybertron? The one that changes voltage levels? The AI algorithm one?

Edit: or perhaps you are working on a new insect sex regulation gene? If so that would be a great discussion here - https://en.wikipedia.org/wiki/Transformer_(gene)

▲

ai-christianson 5 months ago | parent | prev | next [-]

Not an alternative transformer like you asked for, but OptiLLM looks interesting for squeezing more juice out of existing LLMs.

▲

marshughes 5 months ago | parent | prev | next [-]

Absolutely! The QNN architecture based on quantum computing concepts shows great potential. It breaks through traditional computing models and may outperform Transformers in complex tasks. Do you have any research on the combination of quantum computing and AI?

▲

Alifatisk 5 months ago | parent | prev | next [-]

I found the model Microsoft Tay was built on to be quite interesting, forgot the name of it.

▲

seydor 5 months ago | parent | prev | next [-]

Perhaps people feel that the problem of modeling long-range token relationships has been solved. The problem is now how to get this model to produce tokens that are valid and ingenious solutions to problems, with RL or otherwise.

▲

jostmey 5 months ago | parent | prev | next [-]

My guess is that new architectures will be about doing more with less compute. For example, are there architectures that can operate at lower bit precision or better turn off and on components as required by the task?

▲

swyx 5 months ago | parent | prev | next [-]

yes: https://www.latent.space/p/2024-post-transformers

▲

dartos 5 months ago | parent | prev | next [-]

RWKV. It’s a Linux foundation project as well.

▲

freeone3000 5 months ago | parent | prev | next [-]

I’m working with a group on an RL core with models as tool use, for explainable agentic tasks with actual discovery.

▲

5 months ago | parent | prev | next [-]

[deleted]

▲

Jotalea 5 months ago | parent | prev | next [-]

Yes, I am building one.

Is it an alternative? Yes.

Is it better? Hell no.

▲

celestiallylvd1 5 months ago | parent | prev | next [-]

Yes, I am building a Perfect Language Model

▲

pestatije 5 months ago | parent | prev | next [-]

please define transformer

▲

herpdyderp 5 months ago | parent | next [-]

Until I read this comment I thought we were talking about https://en.wikipedia.org/wiki/Transformer and I was very confused...

▲

jaylaal 5 months ago | parent | prev | next [-]

Robots in disguise.

	▲	aqueueaqueue 5 months ago | parent [-]
		$reddit_award

▲

janalsncm 5 months ago | parent | prev [-]

https://en.m.wikipedia.org/wiki/Transformer_(deep_learning_a...

▲

cshimmin 5 months ago | parent [-]

Yeah, it's literally the most important practical development in AI/ML of the decade. This is like reading an article (or headline, more like) on HN and saying "please define git".

▲

yukinon 5 months ago | parent | next [-]

Not everyone is aware of the details of AI/ML, "transformer" is actually a specific term in the space that also overlaps with "transformer" in other fields adjacent to Software Development. This is when we all need to wear our empathy hat and remind ourselves that we exist in a bubble, so when we see an overloaded term, we should add even the most minimal context to help. OP could have added "AI/ML" in the title for minimal effort and real estate. Let's not veer towards the path of elitism.

Also, the majority of developers using version control are using Git. I guarantee the majority of developers outside the AI/ML bubble do not know what a "transformer" is.

▲

cshimmin 5 months ago | parent [-]

Fair enough! Bubble or not, I certainly have very regularly (weekly?) seen headlines on hn about transformers for at least a few years now. Like how bitcoin used to be on hn frontpage every week for a couple years circa 2010 (to the derision of half of the commenters). Not everyone is in the crypto space, but they know what bitcoin is.

Anyhow I suppose the existence of such questions on hn is evidence that I'm in more of a bubble than I estimated, thanks for the reality check :)

(also my comment was in defense of parent who linked the wiki page, which defines transformer as per request, and is being downvoted for that)

	▲	stavros 5 months ago | parent [-]
		I, too, haven't seen the word "transformer" outside an ML context in months. Didn't stop me from wondering if the OP meant the thing that changes voltage.

▲

happytoexplain 5 months ago | parent | prev [-]

>This is like ... saying "please define git"

It's really not. "Git" has a single extremely strong definition for tech people, and a single regional slang definition. "Transformer" has multiple strong definitions for tech people, and multiple strong definitions colloquially.

Not that we can't infer the OP's meaning - just that it's nowhere near as unambiguous as "git".

▲

quantadev 5 months ago | parent | prev | next [-]

Right now as long as the rocket's heading straight up, everyone's on board with MLPs (Multilayer Perceptrons/Transformers)! Why not stay on the same rocket for now!? We're almost at AGI already!

▲

cshimmin 5 months ago | parent | next [-]

I wouldn't conflate MLPs with transformers, MLP is a small building block of almost any standard neural architecture (excluding spiking/neuromorphic types).

But to your point, the trend towards increasing inference-time compute costs, being ushered by CoT/reasoning models is one good reason to look for equally capable models that can be optimized for inference efficiency. Traditionally training was the main compute cost, so it's reasonable to ask if there's unexplored space there.

	▲	quantadev 5 months ago | parent [-]
		What I meant by "NNs and Transformers" is that once we've found the magical ingredient (and we've found it) people tend to all be focused in the same area of research. Mankind just got kinda lucky that all this can run on essentially game graphics boards!

▲

drdeca 5 months ago | parent | prev [-]

Why are you conflating MLPs in general with specifically transformers?

▲

quantadev 5 months ago | parent [-]

I consider MLPs the building blocks of all this, and is what makes things a neural net, as opposed to some other data structure.

▲

drdeca 5 months ago | parent [-]

Sure. But that isn’t a reason to conflate the two?

OP wasn’t suggesting looking for an alternative/successor to MLPs, but for an alternative/successor to transformers (while presumably still using MLPs) in the same way that transformers are an alternative/successor to LSTMs.

	▲	quantadev 5 months ago | parent [-]
		And that sort of proves my original point which is that we're probably gonna keep riding the same wave as far as it will go!! i.e. keep the tech stack mostly with just what we know works best.

▲

ipunchghosts 5 months ago | parent | prev | next [-]

Yes. Happy to chat if u msg me. Using RL coupled with NNs to integrate search directly into inference instead of as an afterthought like Chain of Thought and test-time training.

▲

almosthere 5 months ago | parent [-]

Are we able to "msg" people on here?

	▲	viraptor 5 months ago | parent | next [-]
		Only if they explicitly make the email public in the profile. It's hidden by default.
	▲	Joel_Mckay 5 months ago | parent | prev | next [-]
		No, thank god... =3
	▲	aqueueaqueue 5 months ago | parent | prev [-]
		Nope. Just clean carbs.

▲

ActorNightly 5 months ago | parent | prev [-]

The problem is, if they are trying to build a new architecture, it's just wasted effort.

To truly build AI, it needs to self-configure. I tried doing some work in the past with particle swarm optimization of models, but I didn't really get anywhere.