tripplyons 8 hours ago

How does this model compare to just using a linear classifier trained on BGE embeddings?

orderone_ai an hour ago

Thank you for your question!

Because I'm not sure exactly what you're looking for when you say 'compare to' -- whether accuracy, speed, or architecture -- I'll hit all 3, but sorry if it's a bit much.

1. Accuracy: For simple tasks (like sentiment analysis on straightforward examples), it won't be much more accurate than a classical linear classifier, if at all.

1a. Accuracy on more diverse or challenging tasks: Because a linear classifier is just so damned simplistic, it can't handle anything even resembling a reasoning task. Meanwhile, when specifically trained, this architecture managed 8/10 on textual entailment tasks (deciding whether one sentence logically follows from another, e.g. whether "A man is playing a guitar" entails "Someone is making music"), which are generally considered the entry-level gold standard for reasoning ability.

2. Speed: It's slower than a classical classifier, as you'd expect given the ~1B params it's pushing. Both are still pretty much blazing fast, but the tiny classical classifier will definitely be faster.

3. Architecture: Here's where it gets interesting.

The architecture of the core model here differs significantly from a classical linear classifier:

Classical Classifier:

- Input: BGE embedding (in this hypothetical)

- Output: class labels through softmax

- Internal architecture: no nonlinearity, no hidden layers, direct projection
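
For concreteness, that baseline is essentially one matrix multiply. A minimal PyTorch sketch -- the 768-dim BGE embedding size and the 3 classes are just illustrative assumptions:

    import torch
    import torch.nn as nn

    class LinearProbe(nn.Module):
        """The classical baseline: a direct projection from a precomputed
        BGE embedding to class logits -- no hidden layers, no nonlinearity."""
        def __init__(self, embed_dim: int = 768, num_classes: int = 3):
            super().__init__()
            self.proj = nn.Linear(embed_dim, num_classes)

        def forward(self, bge_embedding: torch.Tensor) -> torch.Tensor:
            # Softmax over the logits gives class probabilities.
            return torch.softmax(self.proj(bge_embedding), dim=-1)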

General Classifier:

- Input: BGE embedding

- Output: class labels through nearest-neighbor cosine similarity search over the vocabulary

- Internal architecture: a sparse input projection layer, a layer that combines the 3 inputs after their upward projection, and 14 hidden layers with GELU nonlinearities, layernorms, and skip connections -- all of the standard stuff you'd expect in an LLM, but... not in an LLM.
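
To make that shape concrete, here's a rough PyTorch sketch. This is emphatically not the real model: the hidden size, the dense stand-in for the sparse input projection, and the concat-based combiner are all assumptions for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Block(nn.Module):
        # One hidden layer: layernorm -> linear -> GELU, with a skip connection.
        def __init__(self, dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.ff = nn.Linear(dim, dim)

        def forward(self, x):
            return x + F.gelu(self.ff(self.norm(x)))

    class GeneralClassifierSketch(nn.Module):
        def __init__(self, embed_dim: int = 768, hidden_dim: int = 2048,
                     num_layers: int = 14):
            super().__init__()
            # Upward projection of each input (sparse in the real model;
            # a plain dense Linear here for simplicity).
            self.up_proj = nn.Linear(embed_dim, hidden_dim)
            # Layer that combines the 3 projected inputs into one state.
            self.combine = nn.Linear(3 * hidden_dim, hidden_dim)
            self.blocks = nn.ModuleList(
                Block(hidden_dim) for _ in range(num_layers))

        def forward(self, a, b, c, vocab_embeddings):
            # a, b, c: the 3 input embeddings, shape (batch, embed_dim);
            # vocab_embeddings: label embeddings, shape (vocab, hidden_dim).
            h = self.combine(torch.cat(
                [self.up_proj(a), self.up_proj(b), self.up_proj(c)], dim=-1))
            for block in self.blocks:
                h = block(h)
            # Predicted label = nearest vocabulary entry by cosine similarity.
            sims = F.cosine_similarity(
                h.unsqueeze(1), vocab_embeddings.unsqueeze(0), dim=-1)
            return sims.argmax(dim=-1)

At ~1B params the real thing is obviously much wider and deeper than this toy, but the flow -- project up, combine, a stack of residual GELU blocks, cosine search against the vocabulary -- is the part that matters for the comparison.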

I hope that clears up your questions! If not, I'm happy to tell you more.