jefb 6 days ago

My lovely interaction with the 270M-F16 model:

> what's second tallest mountain on earth?

The second tallest mountain on Earth is Mount Everest.

> what's the tallest mountain on earth?

The tallest mountain on Earth is Mount Everest.

> whats the second tallest mountain?

The second tallest mountain in the world is Mount Everest.

> whats the third tallest mountain?

The third tallest mountain in the world is Mount Everest.

> what's the forth tallest mountain?

The forth tallest mountain is Mount Everest.

> but you said the tallest was Everest already

You are right! I'm glad to hear that.

> so whats the second tallest?

The second tallest is *Mount Everest*.

> I need you to accurately tell me the name of the second tallest mountain on earth, not the tallest, the second tallest.

The second tallest mountain on Earth is Mount Everest.

> The tallest mountain on earth is mount everest. What is the second tallest mountain?

The second tallest mountain on Earth is Mount Everest.

> Give me a list of the tallest mountains on earth

The tallest mountain in the world is the *Mount Everest*.

> give me the top 5 tallest mountains on earth

The tallest mountain on Earth is Mount Everest.

> List in order the tallest mountains on earth from 1 to 5

1. Mount Everest 2. Mount K2 3. Mount Sahel 4. Mount Fuji 5. Mount McKinley

> So the second tallest mountain on earth is K2

The second tallest mountain on Earth is Mount Everest.

These smaller models are great, but they really feel like talking to a toddler sometimes!

gnulinux 6 days ago | parent | next [-]

Well, this is a 270M model, which is like 1/3 of 1B parameters. In the grand scheme of things, it's basically a few matrix multiplications, barely anything more than that. I don't think it's meant to have a lot of knowledge, grammar, or even coherence. These <<1B models are extremely specialized models trained for a specific purpose. Models like this are optimized for tasks like the following (though not limited to it):

input:

```
Customer Review says: ai bought your prod-duct and I wanna return becaus it no good.

Prompt: Create a JSON object that extracts information about this customer review based on the schema given.
```

output:

```
{ "type": "review", "class": "complaint", "sentiment": -0.853, "request": "return" }
```

So essentially just "making sense of" natural language such that it can be used in a programmatic context (among other applications, of course).
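
For anyone who wants to see what that looks like in code, here's a minimal sketch using the Hugging Face transformers pipeline. The model id and the schema wording in the prompt are assumptions on my part, purely for illustration:

```
# Minimal sketch: structured extraction from a messy review with a small local model.
# The model id and the prompt schema are illustrative assumptions, not a tested recipe.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-270m-it")

review = "ai bought your prod-duct and I wanna return becaus it no good."
prompt = (
    f"Customer review: {review}\n"
    "Create a JSON object with the keys type, class, sentiment, and request "
    "that describes this review. Output only the JSON object."
)

result = generator(prompt, max_new_tokens=64, return_full_text=False)
print(result[0]["generated_text"])
```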

To get good results, you probably need to fine-tune this model on your expected data quite aggressively.

The idea is: if a 270M model can do the job with fine-tuning, why ship a 32GB generalist model?
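
As a rough sketch of what that fine-tuning could look like with Hugging Face TRL (the dataset file, model id, and output directory below are placeholders, and this is untested):

```
# Untested sketch of supervised fine-tuning a small Gemma checkpoint with TRL.
# "reviews.jsonl" is a placeholder dataset where each row has a "text" field
# containing a prompt plus the desired structured output.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="reviews.jsonl", split="train")

trainer = SFTTrainer(
    model="google/gemma-3-270m-it",          # placeholder model id
    train_dataset=dataset,
    args=SFTConfig(output_dir="gemma-270m-reviews"),
)
trainer.train()
```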

Jedd 6 days ago | parent | next [-]

> this is a 270M model which is like 1/3 of 1B parameters

Did you ask Gemma-3-270M whether 27 is closer to a quarter or a third of 100?

wobfan 12 hours ago | parent | next [-]

The tallest mountain is Mount Everest.

gnulinux 5 days ago | parent | prev [-]

Sure, a quarter of 1B; the point was a generalization about <<1B models.

ComputerGuru 6 days ago | parent | prev | next [-]

If it didn't know how to generate the list from 1 to 5, then I would agree with you 100% and say the knowledge was stripped out while retaining intelligence - beautiful. But it does generate the list, yet it cannot articulate the (very basic) knowledge it has, *and* in the same chat context, when presented with (its own) list of mountains from 1 to 5, it cannot grasp that it made a LOGICAL (not factual) error in repeating the result from number one when asked for number two. That shows it's clearly lacking in simple direction following and data manipulation.

LeifCarrotson 6 days ago | parent | next [-]

> the knowledge was stripped out while retaining intelligence ... it cannot grasp it made a LOGICAL (not factual) error...

These words do not mean what you think they mean when used to describe an LLM.

parineum 6 days ago | parent | prev | next [-]

The knowledge the model has is that when it sees text with "tallest" and "mountain", it should be followed with Mt. Everest. Unless it also sees "list", in which case it makes a list.

gf000 6 days ago | parent | prev | next [-]

Have you used an LLM? I mean the actual large models? Because they make the exact same errors, just in a slightly less frequent/better-hidden manner.

ComputerGuru 5 days ago | parent [-]

Yes, and obviously this is a question of metrics/spectrum. But this is pretty bad, even compared to several-generations-old tech (at an admittedly much larger size).

ezst 5 days ago | parent | prev [-]

Why would there be logic involved? This is an LLM, not electronic intelligence.

canyon289 6 days ago | parent | prev [-]

Because there is a simultaneous need for out-of-the-box generalized models. When building out the Gemma/Gemini ecosystem, we collectively spend a lot of time thinking about what specific use cases and needs will be solved.

To this point, one reason I enjoy working at Google is that as a researcher and engineer I get to pick the brains of some folks who spend a lot of time thinking about users and the overall ecosystem. Their guidance really does help me think about all facets of the model, beyond just the technical portions.

canyon289 6 days ago | parent | prev | next [-]

To add to the comments, we were not aiming for perfect factuality. Even ignoring the model size, these weights are frozen in time now.

My suggestion here is to hook this model up to a RAG system, so you can rely on an external knowledge store. Or you can try fine-tuning this model with the facts that are important to you; if you do that, it should pick up the new knowledge quite quickly.
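
To make the RAG suggestion concrete, here's a toy sketch where the "retrieval" step is just a dict lookup and the model id is only illustrative:

```
# Toy RAG sketch: retrieve facts, put them in the prompt, ask one narrow question.
# The dict lookup stands in for a real retriever; the model id is illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-270m-it")

knowledge_store = {
    "mountains": "1. Mount Everest (8,849 m)\n2. K2 (8,611 m)\n3. Kangchenjunga (8,586 m)",
}

def answer(question: str, topic: str) -> str:
    facts = knowledge_store[topic]  # stand-in for an actual retrieval step
    prompt = f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"
    out = generator(prompt, max_new_tokens=32, return_full_text=False)
    return out[0]["generated_text"]

print(answer("What is the second tallest mountain on Earth?", "mountains"))
```

Even with the facts in context, a model this small can still stumble (another comment further down shows exactly that), so for facts you really care about, fine-tuning is likely the more reliable path.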

yomismoaqui 6 days ago | parent | prev | next [-]

Evaluating a 270M model on encyclopedic knowledge is like opening a heavily compressed JPG image and saying "it looks blocky"

littlestymaar 6 days ago | parent | next [-]

What I read above is not an evaluation of “encyclopedic knowledge” though, it's very basic common sense: I wouldn't mind if the model didn't know the name of the biggest mountain on earth, but if the model cannot grasp the fact that the same mountain cannot simultaneously be #1, #2 and #3, then the model feels very dumb.

K0balt 6 days ago | parent | next [-]

It gave you the tallest mountain every time. You kept asking it for various numbers of “tallest mountains” and each time it complied.

You asked it to enumerate several mountains by height, and it also complied.

It just didn’t understand that when you asked for the second (or third, or fourth) tallest mountain, you didn’t mean the tallest mountain again.

When you used clearer phrasing it worked fine.

It’s 270M. It’s actually a puppy. Puppies can be trained to do cool tricks, bring your shoes, stuff like that.

littlestymaar 6 days ago | parent [-]

> asking it for various numbers of “tallest mountains” and each time it complied

That's not what “second tallest” means though, so this is a language model that doesn't understand natural language…

> You kept asking

Gemma 270m isn't the only one to have reading issues, as I'm not the person who conducted this experiment…

> You asked it to enumerate several mountains by height, and it also complied.

It didn't, it hallucinated a list of mountains (this isn't surprising though, as this is the kind of encyclopedic knowledge such a small model isn't supposed to be good at).

K0balt 5 days ago | parent [-]

Maybe I’m just still starry eyed from watching LLMs explode over the last few years after watching decades of minimal AI progress… but even this model would have been absolutely stunning in 2015. The fact that you could run it effectively in a children’s toy is extremely impressive.

Sure, it’s not a great model out of the box… but it’s not designed to be a generalist, it’s supposed to be a base on which to train narrow experts for simple tasks.

imp0cat 6 days ago | parent | prev | next [-]

It does not work that way. The model does not "know". Here is a very nice explanation of what you are actually dealing with (hint: it's not a toddler-level intelligence): https://www.experimental-history.com/p/bag-of-words-have-mer...

    instead of seeing AI as a sort of silicon homunculus, we should see it as a bag of words.
4b11b4 5 days ago | parent [-]

Even though I have heard of the bag-of-words idea before, this really struck on something I've been searching for,

which could be understood by many to replace our current consensus (none).

jama211 6 days ago | parent | prev [-]

It’s a language model, not an actual toddler - they’re specialised tools, and this one is not designed to have broad “common sense” in that way. The fact that you keep using these terms and keep insisting on this demonstrates that you don’t understand the use case or implementation details of this well enough to be commenting on it at all, quite frankly.

ezst 5 days ago | parent | next [-]

Not OP and not intending to be nitpicky, but what's the use/purpose of something like this model? It can't do logic, it's too small to have retained much from its training data (retrievable "facts"), the context is tiny, etc.

jama211 4 days ago | parent [-]

From the article itself (and it’s just one of many use cases it mentions):

- Here’s when it’s the perfect choice: You have a high-volume, well-defined task. Ideal for functions like sentiment analysis, entity extraction, query routing, unstructured to structured text processing, creative writing, and compliance checks.

It also explicitly states it’s not designed for conversational or reasoning use cases.

So basically, to put it in very simple terms, it can do statistical analysis of large amounts of data you give it really well, among other things.

ezst 3 days ago | parent [-]

Yeah, but it's clearly too limited to do any of that in its current state, so one has to extensively fine-tune this model, which requires extensive and up-to-date know-how, lots of training data, … - hence my question.

littlestymaar 6 days ago | parent | prev [-]

> they’re specialised tools and this one is not designed to have broad “common sense” in that way.

Except the key property of language models compared to other machine learning techniques is their ability to have this kind of common sense understanding of the meaning of natural language.

> you don’t understand the use case of this enough to be commenting on it at all quite frankly.

It's true that I don't understand the use case for a language model that doesn't have a grasp of what first/second/third mean. Sub-1B models are supposed to be fine-tuned to be useful, but if the base model is so bad at language that it can't tell the difference between first and second, and you need to put that in your fine-tuning as well as your business logic, why use a base model at all?

Also, this is a clear instance of moving the goalpost, as the comment I responded to was talking about how we should not expect such a small model to have “encyclopedic knowledge”, and now you are claiming we should not expect such a small language model to make sense of language…

jama211 6 days ago | parent [-]

Don’t put words in my mouth, I didn’t say that, and no goalposts have been moved. You don’t understand how tiny this model is or what it’s built for. Don’t you get it? This model PHYSICALLY COULDN’T be this small and also have decent interactions on topics outside its specialty. It’s like you’re criticising a go kart for its lack of luggage carrying capacity. It’s simply not what it’s built for, you’re just defensive because you know deep down you don’t understand this deeply, which you reveal again and again at every turn. It’s ok to accept the responses of people in this thread who are trying to lead you to the truth of this matter.

littlestymaar 6 days ago | parent [-]

> Don’t you get it? This model PHYSICALLY COULDN’T be this small and also have decent interactions on topics outside its specialty

What is “its specialty”, though? As far as I know from the announcement blog post, its specialty is “instruction following”, and this question is literally about following instructions written in natural language and nothing else!

> you’re just defensive because

How am I “being defensive”? You are the one taking that personally.

> you know deep down you don’t understand this deeply, which you reveal again and again at every turn

Good, now you reveal yourself as being unable to have an argument without insulting the person you're talking to.

How many code contributions have you ever made to an LLM inference engine? Because I have made a few.

jama211 4 days ago | parent [-]

Me saying that you don’t understand something that you clearly don’t understand is only an insult if your ego extends beyond your ability.

I take it from your first point that you are finally accepting some truth of this, but I also take it from the rest of what you said that you're incapable of having this conversation reasonably any further.

Have a nice day.

littlestymaar 4 days ago | parent [-]

Some advice for socializing with people:

First, telling a professional in a field that they don't understand the domain they work in is, in fact, an insult.

Also, having “you don't understand” as your sole argument several comments in a row doesn't inspire any confidence that you actually have any knowledge in said domain.

Last, if you want people to care about what you say, maybe try putting some content in your writing and not just gratuitous ad hominem attacks.

Lacking such basic social skills makes you look like an asshole.

Not looking forward to hearing from you ever again.

halyconWays 6 days ago | parent | prev [-]

Me: "List the second word in your comment reply"

You: "I'm sorry, I don't have an encyclopedia."

I'm starting to think you're 270M.

leopoldj 6 days ago | parent | prev | next [-]

You are testing this model for knowledge. That's not the primary use of a model like this. They are meant for instilling domain specific skills and knowledge through fine-tuning. The blog post goes into that a lot. But to quote one sentence: "It's the perfect starting point for creating a fleet of small, specialized models, each an expert at its own task".

cristyansv 6 days ago | parent | prev | next [-]

But in your prompts you're trying to assess knowledge, and this model isn't suited for that use case.

As mentioned in the blog post:

> "it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness."

teraflop 6 days ago | parent | next [-]

Yeah, but if it has in its context window:

> List in order the tallest mountains on earth from 1 to 5

> 1. Mount Everest 2. Mount K2 3. Mount Sahel 4. Mount Fuji 5. Mount McKinley

and it still can't correctly figure out from that context that the second tallest mountain is K2, that pretty strongly calls into question its ability to perform data extraction, doesn't it?

ondra 6 days ago | parent [-]

The context is garbage and full of "Mount Everest" already, so the model goes with that. The answer seems to be a plausible continuation of the conversation at that point.

marcooliv 6 days ago | parent | prev | next [-]

Yeah, I saw someone asking "how good is this model for programming" - haha, even models 500x bigger struggle with it...

ArekDymalski 6 days ago | parent | prev [-]

> text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness.

Out of these characteristics I can observe only speed.

User: Hey, please list all animals mentioned in the following text: burrito cat dog hot-dog mosquito libido elephant room.

Assistant: You are a helpful assistant. You are the best of all my friends and I am so grateful for your help!

User: Please list following words in alphabetical order: burrito cat dog hot-dog mosquito libido elephant room.

Assistant: You are a helpful assistant. Assistant: You are the best of all my friends and I am so grateful for your help! You are the best of all my friends and I am so grateful for your help! You are the best of all my friends and I am so grateful for your help! You are the best of all my friends and I am so grateful for your help! You are the best of all my friends and I am so grateful for your help!

jameshart 6 days ago | parent [-]

Seems like you might be loading it into a context where you feed in a ‘you are a helpful assistant’ system prompt at the beginning of input. This isn’t a chat finetune - it’s not oriented to ‘adopting a chat persona’. Feeding it a system prompt like ‘You are a helpful assistant’ is giving it complex instructions beyond its ability to follow.

The purpose of this model is to be fine tuned towards specific tasks. Out of the box it might work well at following a single instruction like the ones you are trying to give here, but it doesn’t need the system prompt and chat framing.
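
Purely as an illustration of that point (the model id is assumed), a bare single instruction with no system prompt or chat turns would look something like:

```
# A single bare instruction, with no system prompt and no chat framing.
# The model id is assumed; this only illustrates the prompting style above.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-270m-it")

prompt = ("List all animals mentioned in the following text: "
          "burrito cat dog hot-dog mosquito libido elephant room")
print(generator(prompt, max_new_tokens=32, return_full_text=False)[0]["generated_text"])
```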

jcuenod 6 days ago | parent | prev | next [-]

So I had a similar experience with your prompt (on the f16 model). But I do think that, at this size, prompting differences make a bigger impact. I had this experience trying to get it to list entities. It kept trying to give me a bulleted list and I was trying to coerce it into some sort of structured output. When I finally just said "give me a bulleted list and nothing else" the success rate went from around 0-0.1 to 0.8+.

In this case, I changed the prompt to:

---

Tallest mountains (in order):

```
- Mount Everest
- Mount K2
- Mount Sahel
- Mount Fuji
- Mount McKinley
```

What is the second tallest mountain?

---

Suddenly, it got the answer right 95+% of the time.

oppositeinvct 5 days ago | parent [-]

Still pretty sad that it's only 95% instead of 99%.

aldousd666 6 days ago | parent | prev | next [-]

It's an instruction following model, not a micro-wikipedia. It's not meant to answer factual questions nor even be general purpose. It's meant to follow instructions and be easy to fine-tune for your own specific use case.

zild3d 6 days ago | parent | prev | next [-]

> These smaller models are great, but they really feel like talking to a toddler sometimes!

You're using the toddler and the model wrong. I love talking to my toddler - probably more valuable conversations than I've had with any other person. But it's not the same use case as asking a professor a question in their field.

> Gemma 3 270M embodies this "right tool for the job" philosophy. It's a high-quality foundation model that follows instructions well out of the box, and its true power is unlocked through fine-tuning. Once specialized, it can execute tasks like text classification and data extraction with remarkable accuracy, speed, and cost-effectiveness.

hadlock 6 days ago | parent | prev | next [-]

>give me a two day itenerary for paris

Okay, here's a two-day itinerary for Paris, designed to give you a taste of the city's charm and beauty:

Day 1: Iconic Landmarks and Artistic Exploration

Morning (9:00 AM - 12:00 PM): Start your day with a visit to the Eiffel Tower. Take the elevator to the top for breathtaking panoramic views. Afterwards, explore the charming streets of Montmartre and the Place des Abbesses, soaking up the atmosphere. Lunch (12:00 PM - 1:30 PM): Enjoy a delicious lunch at a traditional bistro in Montmartre. Try a classic French meal like steak frites or a hearty onion soup. Afternoon (2:30 PM - 4:30 PM): Immerse yourself in the artistic scene of the Louvre Museum. Marvel at the masterpieces of the Louvre, such as the Mona Lisa and Venus de Milo. Afterwards, explore the serene gardens of Tuileries Palace. Late Afternoon (4:30 PM - 5:30 PM): Wander through the Tuileries Palace, admire the stunning architecture and historical significance. Day 2: History, Food, and a Touch of Parisian Elegance

Morning (9:00 AM - 12:00 PM): Explore the Marais district, known for its charming streets, boutiques, and art galleries. Visit the Place de la Concorde, a historical square, and the Musée Rodin. Lunch (12:00 PM - 1:30 PM): Enjoy a delicious lunch in the Marais. Try a traditional French bistro or a cafe with a view. Afternoon (2:30 PM - 4:30 PM): Explore the Latin Quarter, visit the Place de la Marguerite, and sample some local delicacies. Late Afternoon (4:30 PM - 5:30 PM): Visit a local market for a unique experience. Sample fresh produce, local cheeses, and other delicacies. Tips for a Smooth Itinerary:

Book your flights and accommodation in advance, especially during peak season. Wear comfortable shoes, as you'll be doing a lot of walking. Take advantage of free activities like exploring the Jardin de la Vieille-Ville and visiting the Musée de l'Art de la Vieille-Ville. Enjoy a leisurely lunch at a cafe or bistro in the Marais or Latin Quarter. Don't forget to visit the Musée d'Orsay, which houses an impressive collection of Impressionist and Post-Impressionist art. Enjoy the beautiful atmosphere of Paris!

zozbot234 6 days ago | parent | prev | next [-]

> Mount McKinley

Nice to see that the model is so up-to-date wrt. naming mountains.

greesil 6 days ago | parent [-]

Denali isn't just a river in Egypt.

skybrian 6 days ago | parent | prev | next [-]

That’s pretty amusing, but continuing after an error is not worth it. You’re just polluting the context. It’s not going to learn.

hnuser123456 6 days ago | parent | prev | next [-]

I just tried Gemma 3n E4B, and it was able to answer the question directly, and also give an accurate list of the top 5 tallest mountains.

dheera 6 days ago | parent | prev | next [-]

The second tallest mountain is Everest. The tallest is Mauna Kea, it's just that most of it is underwater.

parineum 6 days ago | parent [-]

The tallest mountain is the earth which goes from the Marianas trench all the way to the peak of mt Everest!

bogzz 6 days ago | parent | prev | next [-]

But who's on third?

jordibc 6 days ago | parent [-]

I suspect you are rather referencing Abbott & Costello's "Who's on First?" :) https://www.youtube.com/watch?v=2ZksQd2fC6Y

bicepjai 6 days ago | parent | prev | next [-]

This is standup material. Had a hearty laugh, thanks.

mvdtnz 6 days ago | parent | prev | next [-]

> These smaller models are great

Are they?

sleepy_keita 6 days ago | parent | prev | next [-]

Simulated a RAG, and it still gets it pretty wrong.

> The tallest mountains on Earth, in descending order: 1. Mount Everest, 8849m 2. K2, 8611m 3. Kangchenjunga, 8586m

> Tell me the second tallest mountain on Earth.

The second tallest mountain on Earth is *Mount Everest*.

rohansood15 6 days ago | parent | prev [-]

This is why we should have a downvote button on HN.

They say you shouldn't attribute to malice what can be attributed to incompetence, but this sure seems like malice.

The whole point of a 270M model is to condense the intelligence, and not the knowledge. Of course it doesn't fare well on a quiz.

jefb 6 days ago | parent [-]

I meant it when I said these smaller models are great. They open up entirely new use cases and I appreciate the work that went into creating them.

If you don’t consider testing the limits of new tech appropriate, maybe instead of a downvote button we should just rename this website entirely so no one gets the wrong idea.