| ▲ | Gemini 3(blog.google) |
| 679 points by preek 5 hours ago | 373 comments |
| https://blog.google/technology/developers/gemini-3-developer... https://aistudio.google.com/prompts/new_chat?model=gemini-3-... |
|
| ▲ | lairv 2 hours ago | parent | next [-] |
Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely outside the training data. Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says the three fastest humans to solve this problem took 14min, 20min and 1h14min respectively. Even though I expect this sort of problem to be very much in the distribution of what the model has been RL-tuned on, it's wild that a frontier model can now solve in minutes what would take me days
| |
| ▲ | thomasahle 2 hours ago | parent | next [-] | | I also used Gemini 3 Pro Preview. It finished in 271s = 4m31s. Sadly, the answer was wrong. It also returned 8 "sources", like stackexchange.com, youtube.com, mpmath.org, ncert.nic.in, and kangaroo.org.pk, even though I specifically told it not to use web search. Still a useful tool, though; it definitely gets the majority of the insights. Prompt: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... | | |
| ▲ | JBiserkov an hour ago | parent [-] | | The requested prompt does not exist or you do not have access. If you believe the request is correct, make sure you have first allowed AI Studio access to your Google Drive, and then ask the owner to share the prompt with you. | | |
| ▲ | junon an hour ago | parent [-] | | I thought this was a joke at first. It actually needs drive access to run someone else's prompt. Wild. | | |
| ▲ | ashdksnndck an hour ago | parent | next [-] | | On iOS safari, it just says “Allow access to Google Drive to load this Prompt”. When I run into that UI, my first instinct is that the poster of the link is trying to phish me. That they’ve composed some kind of script that wants to read my Google Drive so it can send info back to them. I’m only going to click “allow” if I trust the sender with my data. IMO, if that’s not what is happening, this is awful product design. | |
| ▲ | dormento 29 minutes ago | parent | prev [-] | | Imagine the metrics though. "this quarter we've had a 12% increase on people using AI solutions in their google drive". |
|
|
| |
| ▲ | qsort 2 hours ago | parent | prev | next [-] | | To be fair, a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time. But seeing these results, I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess: effectively ground truth, and often coming up with solutions that would be absolutely ridiculous to expect a human to find within a reasonable time limit. | | |
| ▲ | vjerancrnjak 42 minutes ago | parent | next [-] | | How are they faster? I don’t think any Elo report actually comes from participating in a live coding contest on previously unseen problems. | | |
| ▲ | qsort 33 minutes ago | parent [-] | | My background is more in math competitions, but all of those things are essentially speed contests: the skill comes from solving hard problems within a strict time limit. If you gave people twice the time, they'd do better, but time is never going to be an issue for a computer. Comparing raw Elo ratings isn't very indicative IMHO, but I do find it plausible that in closed, game-like environments models could indeed achieve the superhuman performance the Elo comparison implies; see my other comment in this thread. |
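(For readers comparing raw ratings: under the standard Elo model, a player's expected score is a logistic function of the rating gap, with a 400-point gap corresponding to roughly 10:1 odds. A generic sketch, not tied to any particular contest site's variant:)

```python
def elo_expected_score(r_a, r_b):
    """Expected score (roughly, win probability) of player A against
    player B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Equal ratings are a coin flip; a 400-point gap is about 91%.
print(elo_expected_score(1500, 1500))  # 0.5
print(elo_expected_score(1900, 1500))  # ~0.909
```

This is why "superhuman" Elo claims compound quickly: each additional few hundred points of gap pushes the expected score against any fixed human opponent toward 1.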
| |
| ▲ | nerdsniper an hour ago | parent | prev [-] | | I’d love if anyone could provide examples of such AND(“ground truth”, “absolutely ridiculous”) solutions! Even if they took clever humans a long time to create. I’m curious to explore such fun programming code. But I’m also curious to explore what knowledgeable humans consider to be both “ground truth” as well as “absolutely ridiculous” to create within the usual time constraints. | | |
| ▲ | qsort an hour ago | parent [-] | | I'm not explaining myself right. Stockfish is a superhuman chess program. It's routinely used in chess analysis as "ground truth": if Stockfish says you've made a mistake, it's almost certain you did in fact make a mistake[0]. Also, because it's incomparably stronger than even the very best humans, sometimes the moves it suggests are extremely counterintuitive and it would be unrealistic to expect a human to find them in tournament conditions. Obviously software development in general is way more open-ended, but if we restrict ourselves to puzzles and competitions, which are closed game-like environments, it seems plausible to me that a similar skill level could be achieved with an agent system that's RL'd to death on that task. If you have base models that can get there, even inconsistently so, and an environment where making a lot of attempts is cheap, that's the kind of setup that RL can optimize to the moon and beyond. I don't predict the future and I'm very skeptical of anybody who claims to do so, correctly predicting the present is already hard enough, I'm just saying that given the progress we've already made I would find plausible that a system like that could be made in a few years. The details of what it would look like are beyond my pay grade. --- [0] With caveats in endgames, closed positions and whatnot, I'm using it as an example. | | |
| ▲ | pclmulqdq 42 minutes ago | parent [-] | | Yeah, it is often pointed out as a brilliance in game analysis if a GM makes a move that an engine says is bad and turns out to be good. However, it only happens in very specific positions. | | |
| ▲ | emodendroket 36 minutes ago | parent [-] | | Does that happen because the player understands some tendency of their opponent that will cause them to not play optimally? Or is it genuinely some flaw in the machine’s analysis? | | |
|
|
|
| |
| ▲ | rbjorklin 12 minutes ago | parent | prev | next [-] | | Your post made me curious to try a problem I have been coming back to ever since ChatGPT was first released: https://open.kattis.com/problems/low I have had no success using LLMs to solve this particular problem until trying Gemini 3 just now, despite solutions to it existing in the training data. This has been my personal litmus test for LLM programming capabilities, and a model finally passed. | |
| ▲ | sedatk 28 minutes ago | parent | prev | next [-] | | Just to clarify the context for future readers: the latest problem at the moment is #970: https://projecteuler.net/problem=970 | |
| ▲ | thomasahle 2 hours ago | parent | prev | next [-] | | I tried it with gpt-5.1 thinking, and it just searched and found a solution online :p | | |
| ▲ | lairv 2 hours ago | parent [-] | | Is there a solution online to this exact problem, or just to related notions (renewal equations etc.)? Anyway, it seems like nothing beats training on the test set |
| |
| ▲ | id 36 minutes ago | parent | prev | next [-] | | gpt-5.1 gave me the correct answer after 2m 17s. That includes retrieving the Euler website. I didn't even have to run the Python script, it also did that. | |
| ▲ | j2kun 28 minutes ago | parent | prev | next [-] | | Did it search the web? | |
| ▲ | orly01 2 hours ago | parent | prev [-] | | Wow. Sounds pretty impressive. |
|
|
| ▲ | dwringer 3 hours ago | parent | prev | next [-] |
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day, in a thread about AI-coded analog clock faces. Gemini 3 Pro Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example. > Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face. [0] https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%... |
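(For anyone poking at the same prompt: the arithmetic the model has to get right for hand alignment is small, which is part of why it's a nice constrained test. A sketch of the hand-angle math, with the function name my own, not from the shared app:)

```python
def hand_angles(h, m, s):
    """Clock-hand angles in degrees, measured clockwise from 12 o'clock.
    The hour and minute hands creep continuously rather than jumping."""
    second = s * 6.0                                   # 360 deg / 60 s
    minute = (m + s / 60.0) * 6.0                      # 360 deg / 60 min
    hour = ((h % 12) + m / 60.0 + s / 3600.0) * 30.0   # 360 deg / 12 h
    return hour, minute, second

print(hand_angles(3, 0, 0))   # (90.0, 0.0, 0.0)
print(hand_angles(6, 30, 0))  # (195.0, 180.0, 0.0)
```

Getting the hour hand offset (the `m / 60.0` term) wrong is the classic failure mode in these clock-face tests: without it the hour hand points straight at the hour mark even at half past.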
| |
| ▲ | stalfie 2 hours ago | parent | next [-] | | The subtle "wiggle" animation that the second hand makes after moving doesn't fire when it hits 12. Literally unwatchable. | |
| ▲ | farazbabar 2 hours ago | parent | prev | next [-] | | https://ai.studio/apps/drive/1yAxMpwtD66vD5PdnOyISiTS2qFAyq1... <- this is very nice. I was able to make the seconds smooth in three iterations (it used SVG initially, which was jittery, but eventually produced this). | |
| ▲ | xnx 2 hours ago | parent | prev | next [-] | | This is cool. Gemini 2.5 Pro was also capable of this. Gemini was able to recreate famous piece of clock artwork in July: https://gemini.google.com/app/93087f373bd07ca2 "Against the Run": https://www.youtube.com/watch?v=7xfvPqTDOXo | |
| ▲ | skybrian 2 hours ago | parent | prev | next [-] | | It looks quite nice, though to nitpick, it has “quartz” and “design & engineering” for no reason. | | | |
| ▲ | thegrim33 3 hours ago | parent | prev | next [-] | | "Allow access to Google Drive to load this Prompt." .... why? For what possible reason? No, I'm not going to give access to my privately stored file share in order to view a prompt someone has shared. Come on, Google. | | |
| ▲ | LiamPowell 2 hours ago | parent [-] | | You don't want to give Google access to files you've stored in Google Drive? It's also only access to an application specific folder, not all files. | | |
| ▲ | tibbar 44 minutes ago | parent [-] | | Well, you also have to allow it to train on your data. Although this is not explicitly about your Google Drive data, and probably requires you to submit a prompt yourself, the barriers here are way too weak/fuzzy for me to consider granting access via any account with private info. |
|
| |
| ▲ | pmarreck 2 hours ago | parent | prev | next [-] | | holy shit! This is actually a VERY NICE clock! | |
| ▲ | dyauspitr 3 hours ago | parent | prev [-] | | Having seen the page the other day this is pretty incredible. Does this have the same 2000 token limit as the other page? | | |
| ▲ | dwringer 2 hours ago | parent [-] | | This isn't using the same prompt or stack as the page from that post the other day; on aistudio it builds a web app across a few different files. It's still fairly concise but I don't think it's that much so. |
|
|
|
| ▲ | bnchrch 4 hours ago | parent | prev | next [-] |
I've been so happy to see Google wake up. Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry. - Gmail vs Outlook - Drive vs Word - Android vs iOS - Work-life balance and high pay vs the low-salary grind of before. They've done heaps for the industry. I'm glad to see signs of life. Particularly in their P/E, which was unjustly low for a while. |
| |
| ▲ | ThrowawayR2 3 hours ago | parent | next [-] | | They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Much of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this. | |
| ▲ | bitpush an hour ago | parent | next [-] | | This is an understandable, but simplistic, way of looking at the world. Are you also going to blame Apple for mining rare earths, because they made a successful product that requires exotic materials which need to be mined from the earth? How about the hundreds of thousands of factory workers subjected to inhumane conditions to assemble iPhones each year? For every "OMG, the internet is filled with ads", people conveniently forget the real-world impact of ALL COMPANIES (not just Apple), btw. You should be upset with the system, not selectively with Google. | |
| ▲ | fractalf an hour ago | parent | next [-] | | I don't think your comment justifies calling out any form of simplistic view; it doesn't make sense. All the big players are bad. They're companies; their one and only purpose is to make money, and they will do whatever it takes to do it, most of which does not serve humankind. | |
| ▲ | dieggsy 41 minutes ago | parent | prev | next [-] | | It seems okay to me to be upset with the system and also point out the specific wrongs of companies in the right context. I actually think that's probably most effective. The person above specifically singled out Google as a reply to a comment praising the company, which seems reasonable enough. I guess you could get into whether it's a proportional response; the praise wasn't that high and also exists within the context of the system as you point out. Still, their reply doesn't necessarily indicate that they're not upset with all companies or the system. | |
| ▲ | observationist 41 minutes ago | parent | prev [-] | | Yes, we're absolutely holding Apple accountable for outsourcing jobs, degrading the US markets, using slave and child labor, laundering cobalt from illegal "artisanal" mines in the DRC, and whitewashing what they do by using corporate layering and shady deals to put themselves at sufficient degrees of separation from problematic labor and sources to do good PR, but not actually decoupling at all. I also hold Americans and Western consumers responsible for simply allowing that to happen. As long as the human rights abuses and corruption are 3 or 4 degrees of separation from the retailer, people seem to be perfectly OK with chattel slavery and child labor and indentured servitude and all the human suffering that sits at the base of all our wonderful technology and cheap consumer goods. If we want to have things like minimum wage and workers' rights and environmental protections, then we should mandate adherence to those standards globally. If you want to sell products in the US, the entire supply chain has to conform to US labor and manufacturing and environmental standards. If those standards aren't practical, then they should be tossed out; the US shouldn't be doing performative virtue signalling as law, incentivizing companies to outsource and engage in race-to-the-bottom exploitation of labor and resources in other countries. We should also have tariffs and import/export taxes that allow competitive free trade. It's insane that it's cheaper to ship raw materials for a car to a country in southeast Asia, have it refined and manufactured into a car, and then ship it back into the US, than to simply have it mined, refined, and manufactured locally. The ethics and economics of America are fucking dumb, but it's the mega-corps, donor class, and uniparty establishment politicians that keep it that way.
Apple and Google are inhuman, autonomous entities that have effectively escaped the control and direction of any given human decision tree. Any CEO or person in power who tried to significantly reform the ethics or economics internally would be ousted and memory-holed faster than you can light a cigar with a hundred-dollar bill. We need term limits, no more corporate personhood, money out of politics, and an overhaul, or we're going to be doing the same old kabuki show right up until the collapse or AI takeover. And yeah, you can single out Google for their misdeeds. They, in particular, are responsible for the adtech surveillance ecosystem and the lack of any viable alternatives, by way of their constant campaign of enshittification of everything, quashing competition, and giving NGOs, intelligence agencies, and government departments access to the controls of censorship and suppression of political opposition. I haven't and won't use Google AI for anything, ever, because of all the big labs, they are the most likely and best positioned to engage in the worst and most damaging abuse possible, be it manipulation, invasion of privacy, or casual violation of civil rights at the behest of bureaucratic tyrants. If it's not illegal, they'll do it. If it's illegal, they'll only do it if it doesn't cost more than they can profit. If they profit, even after getting caught and fined and taking a PR hit, they'll do it, because "number go up" is the only meaningful metric. The only way out is principled regulation, a digital bill of rights, and campaign finance reform. There's probably no way out. |
| |
| ▲ | ApolloFortyNine 7 minutes ago | parent | prev | next [-] | | People love getting their content for free, and that's what Google delivers. Even 25 years ago, people wouldn't have believed YouTube could exist. Anyone can upload whatever they want, however often they want; YouTube is responsible for promoting it; they'll serve it to however many billions of users want to view it; and they'll pay you 55% of the revenue it makes? | |
| ▲ | starchild3001 5 minutes ago | parent | prev | next [-] | | What kind of world do you live in? Google ads actually tend to have some of the highest ROI for the advertiser and are the most likely to be beneficial for the user, vs the pure junk ads that aren't personalized and banner ads that have zero relationship to me. Google Ads is the enabler of the free internet. I for one am thankful to them. Otherwise you end up paying for the NYT, Washington Post, The Information, etc., virtually any high-quality website. | |
| ▲ | visarga 3 hours ago | parent | prev | next [-] | | Yes, this is correct, and it happens everywhere. App Store, Play Store, YouTube, Meta, X, Amazon and even Uber: they all play in two-sided markets, exploiting both their users and their providers at the same time. | |
| ▲ | kryogen1c an hour ago | parent | prev | next [-] | | > They've poisoned the internet And what of the people that ravenously support ads and ad-supported content instead of paying? What of the consumptive public? Are they not responsible for their choices? I do not consume algorithmic content, and I do not have any social media (unless you count HN as either). You can't have it both ways. Lead by example, stop using the poison, and find friends who aren't addicted. Build an offline community. | |
| ▲ | xordon 21 minutes ago | parent [-] | | I don't understand your logic, it seems like victim blaming. Using the internet and pointing out that targeted advertising has a negative effect on society is not "having it both ways". Also, HN is by definition algorithmic content and social media, in your mind what do you think it is? |
| |
| ▲ | nwienert an hour ago | parent | prev | next [-] | | Add suppressed wages, from colluding with Apple not to poach. | |
| ▲ | notepad0x90 an hour ago | parent | prev [-] | | They're not a moral entity. Corporations aren't people. I think a lot of the harms you mentioned are real, but they're a natural consequence of capitalistic profit chasing. Governments are supposed to regulate monopolies and anti-consumer behavior like that. Instead of regulating surveillance capitalism, governments are using it to bypass laws restricting their power. If I were a Google investor, I would absolutely want them to defeat ad-blocking, ban yt-dlp, dominate the ad market and all the rest of what you said. In capitalism, everyone looks out for their own interests, and governments ensure the public isn't harmed in the process. But any time a government tries to regulate things, the same crowd that decries these harms opposes government overreach. Voters are people, and they are moral entities; direct any moral outrage at us. | |
| ▲ | layer8 an hour ago | parent | next [-] | | Why should the collective of voters be any more of a moral entity than the collective of people who make up a corporation (which you may include its shareholders in if you want)? It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment. | | |
| ▲ | notepad0x90 18 minutes ago | parent | next [-] | | > Why should the collective of voters.. They're accountable as individuals, not as a collective. And it so happens they are responsible for their government in a democracy, but corporations aren't responsible for running countries. > It’s perfectly valid to criticize corporations for their actions, regardless of the regulatory environment. In the free speech sense, sure. But your criticism isn't founded on solid ground. You should expect corporations to do whatever they have to do within the bounds of the law to turn a profit. Their responsibility is to their investors and employees; they have no responsibility to the general public beyond that which is laid out in the law. The increasing demand that corporations be part of the public/social moral consciousness is causing them to manipulate politics more and more, eroding what little voice individuals have. You're trying to live in a feudal society when you treat corporations like this. If you're unhappy with the quality of Google's services, don't do business with them. If they broke the law, they should pay for it. But expecting them to be a beacon of morality is accepting that they have a role in society and government beyond mere revenue-generating machines. And if you expect them to have that role, then you're also giving them the right to enforce that expectation as a matter of corporate policy instead of law. Corporate policies then become as powerful as law, and corporations have to interfere with matters of government policy on the basis of morality instead of business, so you now have an organization with lots of money and resources competing with individual voters. And then people have the nerve to complain about PACs, money in politics, billionaires influencing the government, bribery, etc. You can't have it both ways. Either we have a country run partly by corporations, and a society driven and controlled by them, or we don't. |
| ▲ | svnt an hour ago | parent | prev [-] | | Because of the inherent capitalism structure that leads to the inevitable: the tragedy of the commons. |
| |
| ▲ | ThrowawayR2 an hour ago | parent | prev [-] | | Why are you directing the statement that "[Corporations are] not a moral entity" at me instead of the parent poster claiming that "[Google has] been the great balancing force (often for good) in the industry."? Saying that Google is a force "for good" is a claim by them that corporations can be moral entities; I agree with you that they aren't. | | |
| ▲ | notepad0x90 15 minutes ago | parent [-] | | I could have, I suppose, but their comment was about Google being a balancing force in terms of competition and monopoly; it wasn't praise of their moral character. They did what was best for their business, and that turned out to be good for reducing monopolies. If it turned out to be monopolistic, I would be wondering what Congress and the DOJ are doing about it, instead of criticizing Google for trying to turn a profit. |
|
|
| |
| ▲ | epolanski an hour ago | parent | prev | next [-] | | Outlook is much better than Gmail and so is the office suite. It's good there's competition in the space though. | | |
| ▲ | brailsafe 18 minutes ago | parent | next [-] | | Outlook is not better in ways that email or gmail users necessarily care about, and in my experience gets in the way more than it helps with productivity or anything it tries to be good at. I've used it in office settings because it's the default, but never in my life have I considered using it by choice. If it's better, it might not matter. | |
| ▲ | vanillax 12 minutes ago | parent | prev [-] | | I couldn't disagree more |
| |
| ▲ | digbybk 4 hours ago | parent | prev | next [-] | | Ironically, OpenAI was conceived as a way to balance Google's dominance in AI. | | |
| ▲ | kccqzy 25 minutes ago | parent | next [-] | | Balance is too weak of a word. OpenAI was conceived specifically to prevent Google from getting AGI first. That was its original goal. Musk was at that time very worried about AGI being developed behind closed doors, which was why he was the driving force behind the founding of OpenAI. | |
| ▲ | CobrastanJorji 2 hours ago | parent | prev | next [-] | | Pffft. OpenAI was conceived to be Open, too. | | |
| ▲ | lemoncucumber 2 hours ago | parent [-] | | It’s a common pattern for upstarts to embrace openness as a way to differentiate and gain a foothold, then become progressively less open once they get bigger. Android is a great example. | |
| ▲ | bitpush 2 hours ago | parent [-] | | Last I checked, Android is still open source (as AOSP) and people can do whatever-the-f-they-want with the source code. Are we defining open differently? | | |
| ▲ | lemoncucumber 29 minutes ago | parent | next [-] | | I think we're defining "less" differently. You're interpreting "less open" to mean "not open at all," which is not what I said. There's a long history of Google slowly making the experience worse if you want to take advantage of the things that make Android open. For example, by moving features that were in the AOSP into their proprietary Play Services instead [1]. Or coming soon, preventing sideloading of unverified apps if you're using a Google build of Android [2]. In both cases, it's forcing you to accept tradeoffs between functionality and openness that you didn't have to accept before. You can still use AOSP, but it's a second class experience. [1] https://arstechnica.com/gadgets/2018/07/googles-iron-grip-on... [2] https://arstechnica.com/gadgets/2025/08/google-will-block-si... | |
| ▲ | ipaddr an hour ago | parent | prev [-] | | Core is open source but for a device to be "Android compatible" and access the Google Play Store and other Google services, it must meet specific requirements from Google's Android Compatibility Program. These additional proprietary components are what make the final product closed source. The Android Open Source Project is not Android. |
|
|
| |
| ▲ | dragonwriter 3 hours ago | parent | prev [-] | | I thought it was a workaround to Google's complete disinterest in productizing the AI research it was doing and publishing, rather than a way to balance their dominance in a market which didn't meaningfully exist. | | |
| ▲ | mattnewton 3 hours ago | parent | next [-] | | That’s how it turned out, but IIRC at the time of OpenAI’s founding, “AI” was search and RL, which Google and DeepMind were dominating, and self-driving, which Waymo was leading. And OpenAI was conceptualized as a research org to compete. A lot has changed, and OpenAI has been good at seeing around those corners. | |
| ▲ | jonny_eh 42 minutes ago | parent | prev | next [-] | | That was actually Character.ai's founding story. Two researchers at Google that were frustrated by a lack of resources and the inability to launch an LLM based chatbot. The founders are now back at Google. OpenAI was founded based on fears that Google would completely own AI in the future. | |
| ▲ | jpadkins 2 hours ago | parent | prev [-] | | Elon Musk specifically gave OAI $150M early on because of the risk of Google being the only Corp that has AGI or super-intelligence. These emails were part of the record in the lawsuit. |
|
| |
| ▲ | redbell 2 hours ago | parent | prev | next [-] | | > Drive vs Word You mean Drive vs OneDrive or, maybe Docs vs Word? | | | |
| ▲ | storus 2 hours ago | parent | prev | next [-] | | If you consider surveillance capitalism and dark pattern nudges a good thing, then sure. Gemini has the potential to obliterate their current business model completely so I wouldn't consider that "waking up". | |
| ▲ | drewda 3 hours ago | parent | prev | next [-] | | For what it's worth, most of those examples are acquisitions. That's not a hit against Google in particular. That's the way all big tech co's grow. But it's not necessarily representative of "innovation." | | |
| ▲ | charcircuit 3 hours ago | parent [-] | | >most of those examples are acquisitions Taking those products from where they were to the juggernauts they are today was not guaranteed to succeed, nor was it easy. And yes, plenty of innovation happened with these products post-acquisition. | |
| ▲ | hvb2 an hour ago | parent [-] | | But there's also plenty that fail, it's just that you won't know about those. I don't think what you're saying proves that the companies that were acquired couldn't have done that themselves. |
|
| |
| ▲ | 63stack 4 hours ago | parent | prev | next [-] | | - Making money vs general computing | |
| ▲ | kevstev 2 hours ago | parent | prev | next [-] | | All those examples date back to the 2000s. Android has seen some significant improvements, but everything else has stagnated, if not enshittified (remember when Google told us never to worry about deleting anything? Then they started backing up my photos without me asking, and are now constantly nagging me to pay them a monthly fee). They have done a lot, but most of it was in the "don't be evil" days, and those are a fading memory. |
| ▲ | qweiopqweiop 3 hours ago | parent | prev | next [-] | | Forgot to mention absolutely milking every ounce of their users attention with Youtube, plus forcing Shorts! | | |
| ▲ | bitpush an hour ago | parent | next [-] | | Why stop at YouTube? Blame Apple for creating an addictive gadget that has single-handedly wasted billions of hours of collective human intelligence. Life was so much better before iPhones. But I hear you say: you can use iPhones for productive things and not just mindless brainrot. And that's the same with YouTube. Many waste time on YouTube, but many learn and do productive things. Don't paint everything with a single, large, coarse brush stroke. | |
| ▲ | polotics 2 hours ago | parent | prev [-] | | frankly when compared against TikTok, Insta, etc, YouTube is a force for good. Just script the shorts away... |
| |
| ▲ | IlikeKitties 2 hours ago | parent | prev | next [-] | | Something about bringing balance to the Force, not destroying it. | |
| ▲ | samdoesnothing 32 minutes ago | parent | prev | next [-] | | Seriously? Google is an incredibly evil company whose net contribution to society is probably only barely positive thanks to their original product (search). Since completely de-googling I've felt a lot better about myself. | |
| ▲ | rvz 3 hours ago | parent | prev | next [-] | | Google has always been there; it's just that many didn't realize DeepMind even existed, and I said years ago that it needed to be put to commercial use. [0] And Google AI != DeepMind. You are now seeing their valuation finally adjusting to that fact, all thanks to DeepMind finally being put to use. [0] https://news.ycombinator.com/item?id=34713073 | |
| ▲ | stephc_int13 2 hours ago | parent | prev [-] | | Google is using the typical monopoly playbook, like most other large orgs, and the world would be a "better place" if they were kept in check. But at least this company is not run by a narcissistic sociopath. |
|
|
| ▲ | simonw 31 minutes ago | parent | prev | next [-] |
| Here are my notes and pelican benchmark, including a new, harder benchmark because the old one was getting too easy: https://simonwillison.net/2025/Nov/18/gemini-3/ |
|
| ▲ | prodigycorp 4 hours ago | parent | prev | next [-] |
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic Python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong). For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser Claude models fail). This is a reminder that benchmarks are meaningless: you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start to make extrapolations about society and what this means for others. Meanwhile, I'm still wondering how they're still getting this problem wrong. edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark. |
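(If you want to keep a private out-of-sample benchmark like this, the harness can be tiny. Everything below is hypothetical scaffolding I'm sketching, not the parent's actual test; `model_fn` stands in for whatever calls your model of choice:)

```python
def run_benchmark(model_fn, cases):
    """Run a personal benchmark. `cases` is a list of (prompt, check)
    pairs where check(answer) -> bool. Returns (passed, total, results)
    so you can track pass rates across model releases."""
    results = [(prompt, check(model_fn(prompt))) for prompt, check in cases]
    passed = sum(ok for _, ok in results)
    return passed, len(results), results

# Toy usage with a fake "model" standing in for a real API call:
fake_model = lambda prompt: "4" if "2+2" in prompt else "dunno"
cases = [("What is 2+2?", lambda a: a.strip() == "4"),
         ("Prove P=NP", lambda a: a != "dunno")]
print(run_benchmark(fake_model, cases)[:2])  # (1, 2)
```

The key property is the `check` function: grading on a verifiable predicate rather than eyeballing output is what keeps the benchmark honest as models change.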
| |
| ▲ | WhitneyLand 4 hours ago | parent | next [-] | | >>benchmarks are meaningless No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case. >>my fairly basic python benchmark I suspect your definition of “basic” may not be consensus. Gpt-5 thinking is a strong model for basic coding and it’d be interesting to see a simple python task it reliably fails at. | | |
| ▲ | NaomiLehman 3 hours ago | parent [-] | | they are not meaningless, but when you work a lot with LLMs and know them VERY well, then a few varied, complex prompts tell you all you need to know about things like EQ, sycophancy, and creative writing. I like to compare them side by side in chathub with the same prompts. Gemini still calls me "the architect" in half of the prompts. It's very cringe. |
| |
| ▲ | dekhn 3 hours ago | parent | prev | next [-] | | Using a single custom benchmark as a metric seems pretty unreliable to me. Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion. | | |
| ▲ | prodigycorp an hour ago | parent [-] | | After taking a walk for a bit I decided you're right. I came to the wrong conclusion. Gemini 3 is incredibly powerful in some other stuff I've run. This probably means my test is a little too niche. The fact that it didn't pass one of my tests doesn't speak to the broader intelligence of the model per se. While I still believe in the importance of a personalized suite of benchmarks, my python one needs to be down-weighted or supplanted. My bad to the Google team for the cursory brush-off. |
| |
| ▲ | sosodev 4 hours ago | parent | prev | next [-] | | How can you be sure that your benchmark is meaningful and well designed? Is the only thing that prevents a benchmark from being meaningful publicity? | | |
| ▲ | prodigycorp 4 hours ago | parent [-] | | I didn't tell you what you should think about the model. All I said is that you should have your own benchmark. I think my benchmark is well designed. It's well designed because it's a generalization of a problem I've consistently had with LLMs on my code. Insofar that it encapsulates my coding preferences and communication style, that's the proper benchmark for me. | | |
| ▲ | gregsadetsky 3 hours ago | parent [-] | | I asked a semi related question in a different thread [0] -- is the basic idea behind your benchmark that you specifically keep it secret to use it as an "actually real" test that was definitely withheld from training new LLMs? I've been thinking about making/publishing a new eval - if it's not public, presumably LLMs would never get better at them. But is your fear that generally speaking, LLMs tend to (I don't want to say cheat but) overfit on known problems, but then do (generally speaking) poorly on anything they haven't seen? Thanks [0] https://news.ycombinator.com/item?id=45968665 | | |
| ▲ | adastra22 3 hours ago | parent [-] | | > if it's not public, presumably LLMs would never get better at them. Why? This is not obvious to me at all. | | |
| ▲ | gregsadetsky 3 hours ago | parent [-] | | You're correct of course - LLMs may get better at any task, but I meant that publishing the evals might (optimistically speaking) help LLMs get better at that specific task, if the eval was actually picked up and used in the training loop. | |
| ▲ | adastra22 3 hours ago | parent [-] | | That kind of “get better at” doesn’t generalize. It will regurgitate its training data, which now includes the exact answer being looked for. It will get better at answering that exact problem. But if you care about its fundamental reasoning and capability to solve new problems, or even just new instances of the same problem, then it is not obvious that publishing will improve this latter metric. Problem solving ability is largely not from the pretraining data. | | |
| ▲ | gregsadetsky 3 hours ago | parent [-] | | Yeah, great point. I was considering working on the ability to dynamically generate eval questions whose solutions would all involve problem solving (and a known, definitive answer). I guess that this would be more valuable than publishing a fixed number of problems with known solutions. (and I get your point that in the end it might not matter because it's still about problem solving, not just rote memorization) |
|
|
|
|
|
| |
| ▲ | thefourthchime 4 hours ago | parent | prev | next [-] | | I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5. | | |
| ▲ | bitexploder an hour ago | parent | next [-] | | Something else to consider. I often have much better success with something like: Create a prompt that creates a specification for a pacman game in a single html page. Consider edge cases and key implementation details that result in bugs. <take prompt>, execute prompt. It will often yield a much better result than one generic prompt. Now that models are trained on how to generate prompts for themselves this is quite productive. You can also ask it to implement everything in stages and implement tests, and even evaluate its tests! I know that isn't quite the same as "Implement pacman on an HTML page" but still, with very minimal human effort you can get the intended result. | |
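The two-stage flow described above is easy to wire up generically. A minimal sketch, where `call_model` is a hypothetical stand-in for any prompt-in/completion-out LLM call (not any specific vendor API):

```python
# Sketch of the two-stage "prompt that writes a spec prompt" flow.
# `call_model` is a hypothetical stand-in for any LLM chat call
# (prompt string in, completion string out).
def two_stage(call_model, task):
    meta_prompt = (
        "Create a specification-style prompt for the following task. "
        "Consider edge cases and key implementation details that commonly "
        f"result in bugs.\n\nTask: {task}"
    )
    spec = call_model(meta_prompt)  # stage 1: model writes the spec prompt
    return call_model(spec)         # stage 2: model executes its own spec
```

The same idea extends to the staged implementation and test passes mentioned above by chaining further calls.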
| ▲ | Workaccount2 2 hours ago | parent | prev | next [-] | | It made a working game for me (with a slightly expanded prompt), but the ghosts got trapped in the box after coming back from getting killed. A second prompt fixed it. The art and animation however was really impressive. | |
| ▲ | ofa0e 3 hours ago | parent | prev [-] | | Your benchmarks should not involve IP. | | |
| ▲ | sowbug 3 hours ago | parent | next [-] | | The only intellectual property here would be trademark. No copyright, no patent, no trade secret. Unless someone wants to market the test results as a genuine Pac-Man-branded product, or otherwise dilute that brand, there's nothing should-y about it. | | |
| ▲ | bongodongobob an hour ago | parent [-] | | It's not an ethics thing. It's a guardrails thing. | | |
| ▲ | sowbug 25 minutes ago | parent [-] | | That's a valid point, though an average LLM would certainly understand the difference between trademark and other forms of IP. I was responding to the earlier comment, whose author later clarified that it represented an ethical stance ("stealing the hard work of some honest, human souls"). |
|
| |
| ▲ | ComplexSystems 3 hours ago | parent | prev [-] | | Why? This seems like a reasonable task to benchmark on. | | |
| ▲ | adastra22 3 hours ago | parent | next [-] | | Because you hit guard rails. | |
| ▲ | ofa0e 3 hours ago | parent | prev [-] | | Sure, reasonable to benchmark on if your goal is to find out which companies are the best at stealing the hard work of some honest, human souls. | | |
|
|
| |
| ▲ | ddalex 4 hours ago | parent | prev | next [-] | | I moved from using the model for Python coding to Golang coding and got incredible speedups in reaching a correct version of the code | |
| ▲ | layer8 an hour ago | parent [-] | | Is observed speed meaningful for a model preview? Isn’t it likely to go down once usage goes up? |
| |
| ▲ | benterix 4 hours ago | parent | prev | next [-] | | > This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. Yeah I have my own set of tests and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 pro that was performing much better on the same tests several months ago vs. now. | | |
| ▲ | adastra22 3 hours ago | parent | next [-] | | I maintain a set of prompts and scripts for development using Claude Code. They are still all locked to using Sonnet 4 and Opus 4.1, because Sonnet 4.5 is flaming hot garbage. I’ve stopped trusting the benchmarks for anything. | |
| ▲ | Iulioh 4 hours ago | parent | prev [-] | | A lot of newer models are geared towards efficiency, and if you add the fact that more efficient models are trained on the output of less efficient (but more accurate) models.... GPT4/3o might be the best we will ever have |
| |
| ▲ | testartr 4 hours ago | parent | prev | next [-] | | and models are still pretty bad at playing tic-tac-toe, they can do it, but think way too much it's easy to focus on what they can't do | | |
| ▲ | big-and-small 3 hours ago | parent [-] | | Everything is about context. When you ask a non-concrete task, it still has to parse your input and figure out what tic-tac-toe means in this context and what exactly you expect it to do. That is what all the "thinking" is for. Ask it to implement tic-tac-toe in Python for the command line, or even just bring your own tic-tac-toe code. Then make it imagine playing against you and it's gonna be fast and reliable. |
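As a concrete version of the task suggested above, a minimal command-line tic-tac-toe in Python (the board layout and cell numbering here are my own choices) might look like:

```python
# Minimal command-line tic-tac-toe. The board is a flat list of 9 cells,
# numbered 0-8, each holding "X", "O", or " ".
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
         (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def render(board):
    rows = [" | ".join(board[i:i + 3]) for i in (0, 3, 6)]
    return "\n---------\n".join(rows)

def play():
    board, turn = [" "] * 9, "X"
    while winner(board) is None and " " in board:
        print(render(board))
        move = int(input(f"{turn}'s move (0-8): "))
        if board[move] == " ":
            board[move] = turn
            turn = "O" if turn == "X" else "X"
    print(render(board))
    print(f"Result: {winner(board) or 'draw'}")

# play()  # uncomment to play interactively
```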
| |
| ▲ | Filligree 4 hours ago | parent | prev | next [-] | | What's the benchmark? | | |
| ▲ | ahmedfromtunis 4 hours ago | parent | next [-] | | I don't think it would be a good idea to publish it on a prime source of training data. | | |
| ▲ | Hammershaft 4 hours ago | parent [-] | | He could post an encrypted version and post the key with it to avoid it being trained on? | | |
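A toy version of that idea, assuming a repeating-key XOR plus base64 (which is obfuscation rather than real encryption; use an established cipher like AES if it actually matters), could be:

```python
import base64
from itertools import cycle

def encode(text: str, key: str) -> str:
    # XOR each byte with the repeating key, then base64-encode the result
    # so the plaintext never appears verbatim in the post.
    data = bytes(b ^ k for b, k in zip(text.encode(), cycle(key.encode())))
    return base64.b64encode(data).decode()

def decode(blob: str, key: str) -> str:
    # Reverse the process: base64-decode, then XOR with the same key.
    data = base64.b64decode(blob)
    return bytes(b ^ k for b, k in zip(data, cycle(key.encode()))).decode()
```

The key would be posted alongside the blob, so a human can decode it trivially while the plaintext stays out of naive scrapes, though as the replies note, nothing stops a training pipeline from decoding it too.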
| ▲ | rs186 2 hours ago | parent | next [-] | | I wouldn't underestimate the intelligence of agentic AI, despite how stupid they are today. | |
| ▲ | benterix 4 hours ago | parent | prev [-] | | What makes you think it wouldn't end up in the training set anyway? |
|
| |
| ▲ | petters 4 hours ago | parent | prev | next [-] | | Good personal benchmarks should be kept secret :) | |
| ▲ | prodigycorp 4 hours ago | parent | prev [-] | | nice try! | | |
| ▲ | ankit219 an hour ago | parent [-] | | You already sent the prompt to the Gemini API, and they likely recorded it, so in a way they can access it anyway. Posting it here or not wouldn't matter in that respect. |
|
| |
| ▲ | luckydata 3 hours ago | parent | prev | next [-] | | I'm dying to know what you're giving it that it's choking on. It's actually really impressive if that's the case. | |
| ▲ | Rover222 4 hours ago | parent | prev | next [-] | | curious if you tried grok 4.1 too | |
| ▲ | mring33621 4 hours ago | parent | prev | next [-] | | I agree that benchmarks are noise. I guess, if you're selling an LLM wrapper, you'd care, but as a happy chat end-user, I just like to ask a new model about random stuff that I'm working on. That helps me decide if I like it or not. I just chatted with gemini-3-pro-preview about an idea I had and I'm glad that I did. I will definitely come back to it. IMHO, the current batch of free, free-ish models are all perfectly adequate for my uses, which are mostly coding, troubleshooting and learning/research. This is an amazing time to be alive and the AI bubble doomers that are costing me some gains RN can F-Off! | |
| ▲ | mupuff1234 4 hours ago | parent | prev | next [-] | | Could also just be rollout issues. | | | |
| ▲ | m00dy 4 hours ago | parent | prev [-] | | that's why everyone using AI for code should code in rust only. |
|
|
| ▲ | ttul 4 hours ago | parent | prev | next [-] |
| My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling. |
| |
| ▲ | rfw300 an hour ago | parent | next [-] | | My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3: - Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts - Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes. - Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication. Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways. | | |
| ▲ | ant6n a minute ago | parent [-] | | The worst is when it fails to read simple PDF documents and then lies and gaslights in an attempt to cover it up. Why not just admit it can't read the file? |
| |
| ▲ | satvikpendem 2 hours ago | parent | prev | next [-] | | I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker-based transcript, whereas I'm not sure that Google's models do the same; maybe they just hallucinate the speakers instead. | |
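The stitching step of that two-part pipeline can be sketched independently of any vendor. The data shapes below are my assumptions, not any particular API: the diarization pass returns speaker turns and the ASR pass returns word timestamps, both in seconds:

```python
# Sketch of merging a diarization pass onto a transcript pass.
def label_words(turns, words):
    """turns: [(start, end, speaker)], words: [(time, word)].
    Returns [(speaker, utterance)] with consecutive same-speaker
    words merged into one utterance."""
    def speaker_at(t):
        for start, end, spk in turns:
            if start <= t < end:
                return spk
        return "unknown"

    labeled, current = [], None
    for t, word in words:
        spk = speaker_at(t)
        if current and current[0] == spk:
            current[1].append(word)       # continue the current utterance
        else:
            current = (spk, [word])       # new speaker: start a new utterance
            labeled.append(current)
    return [(spk, " ".join(ws)) for spk, ws in labeled]
```

The labeled utterances can then be fed to the LLM for summarization, so the model never has to guess who said what.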
| ▲ | iagooar 4 hours ago | parent | prev | next [-] | | What prompt do you use for that? | | |
| ▲ | gregsadetsky 3 hours ago | parent | next [-] | | I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro. 3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript: [00:00] Greg: Hello.
[00:01] X: You great?
[00:02] Greg: Hi.
[00:03] X: I'm X.
[00:04] Y: I'm Y.
...
Super impressive! | | | |
| ▲ | punnerud 3 hours ago | parent | prev [-] | | I made a simple webpage to grab text from YouTube videos:
https://summynews.com
Great for this kind of testing?
(want to expand to other sources in the long run) |
| |
| ▲ | renegade-otter 3 hours ago | parent | prev | next [-] | | It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works. | |
| ▲ | valtism 3 hours ago | parent | prev [-] | | Parakeet TDT v3 would be really good at that |
|
|
| ▲ | bilekas 3 hours ago | parent | prev | next [-] |
| > The Gemini app surpasses 650 million users per month, more than 70% of our Cloud customers use our AI, 13 million developers have built with our generative models, and that is just a snippet of the impact we’re seeing Not to be a negative nelly, but these numbers are definitely inflated due to Google literally pushing their AI into everything they can, much like M$. Can't even search google without getting an AI response. Surely you can't claim those numbers are legit. |
| |
| ▲ | lalitmaganti 3 hours ago | parent | next [-] | | > Gemini app surpasses 650 million users per month Unless these numbers are just lies, I'm not sure how this is "pushing their AI into everything they can". Especially on iOS where every user is someone who went to the App Store and downloaded it. Admittedly on Android, Gemini is preinstalled these days, but it's still a choice that users are making to go there rather than being an existing product they happen to use otherwise. Now OTOH "AI overviews now have two billion users" can definitely be criticised in the way you suggest. | |
| ▲ | edaemon 3 hours ago | parent | next [-] | | I unlocked my phone the other day and had the entire screen taken over with an ad for the Gemini app. There was a big "Get Started" button that I almost accidentally clicked because it was where I was about to tap for something else. As an Android and Google Workspace user, I definitely feel like Google is "pushing their AI into everything they can", including the Gemini app. | |
| ▲ | mewpmewp2 14 minutes ago | parent | prev | next [-] | | I constantly hit some button by accident and Gemini opens up on my Samsung Galaxy. I haven't bothered to figure out which one. | |
| ▲ | aniforprez 3 hours ago | parent | prev | next [-] | | I don't know for sure but they have to be counting users like me whose phone has had Gemini force installed on an update and I've only opened the app by accident while trying to figure out how to invoke the old actually useful Assistant app | |
| ▲ | realusername 3 hours ago | parent | prev [-] | | > it's still a choice that users are making to go there rather than being an existing product they happen to user otherwise. Yes and no, my power button got remapped to opening Gemini in an update... I removed that but I can imagine that your average user doesn't. |
| |
| ▲ | Yizahi 3 hours ago | parent | prev | next [-] | | This is the benefit of bundling; I've been forecasting this for a long time: the only companies who would win the LLM race would be the megacorps bundling their offerings, and at most maybe OAI due to the sheer marketing dominance. For example I don't pay for ChatGPT or Claude, even if they are better at certain tasks or in general. But I have a Google One cloud storage sub for my photos and it comes with Gemini Pro apparently (thanks to someone on HN for pointing it out). And so Gemini is my go-to LLM app/service. I suspect the same goes for many others. | |
| ▲ | joaogui1 3 hours ago | parent | prev | next [-] | | It says Gemini App, not AI Overviews, AI Mode, etc | | |
| ▲ | recitedropper 3 hours ago | parent [-] | | They claim AI overviews as having "2 billion users" in the sentences prior. They are clearly trying as hard as possible to show the "best" numbers. |
| |
| ▲ | alecco 3 hours ago | parent | prev | next [-] | | Yeah my business account was forced to pay for an AI. And I only used it for a couple of weeks when Gemini 2.5 was launched, until it got nerfed. So they are definitely counting me there even though I haven't used it in like 7 months. Well, I try it once every other month to see if it's still crap, and it always is. I hope Gemini 3 is not the same and it gives an affordable plan compared to OpenAI/Anthropic. | |
| ▲ | blinding-streak 3 hours ago | parent | prev [-] | | Gemini app != Google search. You're implying they're lying? | | |
| ▲ | AstroBen 3 hours ago | parent [-] | | And you're implying they're being 100% truthful? Marketing is always somewhere in the middle | | |
| ▲ | bitpush 2 hours ago | parent [-] | | Companies can't get away with egregious marketing. See the Apple class action lawsuit over Apple Intelligence. |
|
|
|
|
| ▲ | Workaccount2 4 hours ago | parent | prev | next [-] |
| It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]...please count the legs) that so far every other model has failed agonizingly, even failing when I tell them they are failing, and they tend to fight back at me. Gemini 3 however, while still failing, at least recognized the 5th leg, but thought the dog was...well endowed. The 5th leg however is clearly a leg, despite being where you would expect the dogs member to be. I'll give it half credit for at least recognizing that there was something there. Still though, there is a lot of work that needs to be done on getting these models to properly "see" images. |
| |
| ▲ | recitedropper 3 hours ago | parent | next [-] | | Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for. | | |
| ▲ | Workaccount2 2 hours ago | parent | next [-] | | I strongly suspect it's a tokenization problem. Text and symbols fit nicely in tokens, but having something like a single "dog leg" token is a tough problem to solve. | | |
| ▲ | stalfie 2 hours ago | parent [-] | | The neural network in the retina actually pre-processes visual information into something akin to "tokens". Basic shapes that are probably somewhat evolutionarily preserved. I wonder if we could somehow mimic those for tokenization purposes. Most likely there's someone out there already trying. (Source: "The mind is flat" by Nick Chater) | | |
| ▲ | machiaweliczny an hour ago | parent [-] | | It's also easy to spot: when you are tired you might misrecognize objects. I caught myself doing this on long road trips. |
|
| |
| ▲ | orly01 2 hours ago | parent | prev [-] | | Why should it have to be expensive computationally? How do brains do it with such a low amount of energy? I think matching the brain abilities of even a bug might be very hard, but that does not mean there isn't a way to do it with little computational power. It requires having the correct structures/models/algorithms or whatever the precise jargon is. | |
| ▲ | recitedropper an hour ago | parent [-] | | This is the million dollar question. I'm not qualified to answer it, and I don't really think anyone out there has the answer yet. My armchair take would be that watt usage probably isn't a good proxy for computational complexity in biological systems. A good piece of evidence for this is from the C. elegans research that has found that the configuration of ions within a neuron--not just the electrical charge on the membrane--record computationally-relevant information about a stimulus. There are probably many more hacks like this that allow the brain to handle enormous complexity without it showing up in our measurements of its power consumption. | | |
| ▲ | nick32661123 an hour ago | parent [-] | | Following the trend of discovering smaller and smaller phenomena that our brains use for processing, it would not be surprising if we eventually find that our brains are very nearly "room temperature" quantum computers. |
|
|
| |
| ▲ | lukebechtel 3 hours ago | parent | prev [-] | | ah interesting. I wonder if this is a "safety guardrails blindspot" due to the placement. | | |
| ▲ | achow 3 hours ago | parent [-] | | Op is right. https://imgcdn.stablediffusionweb.com/2024/4/19/8e54925a-004... For the above pic I asked "What is wrong with the image?" Result:
- It totally missed the most obvious one - six fingers. Instead it said this: Based on a close inspection of the image, here are the things that are "wrong" or anomalous, indicating that it is likely an AI-generated image: - The Thumb Anatomy: This is the most obvious error. The digit in the thumb's position (far left) looks exactly like a long index finger. A natural human thumb has two segments (phalanges) and is shorter and thicker, opposing the other fingers. This digit appears to have three segments and is roughly the same length and shape as the other fingers, giving the hand the appearance of having five fingers and no thumb. - Unnatural Skin Texture: The skin is unnaturally smooth, waxy, and free of normal human imperfections like distinct pores, fine wrinkles, or hair. It has a "plastic" or "airbrushed" look typical of generative AI. - Palm Lines: The creases on the palm are vague and generic. They don't follow the distinct pattern of major flexion creases (like the heart, head, and life lines) found on a real human hand. - Wrist Connection: The connection between the hand and the wrist looks slightly distorted, with tendons that appear too prominent and smooth for the pose. [Edit: 3.0 is same as 2.5 - both answered almost identically] | | |
|
|
|
| ▲ | kachapopopow 8 minutes ago | parent | prev | next [-] |
| It's joeover for openai and anthropic. I have been using it for 3 hours now for real work and gpt-5.1 and sonnet 4.5 (thinking) do not come close. The token efficiency and context are also mindblowing... it feels like I am talking to someone who can think instead of a **rider that just agrees with everything you say and then fails at basic changes. gpt-5.1 feels particularly slow and weak in real world applications that are larger than a few dozen files. gemini 2.5 felt really weak considering the amount of data and their proprietary TPU hardware in theory allowing them way more flexibility, but gemini 3 just works and it truly understands, which is something I didn't think I'd be saying for a couple more years. |
|
| ▲ | stevesimmons 3 hours ago | parent | prev | next [-] |
| A nice Easter egg in the Gemini 3 docs [1]: If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
[1] https://ai.google.dev/gemini-api/docs/gemini-3?thinking=high... |
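For illustration, the field name and dummy value below come from the linked docs, while the surrounding message shape is only a sketch of a conversation trace being migrated from another model:

```python
# Illustrative shape only: "thoughtSignature" and its dummy value are from
# the Gemini 3 docs; the surrounding turn structure here is an assumption
# about what a migrated conversation trace might look like.
migrated_turn = {
    "role": "model",
    "parts": [
        {
            "text": "Previous assistant answer from the other model...",
            # Dummy signature that bypasses strict validation when the real
            # reasoning signature is unavailable:
            "thoughtSignature": "context_engineering_is_the_way_to_go",
        }
    ],
}
```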
| |
| ▲ | bijant 2 hours ago | parent [-] | | It's an artifact of the problem that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number. It sucks from a GDPR compliance perspective as well as in terms of transparent pricing: you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high, but if the model decides to think longer, "low" could result in more tokens used than "high" for a prompt where the model decides not to think that much. "Thinking budgets" are now "legacy", and thus while you can constrain output length you cannot constrain cost. Obviously you also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. Obviously you can try different variations of parts of your system prompt or tool descriptions, but just because they result in fewer thinking tokens does not mean they are better if those reasoning steps were actually beneficial (if only in edge cases). This would be immediately apparent upon inspection but is hard or impossible to find out without access to the full Chain of Thought. For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were A. to prevent rapid distillation, which they suspected DeepSeek of having used for R1, and B. to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning).
That's a tradeoff that is great for end-users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater and will, for certain applications, certainly outweigh the performance delta to Google/OpenAI. |
|
|
| ▲ | tylervigen 3 hours ago | parent | prev | next [-] |
| I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles). The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me. The ARC puzzles in question: https://arcprize.org/arc-agi/2/ |
| |
| ▲ | grantpitt 3 hours ago | parent | next [-] | | Agreed, it also leads performance on arc-agi-1. Here's the leaderboard where you can toggle between arc-agi-1 and 2: https://arcprize.org/leaderboard | |
| ▲ | tylervigen an hour ago | parent | prev | next [-] | | This comment was moved from another thread. The original thread included a benchmark chart with ARC performance: https://blog.google/products/gemini/gemini-3/#gemini-3 | |
| ▲ | stephc_int13 2 hours ago | parent | prev | next [-] | | What I would do if I were in the position of a large company in this space is arrange an internal team to create an ARC replica, covering very similar puzzles, and use that as part of the training. Ultimately, most benchmarks can be gamed and their real utility is thus short-lived. But I also think it's fair to use any means to beat it. | |
| ▲ | tylervigen 2 hours ago | parent | next [-] | | I agree that for any given test, you could build a specific pipeline to optimize for that test. I supposed that's why it is helpful to have many tests. However, many people have worked hard to optimize tools specifically for ARC over many years, and it's proven to be a particularly hard test to optimize for. This is why I find it so interesting that LLMs can do it well at all, regardless of whether tests like it are included in training. | |
| ▲ | simpsond 2 hours ago | parent | prev [-] | | Humans study for tests. They just tend to forget. |
| |
| ▲ | HarHarVeryFunny an hour ago | parent | prev [-] | | There's a good chance Gemini 3 was trained on ARC-AGI problems, unless they state otherwise. |
|
|
| ▲ | syspec 2 hours ago | parent | prev | next [-] |
| I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work. From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works but that I would never want to have to interact with. When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it. I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard. |
| |
| ▲ | jmkni an hour ago | parent [-] | | I can relate to this, it's doing exactly what I want, but it ain't pretty. It's fine though if you take the time to learn what it's doing and write a nicer version of it yourself |
|
|
| ▲ | markdog12 22 minutes ago | parent | prev | next [-] |
| I asked it to analyze my tennis serve. It was just dead wrong. For example, it said my elbow was bent. I had to show it a still image of full extension on contact, then it admitted, after reviewing again, it was wrong. Several more issues like this. It blamed it on video being difficult. Not very useful, despite the advertisements: https://x.com/sundarpichai/status/1990865172152660047 |
| |
| ▲ | BoorishBears 3 minutes ago | parent | next [-] | | The default FPS it's analyzing video at is 1, and I'm not sure the max is anywhere near enough to catch a full speed tennis serve. | |
| ▲ | strange_quark 10 minutes ago | parent | prev [-] | | I’ve never seen such a huge delta between advertised capabilities and real world experience. I’ve had a lot of very similar experiences to yours with these models where I will literally try verbatim something shown in an ad and get absolutely garbage results. Do these execs not use their own products? I don’t understand how they are even releasing this stuff. |
|
|
| ▲ | coffeecoders 3 hours ago | parent | prev | next [-] |
| Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren't necessarily those with the best models, but those who already control the surface where people live their digital lives. Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office. Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL). |
| |
| ▲ | bitpush 2 hours ago | parent | next [-] | | > Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office. One of them isnt the same as others (hint: It is Apple). The only thing Apple is doing with Maps is, is adding ads https://www.macrumors.com/2025/10/26/apple-moving-ahead-with... | |
| ▲ | Workaccount2 3 hours ago | parent | prev | next [-] | | AI Overviews have arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations. |
| ▲ | acoustics 3 hours ago | parent | prev [-] | | Microsoft hasn't been very quiet about it, at least in my experience. Every time I boot up Windows I get some kind of blurb about an AI feature. | | |
| ▲ | CobrastanJorji 2 hours ago | parent [-] | | Man, remember the days where we'd lose our minds at our operating systems doing stuff like that? | | |
| ▲ | esafak 31 minutes ago | parent [-] | | The people who lost their minds jumped ship. And I'm not going to work at a company that makes me use it, either. So, not my problem. |
|
|
|
|
| ▲ | crawshaw 2 hours ago | parent | prev | next [-] |
| Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive but whenever I put the hours into eval myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem. |
| |
| ▲ | film42 2 hours ago | parent | next [-] | | At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it. Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often. | | |
| ▲ | gordonhart 44 minutes ago | parent [-] | | Matches my experience exactly. It's not the best at writing code but Gemini 2.5 Pro is (was) the hands-down winner in every other use case I have. This was hard for me to accept initially as I've learned to be anti-Google over the years, but the better accuracy was too good to pass up on. Still expecting a rugpull eventually — price hike, killing features without warning, changing internal details that break everything — but it hasn't happened yet. |
| |
| ▲ | Szpadel 27 minutes ago | parent | prev [-] | | I gave it a spin with instructions that worked great with gpt-5-codex (5.1 regressed a lot, so I don't even compare to it). Code quality was fine for my very limited tests, but I was disappointed with instruction following. I tried a few tricks, but I wasn't able to convince it to present a plan before starting implementation. I have instructions saying it should first do exploration (where it tries to discover what I want), then plan the implementation, and then code, but it always jumps directly to code. This is a big issue for me, especially because gemini-cli lacks a plan mode like Claude Code's. For codex, those instructions make plan mode redundant. |
|
|
| ▲ | __jl__ 5 hours ago | parent | prev | next [-] |
| API pricing is up to $2/M for input and $12/M for output. For comparison:
Gemini 2.5 Pro was $1.25/M for input and $10/M for output
Gemini 1.5 Pro was $1.25/M for input and $5/M for output |
| |
| ▲ | raincole 4 hours ago | parent | next [-] | | Still cheaper than Sonnet 4.5: $3/M for input and $15/M for output. | | |
| ▲ | brianjking 4 hours ago | parent [-] | | It is so impressive that Anthropic has been able to maintain this pricing still. | | |
| ▲ | bottlepalm 2 hours ago | parent | next [-] | | Claude is just so good. Every time I try moving to ChatGPT or Gemini, they end up making concerning decisions. Trust is earned, and Claude has earned a lot of trust from me. Honestly, Google models have this mix of smart/dumb that is scary. If the universe gets turned into paperclips, it'll probably be by a Google model. | |
| ▲ | Aeolun 4 hours ago | parent | prev [-] | | Because every time I try to move away I realize there’s nothing equivalent to move to. | | |
| ▲ | Alex-Programs 3 hours ago | parent [-] | | People insist upon Codex, but it takes ages and has an absolutely hideous lack of taste. | | |
|
|
| |
| ▲ | jhack 4 hours ago | parent | prev | next [-] | | With this kind of pricing I wonder if it'll be available in Gemini CLI for free or if it'll stay at 2.5. | | | |
| ▲ | fosterfriends 3 hours ago | parent | prev [-] | | Thrilled to see the cost is competitive with Anthropic. |
|
|
| ▲ | siva7 3 hours ago | parent | prev | next [-] |
| I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine).
Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions).
So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is far more nuanced than their older model, but it still makes mistakes that the other two SOTA competitors don't. For example, in a law benchmark it forgets that certain principles don't apply in the country the case comes from. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the cultural context assumed by the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems). |
| |
| ▲ | MaxL93 2 hours ago | parent [-] | | > It seems very US centric in its thinking I'm not surprised. I'm French and one thing I've consistently seen with Gemini is that it loves to use Title Case (Everything is Capitalized Except the Prepositions) even in French or other languages where there is no such thing. A 100% american thing getting applied to other languages by the sheer power of statistical correlation (and probably being overtrained on USA-centric data). At the very least it makes it easy to tell when someone is just copypasting LLM output into some other website. |
|
|
| ▲ | senfiaj 31 minutes ago | parent | prev | next [-] |
| Haven't used Gemini much, but when I did, it often refused to do certain things that ChatGPT did happily, probably because it has many things heavily censored. Obviously, a huge company like Google is under much heavier regulatory scrutiny than OpenAI. Unfortunately this greatly reduces its usefulness in many situations, despite Google having more resources and computational power than OpenAI. |
|
| ▲ | golfer 4 hours ago | parent | prev | next [-] |
| Supposedly this is the model card. Very impressive results. https://pbs.twimg.com/media/G6CFG6jXAAA1p0I?format=jpg&name=... Also, the full document: https://archive.org/details/gemini-3-pro-model-card/page/n3/... |
| |
| ▲ | tweakimp 4 hours ago | parent | next [-] | | Every time I see a table like this numbers go up. Can someone explain what this actually means? Is there just an improvement that some tests are solved in a better way or is this a breakthrough and this model can do something that all others can not? | | |
| ▲ | rvnx 4 hours ago | parent | next [-] | | This is a list of questions and answers that was created by different people. The questions AND the answers are public. If the LLM manages, through reasoning OR memory, to repeat back the answer, then it wins. The scores represent the % of correct answers recalled. |
| ▲ | stavros 4 hours ago | parent | prev [-] | | I estimate another 7 months before models start getting 115% on Humanity's Last Exam. |
| |
| ▲ | HardCodedBias 4 hours ago | parent | prev [-] | | If you believe another thread the benchmarks are comparing Gemini-3 (probably thinking) to GPT-5.1 without thinking. The person also claims that with thinking on the gap narrows considerably. We'll probably have 3rd party benchmarks in a couple of days. | | |
|
|
| ▲ | meetpateltech 4 hours ago | parent | prev | next [-] |
| DeepMind page: https://deepmind.google/models/gemini/ Gemini 3 Pro DeepMind Page: https://deepmind.google/models/gemini/pro/ Developer blog: https://blog.google/technology/developers/gemini-3-developer... Gemini 3 Docs: https://ai.google.dev/gemini-api/docs/gemini-3 Google Antigravity: https://antigravity.google/ |
|
| ▲ | zone411 2 hours ago | parent | prev | next [-] |
| Sets a new record on the Extended NYT Connections benchmark: 96.8 (https://github.com/lechmazur/nyt-connections/). Grok 4 is at 92.1, GPT-5 Pro at 83.9, Claude Opus 4.1 Thinking 16K at 58.8. Gemini 2.5 Pro scored 57.6, so this is a huge improvement. |
|
| ▲ | mpeg 3 hours ago | parent | prev | next [-] |
| Well, it just found a bug in one shot that Gemini 2.5 and GPT5 failed to find in relatively long sessions. Claude 4.5 had found it but not one shot. Very subjective benchmark, but it feels like the new SOTA for hard tasks (at least for the next 5 minutes until someone else releases a new model) |
|
| ▲ | Dquiroga 24 minutes ago | parent | prev | next [-] |
| I asked Gemini to write "a comment response to this thread. I want to start an intense discussion". Gemini 3: The cognitive dissonance in this thread is staggering. We are sitting here cheering for a model that effectively closes the loop on Google’s total information dominance, while simultaneously training our own replacements. Two things in this thread should be terrifying, yet are being glossed over in favor of "look at the pelican SVG": 1. *The Model Card Leak:* As user `rvz` pointed out, the leaked model card implies training on "user data... in accordance with Google's relevant terms of service." Given the recent terms updates, this essentially confirms that Gemini 3 is likely refined on the corpus of Gmail and Drive data that they’ve been hoarding for decades. We are cheering for a product sold back to us at a 60% markup (input costs up to $2.00/M) that was built on our own private correspondence. 2. *The "Agentic" Trap:* Everyone praising how `Gemini 3` solves complex bugs one-shot or handles entire PRs is missing the macro view. We aren’t looking at a productivity tool; we are looking at the devaluation of the software engineer. When the entry barrier drops to "can you prompt a seminal agent," the economic value of what we do on this forum evaporates. Google has successfully gamified us into feeding the very beast that will make the "14-minute human solve time" (referenced by `lairv`) irrelevant. We are optimizing for our own obsolescence while paying a monopoly rent to do it. Why is the sentiment here "Wow, cool clock widget" instead of "We just handed the keys to the kingdom to the biggest ad-tech surveillance machine in history"? |
| |
|
| ▲ | santhoshr 4 hours ago | parent | prev | next [-] |
| Pelican riding a bicycle: https://pasteboard.co/CjJ7Xxftljzp.png |
| |
| ▲ | xnx 3 hours ago | parent | next [-] | | 2D SVG is old news. Next frontier is animated 3D. One shot shows there's still progress to be made: https://aistudio.google.com/apps/drive/1XA4HdqQK5ixqi1jD9uMg... | | | |
| ▲ | mohsen1 4 hours ago | parent | prev | next [-] | | Sometimes I think I should spend $50 on Upwork to get a real human artist to do it first, so we know what we're going for. What does a good SVG of a pelican riding a bicycle actually look like? | |
| ▲ | AstroBen 4 hours ago | parent [-] | | IMO it's not about art, but a completely different path than all these images are going down. The pelican needs tools to ride the bike, or a modified bike. Maybe a recumbent? |
| |
| ▲ | robterrell 4 hours ago | parent | prev | next [-] | | At this point I'm surprised they haven't been training on thousands of professionally-created SVGs of pelicans on bicycles. | | |
| ▲ | notatoad 4 hours ago | parent [-] | | i think anything that makes it clear they've done that would be a lot worse PR than failing the pelican test would ever be. | | |
| ▲ | imiric 3 hours ago | parent [-] | | It would be next to impossible for anyone without insider knowledge to prove that to be the case. Secondly, benchmarks are public data, and these models are trained on such large amounts of it that it would be impractical to ensure that some benchmark data is not part of the training set. And even if it's not, it would be safe to assume that engineers building these models would test their performance on all kinds of benchmarks, and tweak them accordingly. This happens all the time in other industries as well. So the pelican riding a bicycle test is interesting, but it's not a performance indicator at this point. |
|
| |
| ▲ | bn-l 4 hours ago | parent | prev [-] | | It’s a good pelican. Not great but good. | | |
|
|
| ▲ | sd9 5 hours ago | parent | prev | next [-] |
| How long does it typically take after this to become available on https://gemini.google.com/app ? I would like to try the model, wondering if it's worth setting up billing or waiting. At the moment trying to use it in AI Studio (on the Free tier) just gives me "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow." |
| |
| ▲ | mpeg 5 hours ago | parent | next [-] | | Allegedly it's already available in stealth mode if you choose the "canvas" tool and 2.5. I don't know how true that is, but it is indeed pumping out some really impressive one shot code Edit: Now that I have access to Gemini 3 preview, I've compared the results of the same one shot prompts on the gemini app's 2.5 canvas vs 3 AI studio and they're very similar. I think the rumor of a stealth launch might be true. | | |
| ▲ | sd9 4 hours ago | parent [-] | | Thanks for the hint about Canvas/2.5. I have access to 3.0 in AI Studio now, and I agree the results are very similar. |
| |
| ▲ | netdur 36 minutes ago | parent | prev | next [-] | | On gemini.google.com, I see options labeled 'Fast' and 'Thinking.' The 'Thinking' option uses Gemini 3 Pro | |
| ▲ | magicalhippo 4 hours ago | parent | prev | next [-] | | > https://gemini.google.com/app How come I can't even see prices without logging in... they doing regional pricing? | |
| ▲ | Romario77 3 hours ago | parent | prev | next [-] | | It's available in cursor. Should be there pretty soon as well. | | |
| ▲ | ionwake 2 hours ago | parent [-] | | are you sure its available in cursor? ( I get: We're having trouble connecting to the model provider. This might be temporary - please try again in a moment. ) |
| |
| ▲ | Squarex 5 hours ago | parent | prev | next [-] | | Today, I guess. They were not releasing preview models this time, and it seems they want to synchronize the release. |
| ▲ | csomar 4 hours ago | parent | prev [-] | | It's already available. I asked it "how smart are you really?" and it gave me the same ai garbage template that's now very common on blog posts: https://gist.githubusercontent.com/omarabid/a7e564f09401a64e... |
|
|
| ▲ | beezlewax 4 minutes ago | parent | prev | next [-] |
| Can't wait til Gemini 4 is out! |
|
| ▲ | svantana 4 hours ago | parent | prev | next [-] |
| Grok got to hold the top spot of LMArena-text for all of ~24 hours, good for them [1]. With style control enabled, that is. Without style control, Gemini held the fort. [1] https://lmarena.ai/leaderboard/text |
| |
| ▲ | inkysigma 3 hours ago | parent | next [-] | | Is it just me or is that link broken because of the cloudflare outage? Edit: nvm it looks to be up for me again | |
| ▲ | dyauspitr 2 hours ago | parent | prev [-] | | Grok is heavily censored though |
|
|
| ▲ | creddit an hour ago | parent | prev | next [-] |
| Gemini 3 is crushing my personal evals for research purposes. I would cancel my ChatGPT sub immediately if Gemini had a desktop app, and I may still do so if it continues to impress me as much as it has so far; I'll just live without the desktop app. It's really, really, really good so far. Wow. Note that I haven't tried it for coding yet! |
| |
| ▲ | ethmarks 20 minutes ago | parent [-] | | Genuinely curious here: why is the desktop app so important? I completely understand the appeal of having local and offline applications, but the ChatGPT desktop app doesn't work without an internet connection anyways. Is it just the convenience? Why is a dedicated desktop app so much better than just opening a browser tab or even using a PWA? Also, have you looked into open-webui or Msty or other provider-agnostic LLM desktop apps? I personally use Msty with Gemini 2.5 Pro for complex tasks and Cerebras GLM 4.6 for fast tasks. |
|
|
| ▲ | recitedropper an hour ago | parent | prev | next [-] |
| Who wants to bet they benchmaxxed ARC-AGI-2? Nothing in their release implies they found some sort of "secret sauce" that justifies the jump. Maybe they are keeping that itself secret, but more likely they probably just have had humans generate an enormous number of examples, and then synthetically build on that. No benchmark is safe, when this much money is on the line. |
| |
| ▲ | sosodev 36 minutes ago | parent | next [-] | | Here's some insight from Jeff Dean and Noam Shazeer's interview with Dwarkesh Patel https://youtu.be/v0gjI__RyCY&t=7390 > When you think about divulging this information that has been helpful to your competitors, in retrospect is it like, "Yeah, we'd still do it," or would you be like, "Ah, we didn't realize how big a deal transformer was. We should have kept it indoors." How do you think about that? > Some things we think are super critical we might not publish. Some things we think are really interesting but important for improving our products; We'll get them out into our products and then make a decision. | |
| ▲ | HarHarVeryFunny 43 minutes ago | parent | prev | next [-] | | I'd also be curious what kind of tools they are providing to get the jump from Pro to Deep Think (with tools) performance. ARC-AGI specialized tools? | |
| ▲ | horhay 35 minutes ago | parent | prev [-] | | They ran the tests themselves only on semi-private evals. Basically the same caveat as when o3 supposedly beat ARC1 |
|
|
| ▲ | realty_geek 36 minutes ago | parent | prev | next [-] |
| I would like to try controlling my browser with this model. Any ideas on how to do this?
Ideally I would like something like OpenAI's Atlas or Perplexity's Comet, but powered by Gemini 3. |
| |
|
| ▲ | mil22 5 hours ago | parent | prev | next [-] |
| It's available to be selected, but the quota does not seem to have been enabled just yet. "Failed to generate content, quota exceeded: you have reached the limit of requests today for this model. Please try again tomorrow." "You've reached your rate limit. Please try again later." Update: as of 3:33 PM UTC, Tuesday, November 18, 2025, it seems to be enabled. |
| |
| ▲ | sarreph 5 hours ago | parent | next [-] | | Looks to be available in Vertex. I reckon it's an API key thing... you can more explicitly select a "paid API key" in AI Studio now. | |
| ▲ | CjHuber 4 hours ago | parent | prev | next [-] | | For me it’s up and running.
I was doing some work in AI Studio when it was released and have already rerun a few prompts. Interesting also that you can now set the thinking level to low or high. I hope it does something; in 2.5, increasing the maximum thought tokens never made it think more. | |
| ▲ | lousken 5 hours ago | parent | prev | next [-] | | I hope some users will switch from cerebras to free up those resources | |
| ▲ | r0fl 4 hours ago | parent | prev | next [-] | | Works for me. | |
| ▲ | misiti3780 5 hours ago | parent | prev [-] | | seeing the same issue. | | |
| ▲ | sottol 5 hours ago | parent [-] | | You can bring your own Google API key to try it out, and Google used to give $300 free when signing up for billing and creating a key. When I signed up for billing via Cloud Console and entered my credit card, I got $300 in "free credits". I haven't thrown a difficult problem at Gemini 3 Pro yet, but I'm sure I got to see it in some of the A/B tests in AI Studio for a while. I could not tell which model was clearly better; one was always more succinct and I liked its "style", but they usually offered about the same solution. |
|
|
|
| ▲ | nickandbro 4 hours ago | parent | prev | next [-] |
| What we have all been waiting for: "Create me a SVG of a pelican riding on a bicycle" https://www.svgviewer.dev/s/FfhmhTK1 |
| |
▲ | Thev00d00 4 hours ago | parent | next [-] | | That is pretty impressive. So impressive it makes you wonder if someone has noticed it being used as a benchmark prompt. | |
| ▲ | burkaman 4 hours ago | parent | next [-] | | Simon says if he gets a suspiciously good result he'll just try a bunch of other absurd animal/vehicle combinations to see if they trained a special case: https://simonwillison.net/2025/Nov/13/training-for-pelicans-... | | |
| ▲ | jmmcd 4 hours ago | parent | next [-] | | "Pelican on bicycle" is one special case, but the problem (and the interesting point) is that with LLMs, they are always generalising. If a lab focussed specially on pelicans on bicycles, they would as a by-product improve performance on, say, tigers on rollercoasters. This is new and counter-intuitive to most ML/AI people. | | |
| ▲ | BoorishBears 44 minutes ago | parent [-] | | The gold standard for cheating on a benchmark is SFT and ignoring memorization. That's why the standard for quickly testing for benchmark contamination has always been to switch out specifics of the task. Like replacing named concepts with nonsense words in reasoning benchmarks. |
| |
| ▲ | ddalex 4 hours ago | parent | prev [-] | | https://www.svgviewer.dev/s/TVk9pqGE giraffe in a ferrari |
| |
▲ | rixed 4 hours ago | parent | prev [-] | | I have tried combinations of hard-to-draw vehicles and animals (crocodile, frog, pterodactyl, riding a hang glider, tricycle, skydiving), and it did a rather good job in every case (compared to previous tests). Whatever they have done to improve on that point, they did it in a way that generalises. |
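That kind of combination sweep is easy to script; the animal and vehicle lists here are illustrative, not a standard benchmark — the point is just to probe pairings unlikely to have been special-cased in training:

```python
import itertools
import random

# Illustrative lists, per the contamination concern discussed above.
animals = ["pelican", "crocodile", "frog", "pterodactyl", "giraffe"]
vehicles = ["bicycle", "hang glider", "tricycle", "rollercoaster", "unicycle"]

random.seed(0)  # reproducible sample of novel pairings
pairs = random.sample(list(itertools.product(animals, vehicles)), k=5)
prompts = [f"Generate an SVG of a {a} riding a {v}" for a, v in pairs]
for p in prompts:
    print(p)
```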
| |
| ▲ | bitshiftfaced 4 hours ago | parent | prev [-] | | It hadn't occurred to me until now that the pelican could overcome the short legs issue by not sitting on the seat and instead put its legs inside the frame of the bike. That's probably closer to how a real pelican would ride a bike, even if it wasn't deliberate. | | |
|
|
| ▲ | yomismoaqui 2 hours ago | parent | prev | next [-] |
| From an initial testing of my personal benchmark it works better than Gemini 2.5 pro. My use case is using Gemini to help me test a card game I'm developing. The model simulates the board state and when the player has to do something it asks me what card to play, discard... etc. The game is similar to something like Magic the Gathering or Slay the Spire with card play inspired by Marvel Champions (you discard cards from your hand to pay the cost of a card and play it) The test is just feeding the model the game rules document (markdown) with a prompt asking it to simulate the game delegating the player decisions to me, nothing special here. It seems like it forgets rules less than Gemini 2.5 Pro using thinking budget to max. It's not perfect but it helps a lot to test little changes to the game, rewind to a previous turn changing a card on the fly, etc... |
|
| ▲ | qustrolabe 3 hours ago | parent | prev | next [-] |
| Of all the companies, Google provides the most generous free access so far. I bet this gives them plenty of data to train even better models. |
|
| ▲ | Retr0id 27 minutes ago | parent | prev | next [-] |
| > it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month Do regular users know how to disable AI Overviews, if they don't love them? |
| |
|
| ▲ | nighwatch 2 hours ago | parent | prev | next [-] |
| I just tested the Gemini 3 preview as well, and its capabilities are honestly surprising.
As an experiment I asked it to recreate a small slice of Zelda, nothing fancy, just a mock interface and a very rough combat scene. It managed to put together a pretty convincing UI using only SVG, and even wired up some simple interactions. It’s obviously nowhere near a real game, but the fact that it can structure and render something that coherent from a single prompt is kind of wild. Curious to see how far this generation can actually go once the tooling matures. |
|
| ▲ | I_am_tiberius 32 minutes ago | parent | prev | next [-] |
| I still need a google account to use it and it always asks me for a phone verification, which I don't want to give to google. That prevents me from using Gemini. I would even pay for it. |
| |
| ▲ | gpm 28 minutes ago | parent [-] | | > I would even pay for it. Is it just me, or is it generally the case that to pay for anything on the internet you have to enter credit card information, including a phone number? | |
| ▲ | I_am_tiberius 25 minutes ago | parent [-] | | You never have to add your phone number in order to pay. | | |
| ▲ | gpm 21 minutes ago | parent [-] | | While I haven't tried leaving the field blank on every credit card form I've come across, I'm certain that at least some of them considered it required. Perhaps its country specific? |
|
|
|
|
| ▲ | JacobiX an hour ago | parent | prev | next [-] |
| Tested it on a bug that Claude and ChatGPT Pro struggled with, it nailed it, but only solved it partially (it was about matching data using a bipartite graph).
Another task was optimizing a complex SQL script: the deep-thinking mode provided a genuinely nuanced approach using indexes and rewriting parts of the query. ChatGPT Pro had identified more or less the same issues.
For frontend development, I think it’s obvious that it’s more powerful than Claude Code, at least in my tests, the UIs it produces are just better.
For backend development, it’s good, but I noticed that in Java specifically, it often outputs code that doesn’t compile on the first try, unlike Claude. |
| |
| ▲ | skrebbel an hour ago | parent [-] | | > it nailed it, but only solved it partially Hey either it nailed it or it didn't. | | |
▲ | JacobiX an hour ago | parent | next [-] | | Yes; it nailed the root cause, but the implementation is not 100% correct. | |
| ▲ | joaogui1 an hour ago | parent | prev [-] | | Probably figured out the exact cause of the bug but not how to solve it |
|
|
|
| ▲ | mrinterweb an hour ago | parent | prev | next [-] |
| Hit the Gemini 3 quota on the second prompt in antigravity even though I'm a pro user. I highly doubt I hit a context window based on my prompt. Hopefully, it is just first day of near general availability jitters. |
|
| ▲ | bespokedevelopr 3 hours ago | parent | prev | next [-] |
| Wow so the polymarket insider bet was true then.. https://old.reddit.com/r/wallstreetbets/comments/1oz6gjp/new... |
| |
▲ | giarc 3 hours ago | parent | next [-] | | These prediction markets are so ripe for abuse it's unbelievable. People need to realize there are real people on the other side of these bets. Brian Armstrong, CEO of Coinbase, intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. These types of bets shouldn't be allowed. | |
| ▲ | ATMLOTTOBEER 39 minutes ago | parent | next [-] | | It’s not really abuse though. These markets aggregate information; when an insider takes one side of a trade, they are selling their information about the true price (probability of the thing happening) to the market (and the price will move accordingly). You’re spot on that people should think of who is on the other side of the trades they’re taking, and be extremely paranoid of being adversely selected. Disallowing people from making terrible trades seems…paternalistic? Idk | |
| ▲ | ethmarks 3 hours ago | parent | prev | next [-] | | The point of prediction markets isn't to be fair. They are not the stock market. The point of prediction markets is to predict. They provide a monetary incentive for people who are good at predicting stuff. Whether that's due to luck, analysis, insider knowledge, or the ability to influence the result is irrelevant. If you don't want to participate in an unfair market, don't participate in prediction markets. | |
▲ | Dilettante_ an hour ago | parent | prev | next [-] | | >Brian Armstrong, CEO of Coinbase intentionally altered the outcome of a bet by randomly stating "Bitcoin, Ethereum, blockchain, staking, Web3" at the end of an earnings call. For the kind of person playing these sorts of games, that's actually really "hype". | |
| ▲ | HDThoreaun 3 hours ago | parent | prev | next [-] | | I’m pretty sure that these model release date markets are made to be abused. They’re just a way to pay insiders to tell you when the model will be released. The mention markets are pure degenerate gambling and everyone involved knows that | | |
| ▲ | ATMLOTTOBEER 35 minutes ago | parent [-] | | Correct, and this is actually how all markets work in the sense that they allow for price discovery :) |
| |
| ▲ | FergusArgyll an hour ago | parent | prev [-] | | Abuse sounds bad, this is good! Now we have a sneak peek into the future, for free! Just don't bet on any markets where an insider has knowledge (or don't bet at all) |
| |
| ▲ | fresh_broccoli 35 minutes ago | parent | prev [-] | | In hindsight, one possible reason to bet on November 18 was the deprecation date of older models: https://www.reddit.com/r/singularity/comments/1oom1lq/google... |
|
|
| ▲ | icyfox 4 hours ago | parent | prev | next [-] |
| Pretty happy the under 200k token pricing is staying in the same ballpark as Gemini 2.5 Pro: Input: $1.25 -> $2.00 (1M tokens) Output: $10.00 -> $12.00 Squeezes a bit more margin out of app layer companies, certainly, but there's a good chance that for tasks that really require a sota model it can be more than justified. |
| |
| ▲ | rudedogg 3 hours ago | parent [-] | | Every recent release has bumped the pricing significantly. If I was building a product and my margins weren’t incredible I’d be concerned. The input price almost doubled with this one. | | |
| ▲ | icyfox 3 hours ago | parent [-] | | I'm not sure how concerned people should be at the trend lines. If you're building a product that already works well, you shouldn't feel the need to upgrade to a larger parameter model. If your product doesn't work and the new architectures unlock performance that would let you have a feasible business, even a 2x on input tokens shouldn't be the dealbreaker. If we're paying more for a more petaflop heavy model, it makes sense that costs would go up. What really would concern me is if companies start ratcheting prices up for models with the same level of performance. My hope is raw hardware costs and OSS releases keep a lid on the margin pressure. |
|
|
|
| ▲ | CephalopodMD an hour ago | parent | prev | next [-] |
| What I'm getting from this thread is that people have their own private benchmarks. It's almost a cottage industry. Maybe someone should crowd source those benchmarks, keep them completely secret, and create a new public benchmark of people's private AGI tests. All they should release for a given model is the final average score. |
|
| ▲ | ponyous 4 hours ago | parent | prev | next [-] |
| Can’t wait to test it out. I've been running a ton of benchmarks (1000+ generations) for my AI-to-CAD-model project and noticed: - GPT-5 medium is the best - GPT-5.1 falls right between Gemini 2.5 Pro and GPT-5, but it’s quite a bit faster Really wonder how well Gemini 3 will perform |
|
| ▲ | GodelNumbering 4 hours ago | parent | prev | next [-] |
| And of course they hiked the API prices Standard Context(≤ 200K tokens) Input $2.00 vs $1.25 (Gemini 3 pro input is 60% more expensive vs 2.5) Output $12.00 vs $10.00 (Gemini 3 pro output is 20% more expensive vs 2.5) Long Context(> 200K tokens) Input $4.00 vs $2.50 (same +60%) Output $18.00 vs $15.00 (same +20%) |
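The hike is easier to judge per request; a minimal cost helper using the standard-context list prices quoted above (the 100K/10K token counts are an arbitrary example):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_per_m: float, out_per_m: float) -> float:
    """Request cost in USD, given per-million-token prices."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# Example: 100K input / 10K output, standard context (<= 200K tokens)
old = cost_usd(100_000, 10_000, 1.25, 10.00)  # Gemini 2.5 Pro: $0.225
new = cost_usd(100_000, 10_000, 2.00, 12.00)  # Gemini 3 Pro:   $0.32
print(f"2.5 Pro: ${old:.3f}  3 Pro: ${new:.3f}")
```

Input-heavy workloads feel the +60% input bump most; for this mix the blended increase is roughly 40%.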
| |
| ▲ | panarky 4 hours ago | parent | next [-] | | Claude Opus is $15 input, $75 output. | |
| ▲ | CjHuber 4 hours ago | parent | prev [-] | | Is it the first time long context has separate pricing? I hadn’t encountered that yet | | |
|
|
| ▲ | zone411 an hour ago | parent | prev | next [-] |
| Sets a new record on the Extended NYT Connections: 96.8. Gemini 2.5 Pro scored only 57.6. https://github.com/lechmazur/nyt-connections/ |
|
| ▲ | oceanplexian an hour ago | parent | prev | next [-] |
| Suspicious that none of the benchmarks include Chinese models, even though they scored higher on those benchmarks than the models being compared? |
|
| ▲ | briga 3 hours ago | parent | prev | next [-] |
| With every big new model release we see benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is, how do we know that these benchmarks are not part of the training set used for these models? The model could easily have been trained to memorize the answers. Even if the datasets haven't been copy-pasted directly, I'm sure they have leaked onto the internet to some extent. But I am looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest. |
| |
| ▲ | stephc_int13 2 hours ago | parent [-] | | Even if the benchmarks themselves are kept secret, the process used to create them is not that difficult, and anyone with a small team of engineers could build a replica in their own lab to train their models on. Given the nature of how these models work, you don't need exact replicas. |
|
|
| ▲ | gertrunde 4 hours ago | parent | prev | next [-] |
| "AI Overviews now have 2 billion users every month." "Users"? Or people that get presented with it and ignore it? |
| |
| ▲ | singhrac 3 hours ago | parent | next [-] | | They're a bit less bad than they used to be. I'm not exactly happy about what this means to incentives (and rewards) for doing research and writing good content, but sometimes I ask a dumb question out of curiosity and Google overview will give it to me (e.g. "what's in flower food?"). I don't need GPT 5.1 Thinking for that. | |
| ▲ | mNovak 2 hours ago | parent | prev | next [-] | | Maybe you ignore it, but Google has stated in the past that click-through rates with AI overviews are way down. To me, that implies the 'user' read the summary and got what they needed, such that they didn't feel the need to dig into a further site (ignoring whether that's a good thing or not). I'd be comfortable calling a 'user' anyone who clicked to expand the little summary. Not sure what else you'd call them. | | |
| ▲ | gertrunde an hour ago | parent [-] | | You're right, I'm probably being a little uncharitable! Normal users (i.e. not grumpy techies ;) ) probably just go with the flow rather than finding it irritating. |
| |
| ▲ | recitedropper 3 hours ago | parent | prev [-] | | "Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month." Cringe. To get to 2 billion a month they must be counting anyone who sees an AI overview as a user. They should just go ahead and claim the "most quickly adopted product in history" as well. |
|
|
| ▲ | t_minus_40 20 minutes ago | parent | prev | next [-] |
| Is there even a puzzle or math problem Gemini 3 can't solve? |
|
| ▲ | aliljet 4 hours ago | parent | prev | next [-] |
| When will this be available in the cli? |
| |
|
| ▲ | AstroBen an hour ago | parent | prev | next [-] |
| First impression is I'm having a distinctly harder time getting this to stick to instructions as compared to Gemini 2.5 |
|
| ▲ | mikeortman 3 hours ago | parent | prev | next [-] |
| It's available for me now at gemini.google.com... but it's failing badly at accurate audio transcription. It's transcribing the meeting but hallucinates badly, in both fast and thinking mode. Fast mode only transcribed about a fifth of the meeting before saying it was done. Thinking mode completely changed the topic and made up ENTIRE conversations. Gemini 2.5 actually transcribed it decently, with just occasional missteps when people talked over each other. I'm concerned. |
|
| ▲ | zurfer 4 hours ago | parent | prev | next [-] |
| It also tops the LMSYS leaderboard across all categories. However, the knowledge cutoff is Jan 2025. I do wonder how long they have been pre-training this thing :D. |
| |
|
| ▲ | energy123 2 hours ago | parent | prev | next [-] |
| Impressive. Although the Deep Think benchmark results are suspicious given they're comparing apples (tools on) with oranges (tools off) in their chart to visually show an improvement. |
|
| ▲ | sunaookami 3 hours ago | parent | prev | next [-] |
| Gemini CLI crashes due to this bug: https://github.com/google-gemini/gemini-cli/issues/13050 and when applying the fix in the settings file I can't login with my Google account due to "The authentication did not complete successfully. The following products are not yet authorized to access your account" with useless links to completely different products (Code Assist). Antigravity uses Open-VSX and can't be configured differently even though it says it right there (setting is missing). Gemini website still only lists 2.5 Pro. Guess I will just stick to Claude. |
|
| ▲ | scrollop 3 hours ago | parent | prev | next [-] |
| Here it makes a text based video editor that works: https://youtu.be/MPjOQIQO8eQ?si=wcrCSLYx3LjeYDfi&t=797 |
|
| ▲ | nilsingwersen 5 hours ago | parent | prev | next [-] |
| Feeling great to see something confidential |
|
| ▲ | alksdjf89243 an hour ago | parent | prev | next [-] |
| Pretty obvious how contaminated this site is with goog employees upvoting nonsense like this. |
|
| ▲ | RobinL 5 hours ago | parent | prev | next [-] |
| - Anyone have any idea why it says 'confidential'? - Anyone actually able to use it? I get 'You've reached your rate limit. Please try again later'. (That said, I don't have a paid plan, but I've always had pretty much unlimited access to 2.5 pro) [Edit: working for me now in ai studio] |
|
| ▲ | thedelanyo 4 hours ago | parent | prev | next [-] |
| Reading the introductory passage - all I can say now is, AI is here to stay. |
|
| ▲ | John-Tony 2 hours ago | parent | prev | next [-] |
| The Gemini 3 Pro Preview in Google AI Studio is a big deal — it's the latest multimodal model with stronger reasoning and massive context window (~1M tokens), now available for early testing. |
|
| ▲ | ilaksh 2 hours ago | parent | prev | next [-] |
| Okay, since Gemini 3 is in AI Mode now, I switched from the free Perplexity back to Google as my search default. |
|
| ▲ | CjHuber 4 hours ago | parent | prev | next [-] |
| Interesting that they added an option to select your own API key right in AI studio‘s input field.
I sincerely hope the times of generous free AIstudio usage are not over |
|
| ▲ | serjester 3 hours ago | parent | prev | next [-] |
| It's disappointing there's no flash / lite version - this is where Google has excelled up to this point. |
| |
| ▲ | aoeusnth1 3 hours ago | parent [-] | | Maybe they're slow rolling the announcements to be in the news more | | |
| ▲ | coffeebeqn 2 hours ago | parent [-] | | Most likely. And/or they use the full model to train the smaller ones somehow | | |
|
|
|
| ▲ | pflenker 3 hours ago | parent | prev | next [-] |
| > Since then, it’s been incredible to see how much people love it. AI Overviews now have 2 billion users every month. Come on, you can’t be serious. |
|
| ▲ | NullCascade 3 hours ago | parent | prev | next [-] |
| I'm not a mathematician, but I think we underestimate how useful pure mathematics can be for telling whether we are approaching AGI. Can the mathematicians here try asking it to invent novel math related to [Insert your field of specialization] and see if it comes up with something new and useful? Try lowering the temperature, use SymPy, etc. |
| |
|
| ▲ | testfrequency an hour ago | parent | prev | next [-] |
| I continue to not use Gemini as I can’t have my data not trained but also have chat history at the same time. Yes, I know the Workspaces workaround, but that’s silly. |
|
| ▲ | BoorishBears an hour ago | parent | prev | next [-] |
| So they won't release multimodal or Flash at launch, but I'm guessing people who blew smoke up the right person's backside on X are already building with it Glad to see Google still can't get out of its own way. |
|
| ▲ | DeathArrow 4 hours ago | parent | prev | next [-] |
| It generated a quite cool pelican on a bike: https://imgur.com/a/yzXpEEh |
| |
| ▲ | rixed 4 hours ago | parent [-] | | 2025: solve the biking pelican problem 2026: cure cancer |
|
|
| ▲ | XCSme 3 hours ago | parent | prev | next [-] |
| How's the pelican? |
|
| ▲ | samuelknight 4 hours ago | parent | prev | next [-] |
| "Gemini 3 Pro Preview" is in Vertex |
|
| ▲ | guluarte 5 hours ago | parent | prev | next [-] |
| It is live in the API > gemini-3-pro-preview-ais-applets > gemini-3-pro-preview |
| |
| ▲ | spudlyo 3 hours ago | parent [-] | | Can confirm. I was able to access it using GPTel in Emacs using 'gemini-3-pro-preview' as the model name. |
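For anyone wanting to hit the model ID confirmed above directly, here is a minimal sketch of a raw request against the public Gemini REST API's generateContent endpoint. It only builds the request (no network call is made); the `GEMINI_API_KEY` environment variable is an assumption about how you store your key:

```python
# Build (but don't send) a generateContent request for the preview model ID
# confirmed in this thread. Endpoint shape follows the public Gemini REST API;
# reading the key from GEMINI_API_KEY is an assumption.
import json
import os

MODEL = "gemini-3-pro-preview"
URL = ("https://generativelanguage.googleapis.com/v1beta/"
       f"models/{MODEL}:generateContent")

def build_request(prompt: str) -> dict:
    return {
        "url": URL,
        "headers": {
            "Content-Type": "application/json",
            "x-goog-api-key": os.environ.get("GEMINI_API_KEY", ""),
        },
        "body": json.dumps({"contents": [{"parts": [{"text": prompt}]}]}),
    }

req = build_request("Say hello")
print(req["url"])
```

Sending it is then a plain POST of `body` to `url` with those headers, e.g. via `requests.post`.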
|
|
| ▲ | skerit 4 hours ago | parent | prev | next [-] |
| Not the preview crap again.
Haven't they tested it enough?
When will it be available in Gemini-CLI? |
| |
| ▲ | CjHuber 4 hours ago | parent [-] | | Honestly I liked 2.5 Pro preview much more than the final version |
|
|
| ▲ | Der_Einzige 4 hours ago | parent | prev | next [-] |
| When will they allow us to use modern LLM samplers like min_p, or even better samplers like top-N sigma or P-less decoding? They are provably SOTA and in some cases enable infinite temperature. Temperature continues to be capped at a maximum of 0.2, and there's still the hidden top_k of 64 that you can't turn off. I love Google AI Studio, but I hate it too for not enabling a whole host of advanced features. So many mixed feelings, so many unanswered questions, so many frustrating UI decisions in a tool that is ostensibly aimed at prosumers... |
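For readers who haven't met min_p: it keeps only the tokens whose probability is at least `min_p` times the top token's probability, then renormalizes and samples. A minimal NumPy sketch of that rule (my own toy implementation, not anything Google exposes; function name and defaults are mine):

```python
import numpy as np

def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
    """Sample one token index using the min-p truncation rule."""
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()                                  # numerical stability
    probs = np.exp(z) / np.exp(z).sum()           # softmax
    # min-p: drop tokens whose probability is below min_p * max probability
    probs = np.where(probs >= min_p * probs.max(), probs, 0.0)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# With a dominant logit and min_p=0.5, only token 0 survives truncation:
print(min_p_sample([10.0, 0.0, -10.0], min_p=0.5))  # → 0
```

Self-hosted stacks that expose min_p behave like this; as the comment notes, hosted endpoints typically only surface temperature/top_p/top_k.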
|
| ▲ | nextworddev 2 hours ago | parent | prev | next [-] |
| It’s over for Anthropic. That’s why Google’s cool with Claude being on Azure. Also probably over for OpenAI |
|
| ▲ | casey2 3 hours ago | parent | prev | next [-] |
| The first paragraph is pure delusion. Why do investors like delusional CEOs so much? I would take it as a major red flag. |
|
| ▲ | denysvitali 4 hours ago | parent | prev | next [-] |
| Finally! |
|
| ▲ | rvz 4 hours ago | parent | prev | next [-] |
| I expect almost no-one to read the Gemini 3 model card. But here is a damning excerpt from the early leaked model card from [0]: > The training dataset also includes: publicly available datasets that are readily downloadable; data obtained by crawlers; licensed data obtained via commercial licensing agreements; user data (i.e., data collected from users of Google products and services to train AI models, along with user interactions with the model) in accordance with Google’s relevant terms of service, privacy policy, service-specific policies, and pursuant to user controls, where appropriate; other datasets that Google acquires or generates in the course of its business operations, or directly from its workforce; and AI-generated synthetic data. So your Gmail is being read by Gemini and put into the training set for future models. Oh dear, and Google is already being sued over Gemini analyzing users' data by default, which potentially includes Gmail [1]. Where is the outrage? [0] https://web.archive.org/web/20251118111103/https://storage.g... [1] https://www.yahoo.com/news/articles/google-sued-over-gemini-... |
| |
| ▲ | inkysigma 3 hours ago | parent | next [-] | | Isn't Gmail covered under the Workspace privacy policy which forbids using that for training data. So I'm guessing that's excluded by the "in accordance" clause. | | | |
| ▲ | recitedropper 3 hours ago | parent | prev | next [-] | | I'm pretty sure they mention in their various TOSes that they don't train on user data in places like Gmail. That said, LLMs are the most data-greedy technology of all time, and it wouldn't surprise me that companies building them feel so much pressure to top each other they "sidestep" their own TOSes. There are plenty of signals they are already changing their terms to train when previously they said they wouldn't--see Anthropic's update in August regarding Claude Code. If anyone ever starts caring about privacy again, this might be a way to bring down the crazy AI capex / tech valuations. It is probably possible, if you are a sufficiently funded and motivated actor, to tease out evidence of training data that shouldn't be there based on a vendor's TOS. There is already evidence some IP owners (like NYT) have done this for copyright claims, but you could get a lot more pitchforks out if it turns out Jane Doe's HIPAA-protected information in an email was trained on. | |
| ▲ | stefs 3 hours ago | parent | prev | next [-] | | I'm very doubtful Gmail messages are used to train the model by default, because emails contain private data, and as soon as that private data shows up in the model output, Gmail is done. "Gmail being read by Gemini" does NOT mean "Gemini is trained on your private Gmail correspondence". It can mean Gemini loads your emails into a session context so it can answer questions about your mail, which is quite different. |
| ▲ | Yizahi 2 hours ago | parent | prev | next [-] | | By the year 2025 I think most of the HN regulars and IT people in general are so jaded regarding privacy that it is not even surprising anyone. I suspect all gmails were analyzed and read from the beginning of google age, so nothing really changed, they might as well just admit it. Google is betting that moving email and cloud is such a giant hassle that almost no one will do it, and ditching YT and Maps is just impossible. | |
| ▲ | aoeusnth1 3 hours ago | parent | prev [-] | | This seems like a dubious conclusion. I think you missed this part: > in accordance with Google’s relevant terms of service, privacy policy |
|
|
| ▲ | mihau 4 hours ago | parent | prev | next [-] |
| @simonw wen pelican |
|
| ▲ | poemxo 2 hours ago | parent | prev | next [-] |
| It's amazing to see Google take the lead while OpenAI worsens their product every release. |
|
| ▲ | WXLCKNO 3 hours ago | parent | prev | next [-] |
| Valve could learn from Google here |
|
| ▲ | informal007 5 hours ago | parent | prev | next [-] |
| It seems that Google didn't prepare the Gemini 3 release well and leaked a lot of content early, including the model card and Gemini 3 on aistudio.google.com. |
|
| ▲ | John-Tony12 an hour ago | parent | prev | next [-] |
| Google’s Gemini 3, per their blog, is their most advanced AI yet — nuanced reasoning, powerful multimodal understanding, and an experimental “Deep Think” mode. Available now in Gemini app, AI Studio, Vertex AI, and Google Antigravity. |
|
| ▲ | irthomasthomas 3 hours ago | parent | prev [-] |
| I asked it to summarize an article about the Zizians, which mentions Yudkowsky SEVEN times. Gemini 3 did not mention him once. Tried it ten times and got zero mention of Yudkowsky, despite him being a central figure in the story.
https://xcancel.com/xundecidability/status/19908286970881311... Also, can you guess which pelican SVG was gemini 3 vs 2.5? https://xcancel.com/xundecidability/status/19908113191723213... |
| |
| ▲ | stickfigure 3 hours ago | parent | next [-] | | He's not a central figure in the narrative, he's a background character. Things he created (MIRI, CFAR, LessWrong) are important to the narrative, the founder isn't. If I had to condense the article, I'd probably cut him out too. Summarization is inherently lossy. | | |
| ▲ | irthomasthomas 3 hours ago | parent [-] | | > Eliezer Yudkowsky is a central figure in the article, mentioned multiple times as the intellectual originator of the community from which the "Zizians" splintered. His ideas and organizations are foundational to the entire narrative.
| | |
| ▲ | stickfigure 2 hours ago | parent | next [-] | | And yet you could eliminate him entirely and the story is still coherent. The story isn't about Yudkowsky. At each level of summarization you have to make hard decisions about what to keep. Not every story about the United States needs to mention George Washington. | |
| ▲ | Dilettante_ an hour ago | parent | prev [-] | | You're absolutely right! The AI said it, so it must be true! | | |
| ▲ | irthomasthomas an hour ago | parent [-] | | At least read what you respond to... Imagine thinking Yudkowsky was NOT a central figure in the Zizians story. | | |
| ▲ | Dilettante_ an hour ago | parent [-] | | You literally quoted the LLMs output verbatim as your proof. Edit: And upon skimming the article at the points where Yudkowsky's name is mentioned, I 100% agree with stickfigure. I challenge you to name one way in which the story falls apart without the mention of Yudkowsky. | | |
| ▲ | irthomasthomas 32 minutes ago | parent [-] | | It sounds like both of you are unfamiliar with the link between the Zizians and Yudkowsky. So let us just return to the discussion of Gemini 3: do you think the model did a bad job in its second response, then? | | |
| ▲ | Dilettante_ 16 minutes ago | parent [-] | | It literally does not matter how much they are connected out here in reality, the AI was to summarize the information in the article and that is exactly what it did. >do you think the model did a bad job then in it's second response Yes, very obviously it told you what you wanted to hear. This is behavior that should not be surprising to you. |
|
|
|
|
|
| |
| ▲ | gregsadetsky 3 hours ago | parent | prev | next [-] | | Interesting, yeah! Just tried "summarize this story and list the important figures from it" with Gemini 2.5 Pro and 3 and they both listed 10 names each, but without including Yudkowsky. Asking the follow up "what are ALL the individuals mentioned in the story" results in both models listing ~40 names and both of those lists include Yudkowsky. | |
| ▲ | briga 3 hours ago | parent | prev [-] | | Maybe it has guard rails against such things? That would be my main guess on the Zizian one. |
|