Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly.

alphabetting 2 hours ago | parent | next [-]

the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big from 3.0 to 3.1.

For example, the APEX-Agents benchmark for long-horizon investment banking, consulting, and legal work:

1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%
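To put a number on "the gains are big", here is a minimal sketch that works out the 3.0 to 3.1 jump from the scores quoted above. The scores are copied from the list; the `gain` helper and variable names are just illustrative, not part of any benchmark tooling.

```python
# Scores copied from the APEX-Agents list quoted above (percent).
scores = {
    "Gemini 3.1 Pro": 33.2,
    "Opus 4.6": 29.8,
    "GPT 5.2 Codex": 27.6,
    "Gemini Flash 3.0": 24.0,
    "GPT 5.2": 23.0,
    "Gemini 3.0 Pro": 18.0,
}


def gain(old: str, new: str) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = scores[new] - scores[old]
    relative = 100.0 * absolute / scores[old]
    return round(absolute, 1), round(relative, 1)


abs_gain, rel_gain = gain("Gemini 3.0 Pro", "Gemini 3.1 Pro")
lead = round(scores["Gemini 3.1 Pro"] - scores["Opus 4.6"], 1)

print(f"3.0 Pro -> 3.1 Pro: +{abs_gain} points ({rel_gain}% relative)")
print(f"Lead over Opus 4.6: {lead} points")
```

So on these figures the jump is 15.2 points (about 84% relative), while the lead over Opus 4.6 is 3.4 points.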

	HardCodedBias 29 minutes ago | parent [-]
	LOL come on man. Let's give it a couple of days, since no one believes anything from benchmarks, especially from the Gemini team (or Meta). If we see on HN that people are willing to switch their coding environment, we'll know "hot damn, they cooked"; otherwise this is another whiff by Google.

onlyrealcuzzo 2 hours ago | parent | prev | next [-]

Because Search is not agentic.

Most of Gemini's users are Search converts doing extended-Search-like behaviors.

Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

Macha 2 hours ago | parent [-]

> Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

I do wonder what percentage of revenue they are. I expect it's very outsized relative to usage (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

	onlyrealcuzzo an hour ago | parent | next [-]
	> (e.g. approximately nobody who is receiving them is paying for those summaries at the top of search results)

	Nobody is paying for Search. According to Google's earnings reports, AI Overviews is increasing overall clicks on ads and overall search volume.
	curly6 an hour ago | parent | prev [-]
	[dead]

ionwake 2 hours ago | parent | prev [-]

Can you explain what you mean by it's bad at agentic stuff?

	karmasimida 2 hours ago | parent [-]
	Accomplish the task I give it without fighting me on it. I think this is a classic precision/recall issue: the model needs to stay on task, but also infer what the user might want but did not explicitly state. Gemini seems particularly bad at that recall, where it goes out of bounds.
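One way to make the precision/recall analogy concrete is a rough sketch using the textbook set-based definitions. The action names and both sets are made up for illustration; nothing here comes from a real agent trace. Under this framing, out-of-scope actions lower precision and missed implied wants lower recall.

```python
def precision_recall(taken: set[str], wanted: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of an agent's actions.

    taken:  actions the agent actually performed
    wanted: actions the user wanted, stated or implied
    """
    hits = len(taken & wanted)
    precision = hits / len(taken) if taken else 1.0
    recall = hits / len(wanted) if wanted else 1.0
    return precision, recall


# Hypothetical task: the user asks for a test fix and implicitly
# expects the changelog to be updated too.
wanted = {"fix failing test", "update changelog"}

# Hypothetical agent run: fixes the test, skips the changelog,
# and makes two unrequested changes.
taken = {"fix failing test", "rename module", "reformat repo"}

precision, recall = precision_recall(taken, wanted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of three actions was wanted (precision about 0.33) and one of two wanted actions was done (recall 0.50).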