ziotom78 2 hours ago

I am a physics professor and often use Gemini to check my papers. It is a formidable tool: it was able to find a clerical error (a missing imaginary unit in a complex mathematical expression) I was not able to find for days, and it often underlines connections between concepts and ideas that I overlooked.

However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.
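To illustrate the distinction the models keep fumbling, here is a self-contained numerical sketch (not from the original comment; the tiny Clifford-algebra implementation is my own, using the standard bitmask representation of basis blades in Euclidean Cl(3,0)). Both a unit bivector B and the pseudoscalar I square to -1, so their exponentials look formally identical, cos θ + (B or I) sin θ — but the bivector exponential is a rotor that rotates vectors under the sandwich product, while the pseudoscalar commutes with everything, so its exponential leaves vectors untouched:

```python
from math import cos, sin, isclose

# Basis blades of Cl(3,0) as bitmasks: bit i set means e_{i+1} is a factor.
# 0b000 = 1, 0b001 = e1, 0b011 = e12 (a bivector), 0b111 = e123 (pseudoscalar I).

def reorder_sign(a, b):
    """Sign from reordering the factors of blade a past those of blade b."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1

def gp(x, y):
    """Geometric product of multivectors stored as {bitmask: coeff} dicts."""
    out = {}
    for a, ca in x.items():
        for b, cb in y.items():
            blade = a ^ b  # repeated factors cancel (they square to +1 here)
            out[blade] = out.get(blade, 0.0) + reorder_sign(a, b) * ca * cb
    return out

def add(x, y, s=1.0):
    out = dict(x)
    for b, c in y.items():
        out[b] = out.get(b, 0.0) + s * c
    return out

def exp_mv(x, terms=30):
    """exp of a multivector via a truncated power series."""
    result, power, fact = {0: 1.0}, {0: 1.0}, 1.0
    for n in range(1, terms):
        power = gp(power, x)
        fact *= n
        result = add(result, {b: c / fact for b, c in power.items()})
    return result

def rev(x):
    """Reverse: a grade-k part picks up the sign (-1)^(k(k-1)/2)."""
    return {b: c * (-1) ** (bin(b).count("1") * (bin(b).count("1") - 1) // 2)
            for b, c in x.items()}

theta = 0.3
RB = exp_mv({0b011: theta})  # exp(theta e12): a rotor, cos(theta) + e12 sin(theta)
RI = exp_mv({0b111: theta})  # exp(theta e123): cos(theta) + e123 sin(theta)

# Same closed form, because both e12 and e123 square to -1 ...
assert isclose(RB[0b000], cos(theta)) and isclose(RB[0b011], sin(theta))
assert isclose(RI[0b000], cos(theta)) and isclose(RI[0b111], sin(theta))

# ... but completely different geometric action on the vector e1:
v = {0b001: 1.0}
vB = gp(gp(RB, v), rev(RB))  # rotated by 2*theta in the e12 plane
vI = gp(gp(RI, v), rev(RI))  # unchanged: I commutes with everything
```

The sandwich with the rotor yields cos(2θ) e1 − sin(2θ) e2, while the pseudoscalar sandwich returns e1 exactly — which is precisely the distinction the model keeps blurring.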

Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.

Quothling 3 minutes ago | parent | next [-]

We've got a rather extensive AI setup through our equity fund, and I've set up a group of agents for data architecture at scale. One is the main agent I discuss with; it's set up to know our infrastructure and has access to image generation tools, web search, hand-off agents, and other things. I tend to use Opus (4-6 currently) and find it rather great. As you point out, it comes with the danger of making mistakes, and, again as you point out, that's not an issue for things I'm an expert on. What I rely on it for, however, is analysing how specific tools would fit into our architecture. In the past you would likely have hired a group of consultants to do this research, but now you can have an AI agent tell you the advantages and disadvantages of Microsoft Fabric in your setup. Since I don't know the capabilities of Fabric, I can't tell whether the AI gives me a correct analysis of a Lakehouse versus a Warehouse (Fabric tools).

What I do to mitigate this is have fact-checking agents on Opus, Gemini, and GPT, configured to be extremely critical and unbiased, which are handed the entire conversation to review. It's then handed off to an Opus agent set up to assume everything is wrong. After this, if I'm convinced something is correct, I hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I'll need to verify.
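A pipeline like the one described could be sketched roughly as follows. This is purely illustrative: the model names are stand-ins, and `call_model` is a placeholder for whatever SDK call each vendor actually provides, not a real API.

```python
# Hedged sketch of a multi-model review pipeline. `call_model` is a stub;
# a real version would call each vendor's API with the given system prompt.

def call_model(model: str, system: str, conversation: str) -> str:
    """Placeholder for a real LLM API call."""
    return f"[{model} reviewing under: {system!r}]"

CRITICAL = "You are an extremely critical, unbiased reviewer. Flag every claim you doubt."
ADVERSARIAL = "Assume every statement in this conversation is wrong until proven otherwise."
COMPILE = "Go through the source material and list exactly what must be verified by hand."

def review_pipeline(conversation: str) -> dict:
    # Stage 1: independent critical reviews from three different model families.
    reviews = {m: call_model(m, CRITICAL, conversation)
               for m in ("opus", "gemini", "gpt")}
    # Stage 2: an adversarial pass over the conversation plus all reviews.
    adversarial = call_model("opus", ADVERSARIAL,
                             conversation + "\n" + "\n".join(reviews.values()))
    # Stage 3: compile the final human-verification checklist.
    checklist = call_model("sonnet", COMPILE, adversarial)
    return {"reviews": reviews, "adversarial": adversarial, "checklist": checklist}
```

The key design point is that the stages are sequential and each sees the full transcript, so disagreements between models surface before anything reaches the human verification list.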

It's ridiculously effective, but I do wonder how it would work for someone who couldn't challenge the analytic agent on domain knowledge it gets wrong. Because despite knowing our architecture and needs, it'll often make conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better, though, and with the image generation tools, "drawing" the architecture for presentations, from C-level to nerds, is ridiculously easy.

nopinsight 2 hours ago | parent | prev | next [-]

I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.

Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems:

https://critpt.com/

Frontier models are still nowhere near solving it, but progress has been rapid.

* o3 (high), <1.5 years ago: 1.4%

* GPT-5.4 (xhigh): 23.4%

* GPT-5.5 (xhigh): 27.1%

* GPT-5.5 Pro (xhigh): 30.6%

https://artificialanalysis.ai/evaluations/critpt

civvv 6 minutes ago | parent [-]

There are many indications that model progress is slowing down, so that is not entirely accurate.

maximamas an hour ago | parent | prev | next [-]

LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response, which allows me to evaluate the output on its "vibes" rather than line by line. If there's no expectation, I have to take everything at face value, and then I'm at the mercy of the machine.

ziotom78 9 minutes ago | parent | next [-]

I agree, but I would add that they can be very useful even if you do not have clear expectations, as long as you have some solid way to verify their claims. Often, in doing this verification, I have come up with new ideas.

jillesvangurp 42 minutes ago | parent | prev [-]

Exactly. If I generate a large chunk of software, I'm going to have expectations about what it will do, how it will do it, etc. You don't just accept the statement that "it's done" as fact; you start looking for evidence.

A scientific approach here is to try to falsify the statement. You start asking questions and running tests and experiments to disprove the notion that it is done. At some point you run out of such tests, and it's probably done for some useful notion of done-ness.

I've built some larger components and things with AI. It's never a one-shot kind of deal. But the good news is that you can use more AI to do a lot of the evaluation work. And if you align your agents right, the process almost runs itself. Mostly I just nudge it along: "Did you think about X? What about Y? Let's test Z."
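The falsification approach above can be sketched as a test file: rather than one happy-path check that confirms "done", you write checks that actively try to break the claim. The `slugify` function here is a hypothetical stand-in for something an agent claimed was finished.

```python
# Sketch of falsifying "it's done": probe the edge cases that would
# disprove the claim, not just the case the agent demoed. `slugify` is
# a made-up example function, not from any particular codebase.

def slugify(title: str) -> str:
    """Lowercase, drop non-alphanumerics, join words with hyphens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in title)
    return "-".join(cleaned.split())

def test_happy_path():
    # The claim the agent made: this one passing proves very little.
    assert slugify("Hello World") == "hello-world"

def test_falsification_attempts():
    # Each assert is an experiment designed to prove "done" wrong.
    assert slugify("") == ""                 # empty input
    assert slugify("  --  ") == ""           # nothing slug-worthy at all
    assert slugify("Ünïcode Tïtle") != ""    # non-ASCII input is not dropped whole
```

When the falsification tests stop finding counterexamples, "done" has earned some actual evidence behind it.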

tags2k 2 hours ago | parent | prev | next [-]

I'm no physics professor, but this aligns with the way I use the tools in my "senior engineer" space. I bring the fundamentals to sanity-check the trigger-happy agent, and I try to imbue other humans with those fundamentals so they can move towards doing the same. It feels like the only way this whole thing will work (besides eventually moving to local models that do less but that companies can afford).

illiac786 31 minutes ago | parent | prev | next [-]

Using the word "mentoring" is anthropomorphic and subconsciously makes you think it will learn. It does not, and it is a formidable task for the human brain to remember that something as smart as an LLM does not learn. I keep catching myself making the same mistake.

It's also because it is so annoying to have to manage the LLM's memory manually with custom prompts and instructions.

I have not yet played with the long-term memory feature, but I fear it will be even less reliable than prompts, simply because in a year or two so much will have changed again that this "memory" will have had to be redone multiple times by then.

timschmidt 27 minutes ago | parent [-]

They can form new associations between concepts via their input prompts and thinking text. That is a form of learning, just not a very durable one. I liken it to https://en.wikipedia.org/wiki/Anterograde_amnesia

illiac786 26 minutes ago | parent [-]

Yeah, I should have been more specific: I meant the type of learning that mentoring fosters, long-term learning.

timschmidt 22 minutes ago | parent [-]

I hear you. I think we are already seeing some middle ground with agentic systems using RAG, skills.md files, etc. It's a sort of dissociated card-catalog memory. An engineer's notebook. Not the integrated, correlated, pre-processed set of relationships in the model. How to go backward from notebook -> model cheaply, without tanking performance, is definitely one of those billion-dollar questions.
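The "engineer's notebook" idea can be sketched as a minimal retrieval step: notes live outside the model, and the relevant ones are pulled into the prompt on each turn. Everything here is illustrative — the scoring is naive word overlap where a real system would use embeddings, and the notebook entries are invented examples.

```python
# Minimal sketch of notebook-style memory: external notes retrieved into
# the context window. Word-overlap scoring stands in for real embeddings.

def score(query: str, note: str) -> float:
    """Fraction of query words that also appear in the note."""
    q, n = set(query.lower().split()), set(note.lower().split())
    return len(q & n) / (len(q) or 1)

def retrieve(query: str, notebook: list[str], k: int = 2) -> list[str]:
    """Return the k notes most relevant to the query."""
    return sorted(notebook, key=lambda note: score(query, note), reverse=True)[:k]

# Hypothetical notebook contents, for illustration only.
notebook = [
    "Lakehouse tables are stored as Delta files in the data lake",
    "Warehouse endpoints accept SQL queries only",
    "CI pipeline deploys notebooks on merge to main",
]

context = retrieve("how does the Lakehouse store tables", notebook)
# `context` would be prepended to the next prompt, giving the model its "notes"
# without any change to the model's weights.
```

The open question in the comment is exactly the gap this sketch exposes: retrieval hands the model text, but nothing here turns that text back into the dense, pre-correlated associations the weights encode.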

mixtureoftakes 2 hours ago | parent | prev | next [-]

Please sign up for a paid plan of either ChatGPT or Claude. Gemini, while close, is still noticeably behind.

You deserve opinions shaped by interactions with the best tools out there.

wg0 an hour ago | parent | next [-]

Gemini feels deep and philosophical, especially for product management. Tell it you're a product manager and that the two of you are a team.

But a regular reminder: all LLMs can be wrong all the time. I only work with LLMs in domains I'm an expert in, or where I have other sources to verify their output with utmost certainty.

cubefox an hour ago | parent | prev | next [-]

Gemini is certainly not behind Claude in terms of physics.

hodgehog11 an hour ago | parent | prev | next [-]

ChatGPT and Gemini are actually fairly comparable.

Claude has been utterly useless with most math problems in my experience because, much like less capable students, it tends to get bogged down in tedious details before getting to the big picture. That's great for programming, not so much for frontier math. If you're giving it little lemmas, then sure, it's great, but otherwise you're just burning tokens.

peyton 2 hours ago | parent | prev [-]

Seriously, it's not worth reaching for less intelligence. Use Extended Pro 100% of the time for anything you'd spend as much time on as GP spent writing their post.

recursivecaveat 2 hours ago | parent | prev | next [-]

This is close to my experience with code. LLMs can pick out small mistakes in giant code changes with surprising accuracy, or slowly narrow down a weird bug. On the other hand, I've seen them bravely soldier on under completely incorrect conceptual models of what they're working with and consequently churn around in circles, spin up giant piles of slop to re-implement something they decided was necessary but didn't bother to search for, or outright dismiss important error signals as just "transient failures". Unlimited stamina, low wisdom.

tasuki 17 minutes ago | parent | prev | next [-]

> in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all domains. Not long ago, we didn't even have chatbots capable of basic conversation...

wood_spirit 2 hours ago | parent | prev | next [-]

Chiming in to agree, but to clarify that the latest SOTA models are no better than Gemini.

I put my stuff through several SOTA models and round-robin them in adversarial collaboration, and they are all useful, even though, fundamentally, they don't "understand" anything. But they are super useful delegates as long as deciding on the problem, the approach, and the solution all sits safely in your head, so you can challenge and steer them.

So I know the article is about one particular new model acing something, and each vendor wants these stories to position their model as now good enough to replace humans and all other models. But working somewhere where I am lucky enough to use all the SOTA models all the time, I can say that they all keep making obvious mistakes, and using them adversarially against each other is way better than trusting just one.

I look forward to the day when a small open model that we can run ourselves outperforms the sum of all of today's models. That's when enough is enough and we can let things plateau.

DeathArrow 43 minutes ago | parent | prev | next [-]

I don't think the experience with Gemini will be the same when using GPT.

cyanydeez 2 hours ago | parent | prev [-]

I've been watching the automation of things like flight control systems for the past decade, and the erosion of the fallback to a real pilot in the event of an emergency is what's most concerning about where LLMs are being embedded.

Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but that pool of people is slowly going to be replaced by less knowledgeable ones.

At some point, a Rubicon will be crossed where these systems can't fall back to a human operator and will fail spectacularly.

leptons 21 minutes ago | parent [-]

We're on the road to Idiocracy.