ziotom78 2 hours ago
I am a physics professor and often use Gemini to check my papers. It is a formidable tool: it found a clerical error (a missing imaginary unit in a complex mathematical expression) that I had been unable to find for days, and it often highlights connections between concepts and ideas that I overlooked. However, it often makes conceptual errors that I can spot only because I have good knowledge of the topic I am discussing. For instance, in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars. Good to know that ChatGPT 5.5 Pro can produce a publishable paper, but from what I have seen so far with Gemini, it seems to me that it is better to consider LLMs as very efficient students who can read papers and books in no time but still need a lot of mentoring.
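For readers who have not met the distinction, here is a minimal sketch of how the two exponentials differ. The conventions are assumptions, since the comment does not spell them out: an orthonormal basis e_1, e_2, e_3 of Cl(3,0), a unit bivector \hat{B}, the pseudoscalar I = e_1 e_2 e_3, a real angle \theta, and a vector v in the plane of \hat{B}. Both exponentials have the same Euler-like form, which may be why they are easy to mix up, but they live in different grades and act differently on vectors.

```latex
% Assumed conventions (not from the comment): orthonormal basis e_1, e_2, e_3 of Cl(3,0),
% unit bivector \hat{B} (e.g. e_1 e_2), pseudoscalar I = e_1 e_2 e_3, real angle \theta.
\begin{align*}
\hat{B}^2 &= -1, & e^{\theta \hat{B}} &= \cos\theta + \hat{B}\,\sin\theta
  && \text{(grades 0 and 2)} \\
I^2 &= -1, & e^{\theta I} &= \cos\theta + I\,\sin\theta
  && \text{(grades 0 and 3)}
\end{align*}
% Sandwich action on a vector v lying in the plane of \hat{B}:
\begin{align*}
e^{-\theta \hat{B}/2}\, v\, e^{\theta \hat{B}/2} &= v\, e^{\theta \hat{B}}
  && \text{(rotation by } \theta \text{ in the plane, since } \hat{B} v = -v \hat{B}\text{)} \\
e^{-\theta I/2}\, v\, e^{\theta I/2} &= v
  && \text{(no rotation, since } I v = v I\text{)}
\end{align*}
```

In short, the bivector exponential is a rotor, while the pseudoscalar commutes with everything in 3D and so drops out of the sandwich product.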
Quothling 3 minutes ago
We've got a rather extensive AI setup through our equity fund, and I've set up a group of agents for data architecture at scale. One is the main agent I discuss things with; it's set up to know our infrastructure and has access to image generation tools, web search, hand-off agents and other things. I tend to use Opus (4-6 currently) and I find it to be rather great. As you point out, it comes with the danger of making mistakes and, again as you point out, that's not an issue for things I'm an expert on.

What I rely on it for, however, is analysing how specific tools would fit into our architecture. In the past you would likely have hired a group of consultants to do this research, but now you can have an AI agent tell you what the advantages and disadvantages of Microsoft Fabric are in your setup. Since I don't know the capabilities of Fabric, I can't tell whether the AI gives me a correct analysis of a Lakehouse and a Warehouse (Fabric tools).

What I do to mitigate this is run fact-checking agents, configured to be extremely critical and unbiased, on Opus, Gemini and GPT. They are handed the entire conversation to review. The result is then handed off to an Opus agent that is set up to assume everything is wrong. After this, if I'm convinced something is correct, I hand the entire thing off to a Sonnet agent, which is set up to go through the source material and give me a compiled list of exactly what I need to verify.

It's ridiculously effective, but I do wonder how it would work for someone who couldn't challenge the analytic agent on the domain knowledge it gets wrong. Because despite knowing our architecture and needs, it often makes conceptual errors in the "science" (I'm not sure what the English word for this is) of data architecture. Each iteration gets better though, and with the image generation tools, "drawing" the architecture for presentations from C-level to nerds is ridiculously easy.
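The hand-off chain described above could be sketched roughly like this. It is only an illustration: every name here (`Agent`, `Review`, `run_review`, `human_approves`) is hypothetical, an "agent" is reduced to a plain function from text to text, and no real model API is called. The stages follow the comment: several independent fact checkers, then one "assume everything is wrong" agent, then a human decision, then a compiler that lists what still needs verifying.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in: an "agent" is any function from prompt text to reply text.
# In practice each would wrap a call to a different model; none of that is shown here.
Agent = Callable[[str], str]


@dataclass
class Review:
    critiques: List[str]   # one reply per fact-checking agent
    skeptic_report: str    # reply of the "assume everything is wrong" agent
    checklist: List[str]   # items left for the human to verify (empty if not approved)


def run_review(
    conversation: str,
    fact_checkers: List[Agent],
    skeptic: Agent,
    compiler: Agent,
    human_approves: Callable[[str], bool],
) -> Review:
    # Stage 1: every fact checker independently reviews the whole conversation.
    critiques = [check(conversation) for check in fact_checkers]

    # Stage 2: the skeptic sees the conversation plus all critiques.
    bundle = "\n\n".join([conversation, *critiques])
    skeptic_report = skeptic(bundle)

    # Human gate: stop here unless the person is convinced by what they have read.
    if not human_approves(skeptic_report):
        return Review(critiques, skeptic_report, [])

    # Stage 3: the compiler returns one item per line; strip bullet markers.
    raw = compiler("\n\n".join([bundle, skeptic_report]))
    checklist = [line.lstrip("- ").strip() for line in raw.splitlines() if line.strip()]
    return Review(critiques, skeptic_report, checklist)


if __name__ == "__main__":
    # Canned replies so the sketch runs without any model access.
    review = run_review(
        conversation="Draft analysis of Lakehouse vs Warehouse",
        fact_checkers=[lambda text: "Checker A: claim 2 lacks a source"],
        skeptic=lambda text: "Skeptic: treat the cost estimate as wrong",
        compiler=lambda text: "- confirm claim 2\n- confirm cost estimate",
        human_approves=lambda report: True,
    )
    for item in review.checklist:
        print(item)
```

Keeping the human decision as an explicit step mirrors the point of the comment: the chain narrows down what to check, it does not replace the check.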
nopinsight 2 hours ago
I assume you're using the "regular" Pro version of Gemini 3.1 for the above, rather than the Deep Think mode, which is more comparable to GPT-5.5 Pro. To my knowledge, regular 3.1 Pro is a tier below and often makes mistakes.

Moreover, there's no reason to believe the progress of LLMs, which couldn't reliably solve high-school math problems just 3–4 years ago, will stop anytime soon.

You might want to track the progress of these models on the CritPt benchmark, which is built on *unpublished, research-level* physics problems. Frontier models are still nowhere near solving it, but progress has been rapid.

* o3 (high) <1.5 years ago was at 1.4%
* GPT 5.4 (xhigh), 23.4%
* GPT-5.5 (xhigh), 27.1%
* GPT-5.5 Pro (xhigh) 30.6%
maximamas an hour ago
LLMs are at their best when you have an expectation for their output. I generally know the shape of the correct response, and that allows me to evaluate its output on its "vibes" rather than line by line. If there's no expectation, then I have to take everything at face value, and now I'm at the mercy of the machine.
tags2k 2 hours ago
I'm no physics professor, but this aligns with the way I use the tools in my "senior engineer" space. I bring the fundamentals to sanity-check the trigger-happy agent and try to imbue other humans with those fundamentals so they can move towards doing the same. It feels like the only way this whole thing will work (besides eventually moving to local models that do less but that companies can afford).
illiac786 31 minutes ago
Using the word “mentoring” is anthropomorphic and subconsciously makes you think it will learn. It does not, and it is a formidable task for the human brain to remember that something as smart as an LLM does not learn. I keep catching myself making the same mistake. It’s also because it is so annoying to have to manage the memory of the LLM manually with custom prompts/instructions. I have not yet played with the long-term memory feature, but I fear it will be even less reliable than prompts, simply because in a year or two so much will have changed again that this “memory” will have to be redone multiple times by then.
mixtureoftakes 2 hours ago
please, sign up for a paid plan of either chatgpt or claude. gemini, while close, is still noticeably behind. you deserve opinions shaped by interactions with the best tools that are out there.
recursivecaveat 2 hours ago
This is close to my experience with code. LLMs can pick out small mistakes from giant code changes with surprising accuracy, or slowly narrow down a weird bug. On the other hand, I've seen them bravely soldier on under completely incorrect conceptual models of what they're working with and consequently churn around in circles, spin up giant piles of slop to re-implement something they decided was necessary but didn't bother to search for, or outright dismiss important error signals as just 'transient failures'. Unlimited stamina, low wisdom.
tasuki 17 minutes ago
> in 3D Clifford algebras it repeatedly confuses exponential of bivectors and of pseudoscalars.

I have no idea what any of those words even mean. I'm sure LLMs make similar obvious-to-professors mistakes in all domains. Not long ago, we didn't even have chatbots capable of basic conversation...
wood_spirit 2 hours ago
Chiming in to agree, but also to clarify that the latest sota models are no better than Gemini. I put my stuff through several sota models and round-robin them in adversarial collaboration, and they are all useful even though, fundamentally, they don’t “understand” anything. But they are super useful delegates as long as deciding on the problem, the approach and the solution all sits safely in your head, so you can challenge them and steer them.

So I know the article is about one particular new model acing something, and each vendor wants these stories to position their model as now good enough to replace humans and all other models. But working somewhere where I am lucky enough to be able to use all the sota models all the time, I can say that they all keep making obvious mistakes, and using them all adversarially is way better than trusting just one.

I look forward to the day a small open model that we can run ourselves outperforms the sum of all today’s models. That’s when enough is enough and we can let things plateau.
DeathArrow 43 minutes ago
I don't think the experience with Gemini will be the same when using GPT.
cyanydeez 2 hours ago
I've been watching the automation of things like flight control systems for the past decade, and the evolution of the fallback to a real pilot in the event of an emergency is what concerns me most about where LLMs are being embedded. Right now, we have a lot of smart people who have trained for decades to understand where these things go wrong and how to nudge them back, but that pool of people is slowly going to be replaced by less knowledgeable ones. At some point, a Rubicon will be crossed where these systems can't fall back to a human operator and will fail spectacularly.