observationist a day ago

Reconfiguring existing proofs in ways that have been tedious or obscured from humans, or using well-framed methods in novel ways, will be done at superhuman speeds, and it'll unlock all sorts of capabilities well before we have to be concerned about AGI. It's going to be awesome to see what mathematicians start to do with AI tools as the tools become capable of truly keeping up with what the mathematicians want from them. It won't necessarily be a huge direct benefit for non-mathematicians at first, because the abstract and complex results won't have direct applications, but we might start to see millennium problems get taken down as legitimate frontier-model benchmarks.

Or someone like Terence Tao might figure out how to wield AI better than anyone else, even the labs, and use the tools to take a bunch down at once. I'm excited to see what's coming this year.

Davidzheng a day ago | parent | next [-]

I don't think there's a real boundary between reconfiguring existing proofs, combining existing methods, and "truly novel" math.

xorcist a day ago | parent | prev | next [-]

> Reconfiguring existing proofs in ways that have been tedious or obscured from humans,

To a layman, that doesn't sound very AI-like. Surely there must be a dozen algorithms that can effectively search this space already, given that mathematics is pretty logical?

tombert a day ago | parent | next [-]

I actually know a bit about this, since it was part of what I was studying during my (incomplete) PhD.

Isabelle has had the "Sledgehammer" tool for quite a while [1]. It uses solvers like z3 to search and apply a catalog of proof strategies and then tries to construct a proof for your main goal or any remaining subgoals you have to complete. It's not perfect, but it's remarkably useful (even if it does sometimes give you proofs that import like ten different libraries and are hard to read).

I think Coq has CoqHammer, but I haven't played with that one yet.

[1] https://isabelle.in.tum.de/dist/doc/sledgehammer.pdf

matu3ba a day ago | parent [-]

Sorry for my probably senseless questions; I'm trying to map the computing model of math solvers to common PL semantics. Probably there is better overview literature, and I'd like to get an overview of proof-system runtime semantics for later usage.

1. Does this mean that Sledgehammer and CoqHammer offer concolic testing based on an input framework (say, some formalization of a computing/math system) for some sort of system execution/evaluation, or does this only work for hand-rolled systems/mathematical expressions?

2. Is there an equivalent of fuzz testing (of computing systems) in math, say to construct the general proof framework?

3. Or how are proof frameworks (based on ideas of how the proof could work) constructed?

4. Do I understand correctly that math in proof systems uses term rewrite systems plus the chosen theory/logic as its computing model of valid representations and operations? How is the step semantics then formally defined?

zozbot234 a day ago | parent | next [-]

1. Sledgehammer/CoqHammer are automated proof-search/assistant tools that bridge interactive proof assistants (Isabelle/Coq) to external automated provers (ATP/SMT); they aren't about concolic testing in a PL sense. They translate goals and context into formats the external provers understand, call those provers, and replay/translate the returned proofs or proof fragments back into the ITP. That's search for and translation of a complete proof, not running concrete+symbolic program executions like concolic testing.

2. there is no exact analogue of fuzz-testing bytecode for "theorem fuzzing", but arguably the closest match is counterexample generators: model finders, finite-model search, and SMT instantiation with concrete valuations all serve a role similar to fuzzers by finding invalid conjectures quickly.

these are weaker than fuzzing for finding execution bugs because mathematical statements are higher-level and provers operate symbolically.

3. here's how proof frameworks are constructed, at a high level:

a. start by picking a logical foundation: e.g., first-order logic (FOL), higher-order logic (HOL), dependent type theory (Martin-Löf/CIC).

b. define syntax (terms, types, formulas) and typing/judgment rules (inference rules or typing rules).

c. define proof objects or proof rules: natural deduction, sequent calculus, Hilbert-style axioms, or type-as-proofs (Curry-Howard).

d. pick or design the kernel: small, trusted inference engine that checks proof objects (reduction rules, conversion, rule admissibility).

e. add automation layers: tactics, decision procedures, external ATP/SMT, rewriting engines.

f. provide libraries of axioms/definitions and extraction/interpretation mechanisms (code extraction, models).

g. implement proof-search strategies and heuristics (backtracking, lemma databases, proof-producing solvers).

this is the standard engineering pipeline behind modern ITPs and automated systems.
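to make this concrete, here's a minimal sketch in Lean 4 (my own illustration; the statement and names are arbitrary, not taken from any system discussed above). the first proof is a raw proof object that the kernel checks directly (steps a-d); the second discharges the same statement through the tactic layer (step e), which elaborates down to a kernel-checked term:

    -- a raw proof object: the kernel checks that this term has the stated type
    theorem and_swap (p q : Prop) (h : p ∧ q) : q ∧ p :=
      And.intro h.right h.left

    -- the same statement via the tactic/automation layer; the tactic script
    -- elaborates to a proof term that the kernel re-checks
    example (p q : Prop) (h : p ∧ q) : q ∧ p := by
      exact ⟨h.right, h.left⟩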

4. yes, many proof assistants and automated provers treat computation inside proofs as:

a. term rewriting / reduction (beta-reduction, delta-unfolding, normalization) for computation-oriented parts; this is the "computation" layer.

b. a separate deductive/logic layer for reasoning (inference rules, quantifier instantiation, congruence closure).

the combined model is: terms represent data/computation; rewriting gives deterministic computation semantics; logical rules govern valid inference about those terms. Dependent type theories conflate computation and proof via conversion (definitional equality) in the kernel.
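as a small, hedged Lean 4 illustration of the two layers (my example, not from any particular text): the first line is accepted purely by the computation layer (both sides normalize to the same term, so conversion/definitional equality applies), while the last line is a pure inference step with no computation involved:

    -- computation layer: 2 + 2 and 4 reduce to the same normal form,
    -- so the kernel accepts rfl via definitional equality
    example : 2 + 2 = 4 := rfl

    -- #reduce applies the reduction rules (beta, iota, ...) to a term
    #reduce (fun n : Nat => n + 1) 3   -- 4

    -- logic layer: modus ponens, an inference rule rather than a computation
    example (p q : Prop) (hpq : p → q) (hp : p) : q := hpq hp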

5. here's how a proof step's semantics is defined:

proof steps are applications of inference rules transforming sequents/judgments. formally:

a. a judgment form J (e.g., Γ ⊢ t : T or Γ ⊢ φ) is defined.

b. inference rules are of the form: from premises J1,...,Jn infer J, written as (J1 ... Jn) / J.

c. a proof is a finite tree whose nodes are judgments; leaves are axioms or assumptions; each internal node follows an inference rule.

d. for systems with computation, a reduction relation → on terms defines definitional equality; many rules use conversion: if t →* t' and t' has the form required by a rule, the rule applies.

e. in type-theoretic kernels, the step semantics is checking that a proof object reduces/normalizes and that constructors/eliminators are used respecting typing/conversion; the kernel checks a small set of primitive steps (e.g., beta, iota reductions, rule applications).

operationally: a single proof step = instantiate a rule schema with substitutions for metavariables, perform any required reductions/conversions, and check side conditions (freshness, well-formedness).

equational reasoning and rewriting: a rewrite rule l → r can be applied to a term t at a position p if the subterm at p matches l under some substitution σ; result is t with that subterm replaced by σ(r). Confluence/termination properties determine global behavior.
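a small Lean 4 instance of that scheme (a sketch; the lemma names Nat.add_zero : n + 0 = n and Nat.add_comm : n + m = m + n are from Lean's standard library): Nat.add_zero matches the subterm a + b + 0 under σ = {n ↦ a + b}, then Nat.add_comm rewrites a + b to b + a, after which the goal closes by reflexivity:

    example (a b : Nat) : a + b + 0 = b + a := by
      -- rewrite at the position matching n + 0, then at the position matching n + m
      rw [Nat.add_zero, Nat.add_comm]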

higher-level tactics encode sequences of such primitive steps (rule applications + rewrites + searches) but are not part of kernel semantics; only the kernel rules determine validity.

relevant concise implications for mapping to PL semantics:

treat proof state as a machine state (context Γ, goal G, local proof term), tactics as programs that transform state; kernel inference rules are the atomic instruction set; rewriting/normalization are deterministic evaluation semantics used inside checks.
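for instance, in Lean 4 (a hedged sketch of the analogy, not an official semantics), each tactic is a program transforming the state (context Γ, goal):

    example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
      -- state: Γ = {hp : p, hq : q}, goal = p ∧ q
      constructor
      -- state: two goals, ⊢ p and ⊢ q
      · exact hp   -- closes the first goal
      · exact hq   -- closes the second goal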

automation ≈ search processes over nondeterministic tactic/program choices; model finders/SMTs act as external oracles producing either counterexamples (concrete) or proof certificates (symbolic).

wizzwizz4 a day ago | parent | prev [-]

These questions are hard to understand.

1. Yes, the Sledgehammer suite contains three testing systems that I believe are concolic (quickcheck, nitpick, nunchaku), but they only work due to hand-coded support for the mathematical constructs in question. They'd be really inefficient for a formalised software environment, because they'd be operating at a much lower level than the abstractions of the software environment, unless dedicated support for that software environment were provided.

2. Quickcheck is a fuzz-testing framework, but it doesn't help to construct anything (except as far as it constructs examples / counterexamples). Are you thinking of something that automatically finds and defines intermediary lemmas for arbitrary areas of mathematics? Because I'm not aware of any particular work in that direction: if computers could do that, there'd be little need for mathematicians.

3. By thinking really hard, then writing down the mathematics. Same way you write computer programs, really, except there's a lot more architectural work. (Most of the time, the computer can brute-force a proof for you, so you need only choose appropriate intermediary lemmas.)

4. I can't parse the question, but I suspect you're thinking of the meta-logic / object-logic distinction. The actual steps in the term rewriting are not represented in the object logic: the meta-logic simply asserts that it's valid to perform these steps. (Not even that: it just does them, accountable to none other than itself.) Isabelle's meta-logic is software, written in a programming language called ML.

matu3ba a day ago | parent [-]

Sorry, it's 4am and I should sleep, but got very interested. Thank you very much for the excellent overview. This explains all my current questions.

Legend2440 a day ago | parent | prev | next [-]

The issue with traditional logic solvers ('good old-fashioned AI') is that the search space is extremely large, or even infinite.

Logic solvers are useful, but not tractable as a general way to approach mathematics.

zozbot234 a day ago | parent [-]

> Logic solvers are useful, but not tractable as a general way to approach mathematics.

To be clear, there are explicitly computationally tractable fragments of existing logics, but they're more-or-less uninteresting by definition: they often look like very simple taxonomies (i.e. purely implicational) or like a variety of "modal" and/or "multi-modal" constructions over simpler logics.

Of course, it would be nice to explicitly tease out and write down the "computationally tractable" general logical reasoning that some existing style of proof is implicitly relying on (AIUI this kind of inquiry would generally fall under "synthetic mathematics": trying to find simple axiom- and rule-of-inference-style treatments for existing complex theories), but that's also difficult.

AndrewKemendo a day ago | parent | prev [-]

Define "layman" here.

The way you use the term AI tells me that you are a representative of laymen, so what precisely are you trying to define?

It might be helpful to understand the term artificial intelligence first:

https://kemendo.com/Understand-AI.html

sublinear a day ago | parent | prev | next [-]

I agree only with the part about reconfiguring existing proofs. That's the value here. It is still likely very tedious to confirm what the LLMs say, but at least it's better than waiting for humans to do this half of the work.

For all topics that can be expressed with language, the value of LLMs is shuffling things around to tease out a different perspective from the humans reading the output. This is the only realistic way to understand AI enough to make it practical and see it gain traction.

As much as I respect Tao, I feel like his comments about AI usage can be misleading without carefully reading what he is saying in the linked posts.

roadside_picnic a day ago | parent | next [-]

> It is still likely very tedious to confirm what the LLMs say,

A large amount of Tao's work is around using AI to assist in creating Lean proofs.

I'm generally on the more skeptical side of things regarding LLMs and grand visions, but assisting in the creation of Lean proofs is a huge area of opportunity for LLMs and really could change mathematics in fundamental ways.

One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case. We have proofs that are gigabytes (I believe even terabytes in some cases) in size, but we know they are correct because they check in Lean.

This particular pattern of using state-of-the-art work in two different areas (LLMs and theorem proving) absolutely has the potential to fundamentally change how mathematics is done. There's a great picture on p. 381 of Type Theory and Formal Proof where you can easily see how LLMs could be slotted into two of the trickiest parts of that diagram.

Because the work is formally verified we can throw out entire classes of LLM problems (like hallucinations).

Personally, I think strongly typed languages with powerful type systems are also the long-term ideal for coding with LLMs (but I'm less optimistic about devs following that path).

j2kun a day ago | parent | next [-]

> A large amount of Tao's work

Perhaps you meant, "a decent amount of his recent work." He has been doing math long before LLMs, and is still regularly publishing papers with collaborators that have nothing to do with AI. The most recent was last week. https://arxiv.org/search/math?searchtype=author&query=Tao,+T

roadside_picnic a day ago | parent [-]

You are correct; I assumed context was implied, but I do mean "recent work with LLMs". Friends of mine were doing side projects with him about two decades ago, and I have a few of his books on my shelf, so yes, I am aware that Terry Tao was doing work in mathematics prior to the advent of LLMs.

zozbot234 a day ago | parent | prev | next [-]

> I don't believe that's what's happening in this specific example (and am probably wrong), but this is where a lot of Tao's enthusiasm lies.

It absolutely is. With the twist that ChatGPT 5.2 can now also "explain" an AI-generated Lean proof in human-readable terms. This is a game changer, because "refactoring" can now become end-to-end: if the human explanation of a Lean proof is hard to grok and could be improved, you can test changes directly on the formal text and check that the proof still goes through for the original statement.

roadside_picnic a day ago | parent [-]

Thank you, I had corrected it earlier when I had some time to further investigate what was happening.

Formal verification combined with AI is, imho, exactly the type of thinking that gets the most value out of the current state of LLMs.

amluto a day ago | parent | prev | next [-]

If you consider the statement that perfect play by both sides in checkers results in a draw to be the statement of a theorem, then the proof is 237GB compressed :) And verifying it requires quite a lot of computation.

https://www.science.org/cms/asset/7f2147df-b2f1-4748-9e98-1a...

cocoto a day ago | parent | prev | next [-]

> One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case.

That's one of the main reasons why I did not pursue an academic math career. The pure joy of solving exam problems with elegant proofs is very hard to come by on harder problems.

wizzwizz4 a day ago | parent | prev | next [-]

> One naive belief many people have is that proofs should be "intelligible" but it's increasingly clear this is not the case.

That's not a naïve belief. Intelligible proofs represent insight that can be applied to other problems. If our only proof is an opaque one, that means we don't really understand the area yet. Take, for example, the classification of finite simple groups (a ten-thousand-page proof): that is very much not a closed area of research, and we're still discovering new things in the vicinity of the problem.

threethirtytwo a day ago | parent | prev [-]

Math is the tip of the iceberg. If it can do proofs, it can do anything.

robotresearcher a day ago | parent [-]

I don’t have proofs to solve every day, but I have to cycle the dishwasher. I eagerly await.

threethirtytwo a day ago | parent [-]

Things like that it can’t do. But your job is more likely to be a target. Depends on what you do though.

a day ago | parent | prev [-]
[deleted]
malux85 a day ago | parent | prev | next [-]

This is what has excited me for many years - the idea I call "scientific refactoring".

What happens if we reason upwards but change some universal constants? What happens if we use Tao instead of Pi everywhere? These kinds of fun questions would otherwise require an enormous intellectual effort, whereas with the mechanisation and automation of thought, we might be able to run them and see!

kridsdale3 a day ago | parent | next [-]

Not just math: ALL of science suffers heavily from the problem that less than 1% of published works can actually be read by leading researchers.

Google Scholar was a huge step forward for doing meta-analysis vs a physical library.

But agents scanning the vastness of PDFs to find correlations and insights far beyond human context capacity will, I hope, surface a lot of knowledge that we have technically already collected but remain ignorant of.

semi-extrinsic a day ago | parent | next [-]

This idea is just ridiculous to anyone who's worked in academia. The theory is nice, but academic publishing is currently in the late stages of a huge death spiral.

In any given scientific niche, there is a huge amount of tribal knowledge that never gets written down anywhere, just passed on from one grad student to the rest of the group, and from there spread by percolation through the tiny niche. And papers are never honest about the performance of the results and what does not work; there is always cherry-picking of benchmarks, comparisons, etc.

There is absolutely no way you can get these kinds of insights beyond human context capacity that you speak of. The information necessary does not exist in any dataset available to the LLM.

charcircuit a day ago | parent [-]

The same could be said about programmers, but we have adapted and started writing it all down so that AI can use it.

semi-extrinsic 13 hours ago | parent [-]

No no, in comparison to academia, programmers have been extremely diligent at documenting exactly how stuff works and providing fairly reproducible artifacts since the 1960s.

Imagine trying to teach an AI how to code based only on slide decks from consultants. No access to documentation, no Stack Overflow, no open-source code in the training data; just sales pitches and success stories. That's close to how absurd this idea is.

13 hours ago | parent [-]
[deleted]
newyankee a day ago | parent | prev | next [-]

Exactly, and I think not every instance can be claimed to be a hallucination; there is so much latent knowledge they might have explored.

It is likely we will see some AlphaGo-style new approaches in existing research workflows that AI works out, provided there is some verification logic. Humans could probably never go into that space, or maybe none of the researchers ever ventured there for various reasons, since progress in general is almost always incremental.

zozbot234 a day ago | parent | prev [-]

Google Scholar is still ignoring a huge amount of scholarship that is decades old (pre-digital) or even centuries old (and written in now-unused languages that ChatGPT could easily make sense of).

stouset a day ago | parent | prev | next [-]

> What happens if we use Tao instead of Pi everywhere

Literally nothing other than mild convenience. It’s just 2pi.

lapetitejort a day ago | parent | next [-]

Call me a mathematical extremist but I think pi should equal 6.28... and tau, which looks like half of pi, should equal 3.14...

measurablefunc a day ago | parent [-]

In 1897, the Indiana General Assembly attempted to legislate a new value for pi, proposing it be defined as 3.2, which was based on a flawed mathematical proof. This bill, known as the Indiana pi bill, never became law due to its incorrect assertions and the prior proof that squaring the circle is impossible: https://en.wikipedia.org/wiki/Indiana_pi_bill

a day ago | parent | prev | next [-]
[deleted]
measurablefunc a day ago | parent | prev [-]

You're forgetting that some equations have π/2 so on balance nothing will change. It will be the same number of symbols.

ogogmad a day ago | parent [-]

I don't think it's just the sheer number of symbols. It's also the fact that the symbol τ means "turn". So you can say "quarter-turn" instead of π/2.

I'm not sure why that point gets lost in these discussions. And personally, I think of the set of fundamental mathematical objects as having a unique and objective definition. So, I get weirdly bothered by the offset in the Gamma function.

chmod775 a day ago | parent | prev | next [-]

I can write a sed command/program that replaces every occurrence of PI with TAU/2 in LaTeX formulas, and it'll take me about 30 minutes.

The "intellectual effort" this requires is about 0.

Maybe you meant Euler's number? Since it also relates to PI, it can be used and might actually change the framework in an "interesting way" (making it more awkward in most cases - people picked PI for a reason).

observationist a day ago | parent | next [-]

I think they mean it in a more general way. Thinking with tau instead of pi might shift the context toward another method or problem-solving algorithm, or there might be obscure or complex uses of tau or pi that haven't cross-fertilized in the literature - cases where it's natural to think of clever extensions or use cases in one context but not the other. Those extensions and extrapolations will be apparent to AI, within reach of a tedious and exhaustive review of the existing literature.

I think what they were getting at is something like this: applying existing ideas in ways that simply haven't been tried because it's too boring or obvious or abstract for humans to have bothered with. AI can plow through a year's worth of human drudgery in a day or a month or so, and that sort of "brute force" won't require any amazing new technical capabilities from AI.

saulpw a day ago | parent | prev [-]

Yeah but you also have to replace all (2*tau/2) with tau, and 4*(tau/2)^2 with tau^2, etc etc...

sublinear a day ago | parent | prev | next [-]

* Tau

ogogmad a day ago | parent | prev | next [-]

I'm using LLMs to rewrite every formula featuring the Gamma function to instead use the factorial. Just let "z!" mean "Gamma(z+1)", substitute everywhere, and simplify. Then have the AI rewrite any prose.

kelipso 13 hours ago | parent [-]

I'm going to replace every instance of 1 with 0.999 repeating, do the equivalent for all integers, and see how my mind totally explodes.

HardCodedBias a day ago | parent | prev [-]

Think of how this opened up EM:

https://ddcolrs.wordpress.com/2018/01/17/maxwells-equations-...

ComplexSystems a day ago | parent | prev [-]

If this isn't AGI, what is? It seems unavoidable that an AI which can prove complex mathematical theorems would lead to something like AGI very quickly.

pfdietz a day ago | parent | next [-]

Tao has a comment relevant to that question:

"I doubt that anything resembling genuine "artificial general intelligence" is within reach of current #AI tools. However, I think a weaker, but still quite valuable, type of "artificial general cleverness" is becoming a reality in various ways.

By "general cleverness", I mean the ability to solve broad classes of complex problems via somewhat ad hoc means. These means may be stochastic or the result of brute force computation; they may be ungrounded or fallible; and they may be either uninterpretable, or traceable back to similar tricks found in an AI's training data. So they would not qualify as the result of any true "intelligence". And yet, they can have a non-trivial success rate at achieving an increasingly wide spectrum of tasks, particularly when coupled with stringent verification procedures to filter out incorrect or unpromising approaches, at scales beyond what individual humans could achieve.

This results in the somewhat unintuitive combination of a technology that can be very useful and impressive, while simultaneously being fundamentally unsatisfying and disappointing - somewhat akin to how one's awe at an amazingly clever magic trick can dissipate (or transform to technical respect) once one learns how the trick was performed.

But perhaps this can be resolved by the realization that while cleverness and intelligence are somewhat correlated traits for humans, they are much more decoupled for AI tools (which are often optimized for cleverness), and viewing the current generation of such tools primarily as a stochastic generator of sometimes clever - and often useful - thoughts and outputs may be a more productive perspective when trying to use them to solve difficult problems."

This comment was made on Dec. 15, so I'm not entirely confident he still holds it?

https://mathstodon.xyz/@tao/115722360006034040

ben_w a day ago | parent | prev | next [-]

The "G" in "AGI" stands for "General".

While I quickly noticed that my pre-ChatGPT-3.5 use of the term was satisfied by ChatGPT-3.5, this turned out to be completely useless for 99% of discussions, as everyone turned out to have different boolean cut-offs for not only the generality, but also the artificiality and the intelligence, and also for what counts as "intelligence" in the first place.

That everyone can pick a different boolean cut-off for each initial means they're not really booleans.

Therefore, consider that this can't drive a car, so it's not fully general. And even those AIs which can drive a car can't do so in genuinely all conditions expected of a human, just most of them. Stuff like that.

throw310822 a day ago | parent [-]

> consider that this can't drive a car, so it's not fully general

So blind people are not general intelligences?

AxEy a day ago | parent [-]

A blind person does not have the necessary input (sight data) to make the necessary computation. A car autopilot would.

So no, we do not deem a blind person unintelligent due to their inability to drive without sight. But we might judge a sighted person as not generally intelligent if they could not drive with sight.

epolanski a day ago | parent | prev | next [-]

AGI, in its standard definition, requires matching or surpassing humans on all cognitive tasks, not just some, and especially not just some that only a handful of humans have ever taken a stab at.

pfdietz a day ago | parent | next [-]

Since no human could do that, are we to conclude no human is intelligent?

ACS_Solver a day ago | parent | prev [-]

Surely AGI would be matching humans on most tasks. To me, surpassing humans on all cognitive tasks sounds like superintelligence, while AGI "only" needs to perform most, but not necessarily all, cognitive tasks at the level of a human highly capable at that task.

fc417fc802 a day ago | parent | next [-]

Personally I could accept "most" provided that the failures were near misses as opposed to total face plants. I also wouldn't include "incompatible" tasks in the metric at all (but using that to game the metric can't be permitted either). For example the typical human only has so much working memory, so tasks which overwhelm that aren't "failed" so much as "incompatible". I'm not sure exactly what that looks like for ML but I expect the category will exist. A task that utilizes adversarial inputs might be an example of such.

epolanski 15 hours ago | parent | prev [-]

Superintelligence is defined as outmatching the best humans in a field, but again, on all cognitive tasks, not just a subset.

AI can already beat humans in pretty much any game like Go or Chess or many videogames, but that doesn't make it general.

mkl a day ago | parent | prev [-]

This is very narrow AI, in a subdomain where results can be automatically verified (even within mathematics that isn't currently the case for most areas).

threethirtytwo a day ago | parent [-]

Narrow AI? I'm not saying it's AGI, but this is not a narrow AI; it's a general AI given a narrow problem: ChatGPT.

gf000 a day ago | parent [-]

In a very specialized setup, in tandem with a verifier.

Just because a specialized human placed in an F-16 can fly at Mach 2.0, doesn't mean humans in general can fly.

threethirtytwo a day ago | parent [-]

An apt analogy. A human is a general intelligence that can fly with an F-16.

What happens when we put an artificial general intelligence in an F-16? That's what happened here with this proof.

mkl 19 hours ago | parent [-]

Not really. A completely unintelligent autopilot can fly an F-16. You cannot assume general intelligence from scaffolded tool-using success in a single narrow area.

threethirtytwo 13 hours ago | parent [-]

I didn’t assume agi.

I assumed extreme performance from a general AI matching and exceeding average human intelligence when placed in an F-16 or an equivalent "cockpit" built for conducting math proofs.

That's not AGI at all. I don't think you understand that LLMs will never hit AGI even when they exceed human intelligence in all applicable domains.

The main reason is they don't feel emotions. Even if the definition of AGI doesn't currently encompass emotions, people like you will move the goalposts and shift the definition until it does. So as AI improves, the threshold will be adjusted to make sure they never reach AGI, since it's an existential and identity crisis for many people to admit that an AI is better than them on all counts.

mkl 6 hours ago | parent [-]

> I didn’t assume agi.

You literally said:

>>> What happens when we put an artificial general intelligence in an F-16? That's what happened here with this proof.

You're claiming I said a lot of things I didn't; everything you seem to be stating about me in this comment is false.

threethirtytwo 2 hours ago | parent [-]

That's called a hypothetical. I didn't say that we put an AGI into an F-16. I asked what the outcome would be. And the outcome is pretty similar. Please read carefully before making a false statement.

>You're claiming I said a lot of things I didn't; everything you seem to be stating about me in this comment is false.

Apologies. I thought you were being deliberate. What really happened is you made a mistake. Also I never said anything about you. Please read carefully.