> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.

I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than doing these tasks themselves.

For example, if I give an LLM my database schema and ask it to scan for redundant indexes and flag naming-convention violations, it might do a passable but incomplete job.

But if I tell the LLM to write a Python or Node.js script to do the same, I get significantly better results. It's also often faster to generate and run the script than to have the LLM process large SQL files.
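To make that concrete, here is a minimal sketch of the kind of script I mean. It assumes the index definitions have already been extracted into (table, index name, column list) tuples, and it treats an index as redundant when its columns are a leftmost prefix of another index on the same table. The function name, the input shape, and the prefix rule are all my own assumptions for illustration; a real check would also need to account for unique constraints and index types.

```python
from collections import defaultdict


def find_redundant_indexes(indexes):
    """Return (table, redundant_index, covering_index) tuples.

    `indexes` is an iterable of (table, index_name, columns) tuples.
    This input shape is hypothetical, chosen only for the example.
    """
    by_table = defaultdict(list)
    for table, name, columns in indexes:
        by_table[table].append((name, tuple(columns)))

    redundant = []
    for table, entries in by_table.items():
        for name, cols in entries:
            for other_name, other_cols in entries:
                if name == other_name:
                    continue
                # Leftmost-prefix rule: (a) is covered by (a, b).
                is_prefix = other_cols[:len(cols)] == cols
                # For exact duplicates, flag only the later name so one copy survives.
                if is_prefix and (len(cols) < len(other_cols) or name > other_name):
                    redundant.append((table, name, other_name))
                    break
    return redundant


# Made-up sample data, not taken from any real schema.
sample = [
    ("users", "idx_users_email", ["email"]),
    ("users", "idx_users_email_created", ["email", "created_at"]),
    ("users", "idx_users_name", ["name"]),
]
print(find_redundant_indexes(sample))
```

The point is that the check is deterministic once it is written down, so it does not miss the 40th table the way a model skimming a large SQL file might.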

plagiarist 3 days ago | parent | next [-]

The dream is probably that the inference software writes and executes that script itself rather than relying on text generation alone, analogous to how a human might cross off pairs of parentheses to check that example.
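For reference, the two tasks from the quote are tiny as scripts. This is just a sketch: the function names are mine, and the balance check assumes the input contains only parentheses. It mirrors the cross-off approach by repeatedly deleting adjacent "()" pairs until none remain.

```python
def parity(bits: str) -> int:
    """Return 0 if the string has an even number of '1' characters, else 1."""
    return bits.count("1") % 2


def is_balanced(s: str) -> bool:
    """Cross off adjacent '()' pairs until none are left.

    Assumes `s` contains only '(' and ')' characters.
    The string is balanced exactly when nothing remains.
    """
    while "()" in s:
        s = s.replace("()", "")
    return s == ""


print(parity("11000"))
print(is_balanced("(())()"))
```

A counter that goes up on "(" and down on ")" would be the more usual way to write the second one, but the cross-off version is closer to what a person does on paper.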

	ubutler 2 days ago | parent [-]
		ChatGPT already does this, albeit in limited circumstances, through the use of its sandbox environment. Asking GPT in thinking mode to, for example, count the number of “l”s in a long text may see it run a Python script to do so. There’s a massive issue with extrapolating to more complex tasks, however: you either run the risk of prompt injection by granting your agent access to the internet or, more commonly, see an exponential degradation in coherence over long contexts.

whateveracct 2 days ago | parent | prev [-]

That's because abstraction is compression of information.