Imustaskforhelp a day ago
This actually seems like really good advice. I'm interested in how you might adapt this to something like programming-language benchmarks: have independent tests and check whether the output passes them (yes or no), then weight some (more complicated) tasks more heavily than others? Or how exactly would you do it?
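A minimal sketch of the weighted pass/fail scoring the question seems to describe: each task carries independent checks and a difficulty weight, and the score is the weight-normalized pass rate. The task names, weights, and checks below are made up for illustration and are not from the thread.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkTask:
    name: str
    weight: float                        # harder tasks get a larger weight
    tests: list[Callable[[str], bool]]   # independent pass/fail checks on model output

def task_passed(task: BenchmarkTask, model_output: str) -> bool:
    # A task counts as passed only if every independent test passes.
    return all(test(model_output) for test in task.tests)

def weighted_score(tasks: list[BenchmarkTask], outputs: dict[str, str]) -> float:
    # Weight-normalized pass rate: sum of weights of passed tasks over total weight.
    total = sum(t.weight for t in tasks)
    passed = sum(t.weight for t in tasks if task_passed(t, outputs[t.name]))
    return passed / total if total else 0.0

# Toy usage: a simple task weighted 1.0 and a harder one weighted 3.0.
tasks = [
    BenchmarkTask("fizzbuzz", 1.0, [lambda out: "FizzBuzz" in out]),
    BenchmarkTask("parser", 3.0, [lambda out: "def parse" in out,
                                  lambda out: "return" in out]),
]
outputs = {
    "fizzbuzz": "1 2 Fizz 4 Buzz ... FizzBuzz",
    "parser": "def parse(s):\n    return s.split()",
}
print(f"score = {weighted_score(tasks, outputs):.2f}")  # 1.00 here: both toy tasks pass
```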
hamiltont a day ago
Not sure I'm fully following your question, but maybe this helps: IME, deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: think hard → design carefully → write deterministic code → minor debugging

With LLMs: prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance; you need empirical feedback. So I spend most of my time analyzing failure cases: "this prompt generated X, which failed because Y; how do I clarify the requirements?" Often I use an LLM to help debug the LLM. The shift: from "design away problems" to "evaluate into solutions."
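One way to picture that loop, as a rough sketch only: `generate` stands in for whatever LLM client is in use, and the check names and failure-record fields are assumptions, not the commenter's actual setup. It runs a prompt template over test inputs, records each failure with its reason, then packs those failures into a critique request so an LLM can suggest how to clarify the template.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FailureCase:
    prompt: str
    output: str
    reason: str   # which check failed, i.e. "failed because Y"

def evaluate(prompt_template: str,
             cases: list[dict],
             generate: Callable[[str], str],
             checks: dict[str, Callable[[str], bool]]) -> list[FailureCase]:
    # Run the prompt over test inputs and record every failed check for later analysis.
    failures = []
    for case in cases:
        prompt = prompt_template.format(**case)
        output = generate(prompt)
        for name, check in checks.items():
            if not check(output):
                failures.append(FailureCase(prompt, output, name))
    return failures

def debug_request(template: str, failures: list[FailureCase]) -> str:
    # "Use an LLM to help debug the LLM": bundle the failure cases into a prompt
    # asking how to reword the template so the requirements are clearer.
    examples = "\n\n".join(
        f"PROMPT:\n{f.prompt}\nOUTPUT:\n{f.output}\nFAILED CHECK: {f.reason}"
        for f in failures
    )
    return (f"This prompt template produced the failures below.\n"
            f"TEMPLATE:\n{template}\n\nFAILURES:\n{examples}\n\n"
            f"How should the template be reworded to clarify the requirements?")
```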