Absolutely agree. If fundamental investing becomes mostly about maintaining and improving your own AI research system, then a typical SaaS frontend is not enough. The financial institutions furthest ahead on adoption are already thinking this way: over time, most apps probably need to become an API surface as much as a UI surface.
For Primer to be the core research engine for a team, users need to understand not just the output, but how the agent got there: task class, source snapshot, tool calls, evidence used, failure modes, and cases where the agent should not answer.
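To make that list concrete, here is a minimal sketch of what one per-run record might look like. This is not Primer's actual schema: `RunTrace`, `ToolCall`, `summarize_by_task_class`, and every field name are hypothetical, chosen only to mirror the items listed above.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    # Hypothetical: one tool invocation made by the agent during a run.
    name: str
    arguments: dict


@dataclass
class RunTrace:
    # Hypothetical record of a single agent run; fields mirror the list above.
    task_class: str                    # e.g. "earnings_summary"
    source_snapshot: str               # identifier for the data the agent saw
    tool_calls: list = field(default_factory=list)
    evidence: list = field(default_factory=list)       # sources cited
    failure_modes: list = field(default_factory=list)  # labels for what went wrong
    abstained: bool = False            # True when the agent declined to answer
    output: Optional[str] = None


def summarize_by_task_class(traces):
    """Count runs, abstentions, and runs with any failure, per task class."""
    counts = defaultdict(lambda: {"runs": 0, "abstained": 0, "failed": 0})
    for trace in traces:
        row = counts[trace.task_class]
        row["runs"] += 1
        row["abstained"] += int(trace.abstained)
        row["failed"] += int(bool(trace.failure_modes))
    return dict(counts)
```

Grouping by task class is just one possible cut. The point is that once each run carries this kind of record, two versions of an agent can be compared on the same tasks and the same source snapshot rather than on how the output looks.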
That is a big part of how we think about the product. The workflow surface is important, but the evaluation surface is what lets users trust, compare, and improve the agent over time. Otherwise you are right: it becomes very hard to distinguish a genuinely better research system from a better UI.
Most investment firms are a long way from being able to evaluate these types of systems properly, so we should be helping them get there.