heresalexandria 4 hours ago

Something a lot of folks struggling with these systems don't get is that the instruction and management of them is often quite important - just because they're capable doesn't mean they're mind readers.

Most of the skepticism I encounter on this front comes from a lack of proper direction, of a process that involves planning and review before execution, and of appropriate attention to evaluation and feedback loops.

If you asked the smartest person in the world to YOLO a task with the sort of instruction the average denier uses to evaluate an LLM, you'd likely find they wouldn't get back what they were expecting either - and if you're evaluating on subpar models/tools, you shouldn't be surprised to get subpar results.

lilbigdoot 4 hours ago | parent | next [-]

I asked Qwen 3.7 pro to create a C# project that takes a string and reverses it, with a single file WASM target. It spun wheels for over 30 minutes and got nothing.

I use LLMs all the time to help me diagnose bugs and work through my designs, but again and again, I am super unimpressed by their coding abilities. I can see how in some cases with a proper harness they probably do a decent job at certain tasks, but on almost everything I try to do, they flail.

heresalexandria 4 hours ago | parent [-]

Qwen is a lightweight, locally hosted model that's many months behind the SoTA available from the big three. While the crowd here (myself included) is excited for locally hosted models to catch up to the usable baseline, they aren't there yet, regardless of which benchmarks you based that selection on.

gmm1990 4 hours ago | parent | prev | next [-]

This seems to be a very generic/common response to any AI critique. It kind of reinforces my point: there are a lot of situations where the appropriate harness isn’t some agent that’s set to ultra-high thinking mode. Chat mode gives the better response and answers the question more quickly.

heresalexandria 3 hours ago | parent [-]

That's fair, I do agree that you don't need a harness or ultra-high thinking mode for many problems. Many folks evaluate without those things on a task that would benefit from them, which leads to the sort of attitudes in this article and its comments section. That's where my comment was coming from.

If you're just saying different tools are best suited for different problems, apologies - that's my take as well.
