| ▲ | ponyous 4 hours ago | |
The eval results aren't live yet, so I can't share them. I was benchmarking with a soon-to-be-released version of my AI CAD modeling software[0]. It's basically an agent with access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts with parametric models, inspect the model (visually and with code), search datasheets, ...

I tried what you recommend a while ago (asking an AI to evaluate from different angles), and the AI evaluations were extremely bad: barely any correlation with my own scores. Things have gotten better, but I don't trust it enough yet.

Here is how I score adherence (and how the AI did as well, though I also tried methods where it would just return a boolean "pass" or not):
Here is the scenario list (prompts are much more detailed):
[0]: https://grandpacad.com | ||
| ▲ | NiloCK 3 hours ago | parent [-] | |
Very cool project. Thanks for sharing! | ||