ponyous 6 hours ago

Just ran and scored 63 3D model generations (via code) across high and no reasoning. A 3D modeling benchmark quickly shows the spatial, logic and code performance of a model, so I think it's a very good indicator of quality.

Here are the results compared to Gemini 3.5 Flash:

    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%

Although it is cheaper with reasoning off (with reasoning high the cost per generation is the same), it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than gemini 3.5 flash, but when I actually look at the models they are worse.
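
To put the table in relative terms, here is a rough sketch that divides each GLM 5.2 row by the Gemini baseline. The figures are copied from the table above; the names `RESULTS`, `BASELINE` and `relative_to_baseline` are made up for illustration and are not part of the benchmark itself.

```python
# Figures copied from the table above:
# (code errors per generation, cost per generation in USD, median time in seconds).
RESULTS = {
    "gemini-3.5-flash, low": (0.71, 0.18, 68),
    "GLM 5.2, reasoning high": (0.61, 0.18, 289),
    "GLM 5.2, reasoning off": (1.52, 0.10, 126),
}
BASELINE = "gemini-3.5-flash, low"


def relative_to_baseline(results, baseline):
    """Return {config: (cost ratio, time ratio)} relative to the baseline row."""
    _, base_cost, base_time = results[baseline]
    return {
        name: (round(cost / base_cost, 2), round(time / base_time, 2))
        for name, (_, cost, time) in results.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (cost_ratio, time_ratio) in relative_to_baseline(RESULTS, BASELINE).items():
        print(f"{name}: {cost_ratio}x cost, {time_ratio}x median time")
```

By these numbers, reasoning high costs the same as the baseline and takes about 4.25x as long, while reasoning off costs roughly 0.56x as much and takes about 1.85x as long.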

Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly the open source SOTA model, by far.

NiloCK 6 hours ago | parent | next [-]

Very interested in this! Can you share more about the modelling method (eg, three js?), the task list, and outputs here?

I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like:

- give 3d modelling task

- render and snapshot from a variety of angles

- feed to third-party vision model for a "what is this" type query

- grade on end-to-end accuracy

Bonus points for asking the vision model something like "how beautiful is this 1-10".
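
The steps above could be wired together roughly like this. It is only a sketch: `render` and `identify` are hypothetical hooks standing in for whatever renderer and third-party vision model are used, not real library calls, and exact-match grading on a label is an assumption.

```python
from typing import Callable, Iterable, Sequence

# Hypothetical hooks (assumptions, not a real API):
# render(model_code, angle_degrees) -> image bytes for one camera angle.
# identify(images) -> the vision model's short answer to "what is this?".
Render = Callable[[str, float], bytes]
Identify = Callable[[Sequence[bytes]], str]


def end_to_end_accuracy(
    tasks: Iterable[tuple[str, str]],  # (expected label, generated model code)
    render: Render,
    identify: Identify,
    angles: Sequence[float] = (0, 90, 180, 270),
) -> float:
    """Fraction of tasks where the vision model names the intended object."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    hits = 0
    for expected, model_code in tasks:
        # Snapshot the same model from several angles.
        snapshots = [render(model_code, angle) for angle in angles]
        guess = identify(snapshots)
        # Case-insensitive exact match; a real grader would likely be fuzzier.
        hits += guess.strip().lower() == expected.strip().lower()
    return hits / len(tasks)
```

Passing the renderer and vision model in as callables keeps the grading loop testable with stubs before any real rendering or model calls are involved.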

ponyous 5 hours ago | parent [-]

I don't have the eval results live yet, so I can't share them.

I was benchmarking using a soon-to-be-released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts + parametric models, inspect the model (visually and with code), search datasheets, ...

I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.

Here is how I score adherence (the AI used the same scale, though I also tried methods where it would just give back a boolean "pass" or not):

    <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
    <0.4 → Weak – Partially relevant; significant omissions or errors.
    <0.6 → Fair – Covers main points but lacks completeness or precision.
    <0.8 → Good – Mostly accurate; minor gaps or deviations.
    <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
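
As a small sketch, the scale above can be turned into a lookup. The thresholds and labels come straight from the list; the function name `adherence_label` and the assumption that scores always lie in [0, 1] are illustrative only.

```python
# Upper bounds and labels copied from the scale above.
# Assumption: scores always lie in [0, 1].
BANDS = [
    (0.2, "Poor"),
    (0.4, "Weak"),
    (0.6, "Fair"),
    (0.8, "Good"),
]


def adherence_label(score: float) -> str:
    """Map an adherence score in [0, 1] to its band on the scale above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    for upper, label in BANDS:
        if score < upper:
            return label
    return "Excellent"  # 0.8 <= score <= 1.0
```

Note that each boundary value falls into the higher band, so 0.2 is "Weak" and 0.8 is "Excellent".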

Here is the scenario list (prompts are much more detailed):

    dragon-bottle-stopper
    editing-param-mid-conv
    editing-parametric-enclosure
    editing-swap-material-param
    editing-text-edit-cube
    multi-turn-bird-house
    multi-turn-dice-tower
    multi-turn-modular-planter
    multi-turn-phone-stand
    multi-turn-shelf
    one-shot-bookend
    one-shot-cable-clip
    one-shot-chess-queen
    one-shot-coaster
    one-shot-coffee-cup
    one-shot-dog-tag
    one-shot-dragon-figurine
    one-shot-hex-bracket
    one-shot-keychain-fob
    one-shot-low-poly-tree
    one-shot-pegboard-hook
    one-shot-pi4-case
    one-shot-threaded-jar


[0]: https://grandpacad.com

NiloCK 4 hours ago | parent [-]

Very cool project. Thanks for sharing!

ComputerGuru 5 hours ago | parent | prev [-]

Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?

ponyous 5 hours ago | parent [-]

Absolutely. Running it now, will update this comment in about 30 mins.

Edit: Surprisingly very good results with 3.0 flash with high thinking.

Cost: $0.06

Duration: 3.22 min

Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)

Adherence was on par with 3.5 flash Low thinking

ComputerGuru 5 hours ago | parent [-]

Thanks! I’ve still been using 3.0 a lot, the price-to-performance ratio absolutely kills compared to Google’s other and newer offerings.