Remix clone Hacker News

	▲	ponyous 4 hours ago
		I don't have the eval results live yet, so I can't share them. I was benchmarking with a soon-to-be-released version of my AI CAD modeling software[0]. It's basically an agent with access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts + parametric models, inspect the model (visually and with code), search datasheets, ...

		I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad: barely any correlation with what I scored. Things have gotten better, but I don't trust it enough yet.

		Here is how I score adherence (the AI used the same scale, though I also tried methods where it would just give back a boolean "pass" or not):

		  <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
		  <0.4 → Weak – Partially relevant; significant omissions or errors.
		  <0.6 → Fair – Covers main points but lacks completeness or precision.
		  <0.8 → Good – Mostly accurate; minor gaps or deviations.
		  <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.

		Here is the scenario list (prompts are much more detailed):

		  dragon-bottle-stopper
		  editing-param-mid-conv
		  editing-parametric-enclosure
		  editing-swap-material-param
		  editing-text-edit-cube
		  multi-turn-bird-house
		  multi-turn-dice-tower
		  multi-turn-modular-planter
		  multi-turn-phone-stand
		  multi-turn-shelf
		  one-shot-bookend
		  one-shot-cable-clip
		  one-shot-chess-queen
		  one-shot-coaster
		  one-shot-coffee-cup
		  one-shot-dog-tag
		  one-shot-dragon-figurine
		  one-shot-hex-bracket
		  one-shot-keychain-fob
		  one-shot-low-poly-tree
		  one-shot-pegboard-hook
		  one-shot-pi4-case
		  one-shot-threaded-jar

		[0]: https://grandpacad.com
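For anyone who wants to play with the rubric, here is a minimal sketch of how the score bands in the comment above could be expressed in Python, plus a plain Pearson correlation for checking how closely AI scores track human ones. The function names (`adherence_band`, `pearson`) are hypothetical, this is not the author's actual eval code, and the choice of Pearson correlation is an assumption, since the comment does not say how agreement was measured.

```python
from bisect import bisect_right
from math import sqrt

# Upper bounds and labels copied from the rubric in the comment.
# A score equal to a bound falls into the next band, because the
# rubric uses strict "<" for every band except the last.
_UPPER_BOUNDS = [0.2, 0.4, 0.6, 0.8]
_LABELS = ["Poor", "Weak", "Fair", "Good", "Excellent"]


def adherence_band(score: float) -> str:
    """Map an adherence score in [0, 1] to its rubric label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return _LABELS[bisect_right(_UPPER_BOUNDS, score)]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(human_scores, ai_scores)` over the per-scenario scores would give one number for how far the AI judge can be trusted; a value near zero would match the "barely any correlation" experience described above.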
	▲	NiloCK 3 hours ago
		Very cool project. Thanks for sharing!