Cloning Bench: Evaluating AI Agents on Visual Website Cloning (github.com)
2 points by shahules 10 hours ago | 1 comment

shahules 10 hours ago
My team works on automatic environment generation for RL post-training. One of our projects uses coding agents to build web clones for BUAs/CUAs. We tested Gemini, Claude Code, GLM, and Codex with our harness on their ability to recreate a Slack workspace and benchmarked their performance. We saw a range of results:

- *Gemini 3 Pro:* achieved the highest visual score (0.91 SSIM) but lacked interactive functionality.
- *Claude Opus 4.6:* developed the most complete application, balancing full interactivity with consistent self-correction.
- *GLM-5:* produced the best code architecture but plateaued in visual improvement.
- *GPT-5.3 Codex:* initialized quickly but entered a five-hour "scaling spiral" that yielded no further progress.

Next, we're planning:

- More web apps for cloning and benchmarking across the models
- More functionality (the trajectory didn't include full Slack features)
- Better functionality scoring (to catch failures like Gemini's more easily)

Repo: https://github.com/vibrantlabsai/cloning-bench
Blog post: https://vibrantlabs.com/blog/pa-bench
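For readers unfamiliar with the visual metric above: SSIM compares two images by local luminance, contrast, and structure rather than raw pixel differences. As a rough illustration only (this is my assumption of the idea, not the benchmark's actual scorer, which likely uses a windowed implementation such as scikit-image's), here is a global SSIM over two equal-size grayscale screenshots:

```python
def ssim(x, y, dynamic_range=255):
    """Global structural similarity of two equal-size grayscale images,
    given as flat lists of pixel intensities in [0, dynamic_range].
    Returns 1.0 for identical images; lower values mean less similar.
    NOTE: illustrative sketch only -- real scorers compute SSIM over
    sliding local windows and average the per-window values."""
    assert len(x) == len(y) and x, "images must be the same non-empty size"
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((p - mu_x) ** 2 for p in x) / n      # sigma_x^2
    var_y = sum((q - mu_y) ** 2 for q in y) / n      # sigma_y^2
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    # Standard stabilizing constants: C1 = (0.01 L)^2, C2 = (0.03 L)^2
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Under this metric, a pixel-faithful clone scores near 1.0 even with small brightness shifts, which is why a clone can score 0.91 visually while being completely non-interactive, since SSIM sees only rendered pixels.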