Show HN: PhAIL – Real-robot benchmark for AI models (phail.ai)
20 points by vertix a day ago | 8 comments
I built this because I couldn't find honest numbers on how well VLA models [1] actually work on commercial tasks. I come from search ranking at Google where you measure everything, and in robotics nobody seemed to know.

PhAIL runs four models (OpenPI/pi0.5, GR00T, ACT, SmolVLA) on bin-to-bin order picking – one of the most common warehouse operations. Same robot (Franka FR3), same objects, hundreds of blind runs. The operator doesn't know which model is running.

Best model: 64 UPH. Human teleoperating the same robot: 330. Human by hand: 1,300+.

Everything is public – every run with synced video and telemetry, the fine-tuning dataset, training scripts. The leaderboard is open for submissions.

Happy to answer questions about methodology, the models, or what we observed.

[1] Vision-Language-Action: https://en.wikipedia.org/wiki/Vision-language-action_model
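
The three throughput figures above are easier to compare once converted to a common footing. The following is a minimal sketch, not part of the PhAIL tooling: it assumes UPH means units per hour (the post does not expand the acronym) and treats "1,300+" as a lower bound of 1,300. The dictionary keys and function names are hypothetical, chosen only for illustration.

```python
# Throughput figures quoted in the post, in UPH.
# Assumption: UPH = units picked per hour.
# Assumption: "1,300+" is taken as 1,300, i.e. a lower bound.
UPH = {
    "best_model": 64,
    "human_teleop": 330,
    "human_by_hand": 1300,
}


def seconds_per_unit(uph: float) -> float:
    """Average time per picked unit, in seconds, at a given hourly rate."""
    return 3600.0 / uph


def relative_throughput(uph: float, baseline_uph: float) -> float:
    """Throughput expressed as a fraction of a baseline throughput."""
    return uph / baseline_uph


if __name__ == "__main__":
    for name, uph in UPH.items():
        print(f"{name}: {uph} UPH, {seconds_per_unit(uph):.1f} s per unit")

    model = UPH["best_model"]
    for baseline in ("human_teleop", "human_by_hand"):
        share = relative_throughput(model, UPH[baseline])
        print(f"best_model vs {baseline}: {share:.1%}")
```

Under these assumptions the best model takes about 56 seconds per unit, against roughly 11 seconds for a human teleoperating the same robot and under 3 seconds for a human working by hand.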
chfritz a day ago
This is absolutely awesome. Thanks for sharing! I would love to chat more with you. For context: we make a remote teleoperation solution for robotics. It's mostly used for mobile robots, but we've been getting a lot of inquiries regarding teleoperation for manipulation, so I've been learning more about this, in particular regarding the question of speed. I really appreciate these results!
apetrovicheva a day ago
This is amazing. Loved watching the videos with real-world attempts. Finally a real benchmark vs polished teleoperated twitter videos. Shows the real state of a super important industry, and there’s a lot of work to do.
vladimir_gor a day ago
I'm a big fan of benchmarks and now finally we have one to evaluate models on physical tasks. Will be interesting to see how fast this gap will narrow.
akshaisarathy a day ago
If I understand correctly, this is about benchmarking robot models. Do you have a robot to do the benchmarking or is it all simulation?
anna_pozniak a day ago
I'm curious! What other models are you planning to add to the leaderboard?