Show HN: FieldOps-Bench – an open eval for physical-world AI agents (camerasearch.ai)
1 point by Aeroi 9 hours ago | 3 comments
Hey HN, I'm Pete. I'm a boat captain by trade, but I've spent the last 16 months building Camera Search. Agents in the physical world need a different set of skills to be useful, so we've optimized our harness and architecture to specialize in diagnosing and fixing problems in traditional industries: mining, oil & gas, telecom, construction, and the skilled trades.

Existing benchmarks didn't cover what workers in these industries actually do day-to-day, so today I'm publishing FieldOps-Bench on GitHub and Hugging Face [https://huggingface.co/datasets/CameraSearch/fieldopsbench]. It's a 157-case multimodal benchmark across 7 industries, testing visual diagnostics, code/standard citations, and general industrial field knowledge.

I ran it against our agent and the frontier models. Camera Search beat Claude Opus 4.6 on 87% of cases. I scored it two ways: a rubric and pairwise judging. I'm not a benchmarks specialist, so criticism is welcome, and yes, it's apples-to-oranges because my agent has tool use the baseline models don't. I still think it shows what's possible when you tune the system and corpus for a specific vertical instead of relying on a general-purpose model.

Happy to answer any questions, and I'd love to connect with people building agents for the physical world, especially where the stakes are high and the information is incomplete. -Pete
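For readers unfamiliar with the two scoring modes mentioned, here is a rough sketch of what pairwise judging typically reduces to. This is purely illustrative: `pairwise_win_rate`, the case fields, and the toy judge are hypothetical names, not part of the actual FieldOps-Bench harness.

```python
# Minimal sketch of pairwise judging: a blinded judge compares two answers
# to the same case, and the score is the fraction of cases agent A wins.
# All names here are illustrative, not from the real benchmark code.

def pairwise_win_rate(cases, judge_prefers):
    """Fraction of cases where the judge prefers answer A over answer B.

    `judge_prefers` is any callable (case, answer_a, answer_b) -> "A" or "B",
    e.g. an LLM judge shown both responses with identities hidden.
    """
    wins = sum(
        1 for c in cases
        if judge_prefers(c, c["answer_a"], c["answer_b"]) == "A"
    )
    return wins / len(cases)

# Toy example with a trivial judge that prefers the longer answer.
cases = [
    {"answer_a": "corroded valve; replace gasket per spec", "answer_b": "valve issue"},
    {"answer_a": "ok", "answer_b": "tighten the gland packing and re-test"},
]
rate = pairwise_win_rate(cases, lambda c, a, b: "A" if len(a) > len(b) else "B")
print(rate)  # 0.5 — agent A wins 1 of 2 toy cases
```

In practice the judge would be blinded and the answer order randomized per case to avoid position bias; the rubric mode would instead score each answer against per-case criteria independently.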
Aeroi 9 hours ago | parent | next
One thing that surprised me is how much code-citation data is already in most models' training data. Where agents still fall apart is visual analysis: give them a photo of a corroded valve with a vague description and they'll confidently cite the wrong API standard. That gap accounts for most of our 87% delta. Happy to walk through specific cases if anyone wants to dig in.
nigardev 8 hours ago | parent | prev
visual analysis is the right bottleneck to call out. most coding agents can read and write code fine because it's just text. but identify a corroded valve from a photo and suggest the right fix? that's a different problem entirely. curious how your benchmark scores the gap between text-reasoning and visual-reasoning tasks