WarmWash 5 hours ago

I did my (out of the ordinary) taxes this year using agents, partly as an experiment and partly to save ~$750. Opus 4.6 max in CC, 5.4 xhigh in codex, and 3.1 high in antigravity. All on the $20/mo plans.

I have a day job, a side business, actively trade shares, options, and futures, and have a few energy credit items.

All three were given the same copied folder containing all the documents needed to compose the return, and all were given the same prompt. My goal was that if all three agreed, I could then go through the result fairly confidently and fill out the actual submission forms myself.
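The cross-check described above can be sketched as a small script: collect each agent's computed line items and flag any item where the three disagree. The agent names, line items, and dollar values here are all hypothetical placeholders, not figures from the actual returns.

```python
# Hypothetical outputs from three agents; every value is a placeholder.
RETURNS = {
    "codex":       {"agi": 184_200, "total_tax": 31_450, "refund": -1_200},
    "antigravity": {"agi": 184_200, "total_tax": 31_450, "refund": -1_200},
    "claude":      {"agi": 184_200, "total_tax": 34_100, "refund": -3_850},
}

def disagreements(returns):
    """Return {line_item: {agent: value}} for items the agents disagree on."""
    items = next(iter(returns.values())).keys()
    flagged = {}
    for item in items:
        values = {agent: r[item] for agent, r in returns.items()}
        if len(set(values.values())) > 1:  # more than one distinct value
            flagged[item] = values
    return flagged

print(disagreements(RETURNS))
```

With the placeholder numbers above, `total_tax` and `refund` would be flagged for a manual look while `agi` passes, which is the "only trust what all three agree on" rule in miniature.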

5.4 nailed it on the first shot. Took about 12 minutes.

3.1 missed one value because it decided to load only the first 5 pages of a 30-page document. Surprisingly, it took only about 2 minutes to complete. A second prompt and ~10 seconds corrected it; GPT and Gemini were then perfectly aligned in their outputs.

4.6 hit my usage limit before finishing, after running for ~10 minutes. I returned the next day to let it finish; it ran for another 5 minutes or so. There were multiple errors, and the final tax burden was off by a few thousand dollars. On a second prompt asking it to check the problem areas for errors, it produced matching values after a couple more minutes.

For my first time using CC and 4.6 (outside of some programming in AG), I am pretty underwhelmed given the incessant hype.

toddmorey 5 hours ago | parent [-]

My taxes are rather complex, so I ran the same exercise to see if Claude agreed with my accountant. An automated second opinion, so to speak. It spent about 6 minutes analyzing all the PDFs and basically nailed it in one shot.

My only point here is that it sure seems like the same activity / use case can have wildly different results across sessions or users. Customer support and product development in the age of non-deterministic software is a strange, strange beast.

ozozozd 4 hours ago | parent [-]

What does "nailed it" mean when the question is whether it agreed with your accountant?

toddmorey 4 hours ago | parent [-]

Given the same inputs, but without being provided the results (output) from our accountant, did it come to the same conclusions, or did it have a good analysis of why it differed?

Obviously, accounting is "spreadsheet math" intensive, so Claude wrote some Python scripts for that, which kept the math very stable. But there were some complex nuances that had taken the accountant and me quite a bit of work to track down and clarify. Claude quickly had a very accurate read on the situation and asked all the right clarifying questions.
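The point about scripts keeping the math stable can be illustrated with a minimal sketch: push the bracket arithmetic into code so it is deterministic instead of letting the model do mental math. The bracket boundaries and rates below are hypothetical placeholders, not real tax tables, and this is not the script Claude actually wrote.

```python
# Hypothetical marginal brackets: (upper bound of bracket, marginal rate).
# These numbers are placeholders, NOT real tax tables.
BRACKETS = [
    (10_000, 0.10),
    (40_000, 0.12),
    (90_000, 0.22),
    (float("inf"), 0.24),
]

def tax_owed(taxable_income):
    """Tax each slice of income at its bracket's rate and sum the slices."""
    owed, lower = 0.0, 0.0
    for upper, rate in BRACKETS:
        if taxable_income <= lower:
            break
        owed += (min(taxable_income, upper) - lower) * rate
        lower = upper
    return round(owed, 2)

print(tax_owed(50_000))  # -> 6800.0 under these placeholder brackets
```

Because the script is plain arithmetic, re-running it always gives the same answer, which is exactly the stability the model loses when it tries to carry multi-step sums in its own output.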

I'm not yet ready to sign a return that's been entirely AI-prepared, but I left the exercise pretty impressed.