cj | 4 hours ago
I love the tax use case. What scares me, though, is how I've (still) seen ChatGPT make up numbers in some specific scenarios. I have a ChatGPT project with all of my bloodwork and a bunch of medical info from the past 10 years uploaded. I think it's more context than ChatGPT can handle at once. When I ask it basic things like "Compare how my lipids have trended over the past 2 years," it will sometimes make up numbers for tests, or mix up the dates on certain data points. The errors are usually small enough that I don't notice them until I really study what it's telling me.

And there's the opposite problem too: a couple days ago I thought I saw an error (when really ChatGPT was right). So I said "No, that number is wrong, find the error," and instead of pushing back and telling me the number was right, it admitted to the error (there was no error) and made up a reason why it was wrong.

Hallucinations have gotten way better compared to a couple years ago, but in my experience ChatGPT still breaks down, especially when it's overloaded with a ton of context.
arjie | 4 hours ago
In my case, what I like to do is extract data into a machine-readable format; once the data is appropriately modeled, further actions can use programmatic means to analyze it. As an example, I also used Claude Code on my taxes:

1. I keep all my accounts in accounting software (originally Wave, then beancount).
2. Because the machinery is all programmatically queryable, the data is never in token-space, only the schema and logic.

I then use tax software to prep my professional and personal returns. The LLM acts as a validator and ensures I've done my accounts right. I have `jmap` pull my mail via IMAP and my Mercury account via a read-only, transactions-only token, then I let it compare against my beancount records to make sure I've accounted for things correctly.

For the most part, you want it handling very little arithmetic in token-space, though the SOTA models can do it pretty flawlessly. I did notice that they would occasionally make arithmetic errors in numerical comparisons, but when using them as an assistant you're not using them directly; they're a hypothesis generator and a checker tool, and if you ask them to write out the reasoning they're pretty damned good. For me, Opus 4.6 in Claude Code was remarkable for this use case.

These days, I just run `,cc accounts` and then look at the newly added accounts in fava and compare with Mercury. This is one of those tedious-to-enter, trivial-to-verify use cases that they excel at. To be honest, I was fine using Wave, but without machine access it's software that's dead to me.
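A minimal sketch of the reconciliation step arjie describes: match transactions from a bank export against ledger records and report anything unaccounted for. The data here is made up for illustration; the real version would query beancount and the Mercury API rather than hardcode tuples.

```python
from datetime import date

# Hypothetical data: (date, amount in cents, description) tuples.
# In practice, one list would come from the bank export and the
# other from a beancount query.
bank = [
    (date(2024, 3, 1), -1299, "AWS"),
    (date(2024, 3, 2), -4500, "Office supplies"),
    (date(2024, 3, 5), 250000, "Client invoice"),
]
ledger = [
    (date(2024, 3, 1), -1299, "AWS"),
    (date(2024, 3, 5), 250000, "Client invoice"),
]

def reconcile(bank, ledger):
    """Return bank transactions with no matching (date, amount) ledger entry."""
    unmatched = []
    remaining = list(ledger)
    for txn in bank:
        match = next(
            (e for e in remaining if e[0] == txn[0] and e[1] == txn[1]), None
        )
        if match:
            remaining.remove(match)  # each ledger entry can match only once
        else:
            unmatched.append(txn)
    return unmatched

for d, cents, desc in reconcile(bank, ledger):
    print(f"{d} {cents / 100:+.2f} {desc} missing from ledger")
```

The point is that the comparison itself is ordinary code; the LLM only generates hypotheses about why a transaction is unmatched, which keeps arithmetic out of token-space.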
shepherdjerred | 4 hours ago
I've gotten better results by telling it "write a Python program to calculate X".
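To illustrate the pattern shepherdjerred describes with cj's lipid example: instead of asking the model to compare values in-context, have it emit a small program over a structured extract. The lab values and field names below are hypothetical:

```python
# Hypothetical lab data, the kind of structured extract you would
# have the model write a program against.
lipids = [
    {"date": "2023-01-15", "ldl": 131, "hdl": 48},
    {"date": "2023-07-20", "ldl": 124, "hdl": 51},
    {"date": "2024-01-18", "ldl": 112, "hdl": 55},
]

def trend(records, marker):
    """Deltas between consecutive measurements of one marker, oldest first."""
    values = [
        (r["date"], r[marker])
        for r in sorted(records, key=lambda r: r["date"])
    ]
    return [(b[0], b[1] - a[1]) for a, b in zip(values, values[1:])]

for d, delta in trend(lipids, "ldl"):
    print(f"{d}: LDL change {delta:+d}")
```

Because the numbers flow through code rather than the model's context window, they can't be silently transposed or invented.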
ElFitz | 3 hours ago
I’d say for these use cases it’s better to have it build the tools that do the thing than to have it do the thing itself. And it usually takes just as long.