joshstrange a day ago

I might test this out, but I worry that it suffers from the same problems I ran into the last time I played with LLMs writing queries. Specifically: not understanding your schema. It might understand relations, but most production tables have oddly named columns, columns whose function has changed over time, deprecated columns, internal-lingo columns, and the list goes on.

Granted, I was using 3.5 at the time, but even with heavy prompting and trying to explain what certain tables/columns are used for, feeding it the schema, and feeding it sample rows, more often than not it produced garbage. Maybe 4o/o3/Claude4/etc can do better now, but I’m still skeptical.
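
To make that concrete, the kind of setup I mean looked roughly like the sketch below: schema DDL plus a few sample rows pasted straight into the prompt. The table, columns, and model name are placeholders rather than my real schema, and the call assumes the OpenAI Node SDK.

    // Sketch only: hypothetical legacy table, placeholder model name.
    import OpenAI from "openai";

    const client = new OpenAI();

    const schema = `
    CREATE TABLE ord_hdr (            -- orders ("hdr" = header, legacy naming)
      ord_id     BIGINT PRIMARY KEY,
      cust_ref   BIGINT,              -- FK to customers.id
      amt_c      INTEGER,             -- total in cents, NOT dollars
      cash       INTEGER              -- deprecated years ago, should be ignored
    );`;

    const sampleRows = `ord_id | cust_ref | amt_c | cash
    1001   | 42       | 15999 | 0`;

    async function sqlFromQuestion(question: string): Promise<string | null> {
      const resp = await client.chat.completions.create({
        model: "gpt-4o", // placeholder; I was on 3.5 at the time
        messages: [
          {
            role: "system",
            content: `You write PostgreSQL queries.\nSchema:\n${schema}\nSample rows:\n${sampleRows}\nReturn only SQL.`,
          },
          { role: "user", content: question },
        ],
      });
      return resp.choices[0].message.content;
    }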

liquidki a day ago | parent | next [-]

I think this is the Achilles' heel of LLM-based AI: the attention mechanisms are far, far inferior to a human's, and I haven't seen much progress here. I regularly test models by feeding in a 20-30 minute transcript of a podcast and asking them to state the key points.

This is not a lot of text, maybe 5 pages. I then skim it myself in about 2-3 minutes and I write down what I would consider the key points. I compare the results and I find the AI usually (over 50% of the time) misses 1 or more points that I would consider key.

I encourage everyone to reproduce this test just to see how well current AI works for this use case.

For me, AI can't adequately do one of the first things that people claim it does really well (summarization). I'll keep testing, maybe someday it will be satisfactory in this, but I think this is a basic flaw in the attention mechanism that will not be solved by throwing more data and more GPUs at the problem.

joshstrange a day ago | parent | next [-]

> I encourage everyone to reproduce this test just to see how well current AI works for this use case.

I do this regularly and find it very enlightening. After I’ve read a news article or done my own research on a topic I’ll ask ChatGPT to do the same.

You have to be careful when reading its response to not grade on a curve, read it as if you didn’t do the research and you don’t know the background. I find myself saying “I can see why it might be confused into thinking X but it doesn’t change the fact that it was wrong/misleading”.

I do like it when LLMs cite their sources, mostly because I find out they’re wrong. Many times I’ve read a summary, followed it to the source, read the entire source, and realized it says nothing of the sort. But almost always, I can see where it glued together pieces of the source, incorrectly.

A great micro example of this are the Apple Siri summaries for notifications. Every time they mess up hilariously I can see exactly how they got there. But it’s also a mistake that no human would ever make.

pu_pu a day ago | parent | prev | next [-]

This is not a difficult problem to solve. We can add the schema, column names, and column descriptions to the system prompt. That can significantly improve performance.

All it will take is a form where the user supplies details about each column and relation. For some reason, most LLM-based apps don't add this simple feature.
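
Concretely, the output of such a form could be as simple as a list of column notes serialized into the system prompt. A minimal sketch (all table/column names here are hypothetical):

    // Hypothetical column documentation a user could supply through a form.
    type ColumnDoc = { table: string; column: string; description: string };

    const columnDocs: ColumnDoc[] = [
      { table: "ord_hdr", column: "amt_c", description: "Order total in cents, not dollars." },
      { table: "ord_hdr", column: "cash", description: "Deprecated for years; never use." },
      { table: "cust", column: "tier_cd", description: "Internal tier code: B = basic, P = pro." },
    ];

    // Serialize the notes into the system prompt the query-writing model sees.
    function buildSystemPrompt(docs: ColumnDoc[]): string {
      const lines = docs.map((d) => `- ${d.table}.${d.column}: ${d.description}`);
      return [
        "You translate questions into PostgreSQL.",
        "Column meanings (supplied by the user):",
        ...lines,
        "Only use columns listed above. Return SQL only.",
      ].join("\n");
    }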

joshstrange a day ago | parent [-]

It’s not a difficult problem to set up; I did exactly that last year with 3.5, and it didn’t help. That’s not to say that newer models wouldn’t do better, but I have tried this approach. It is a difficult problem to actually get working.

pu_pu a day ago | parent [-]

So, I have not tried it on a very complex database myself, so I can't comment on how well it will work in production systems. I have tried this approach with a single BigQuery table, and it worked pretty well for my toy example.

If by 3.5 you mean ChatGPT 3.5, you should absolutely try it with newer models; there is a huge difference in capabilities.

joshstrange a day ago | parent [-]

Yes, ChatGPT 3.5; this testing was a while back. I’m sure it has improved, but I doubt it’s solid enough for me to trust.

Example/clean/demo datasets it does very well on, incredibly impressive even. But on a real-world schema and data for an app developed over many years, it struggled. Even when I could finally prompt my way into getting it to work for one type of query, other queries would randomly break.

It would have been easier to just provide tools for hard-coded queries if I wanted to expose a chat interface to the data.
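
To sketch that alternative: the model never writes SQL at all, it only picks one of a fixed set of vetted queries and fills in parameters. The tool names, query text, and schema shape below are illustrative, not from any particular product.

    // Vetted, hard-coded report queries; hypothetical table/column names.
    const reportQueries = {
      sales_by_month: {
        sql:
          "SELECT date_trunc('month', created_at) AS month, sum(amt_c) / 100.0 AS total " +
          "FROM ord_hdr WHERE created_at >= $1 AND created_at < $2 GROUP BY 1 ORDER BY 1",
        params: ["from", "to"],
      },
      top_customers: {
        sql:
          "SELECT cust_ref, sum(amt_c) / 100.0 AS total FROM ord_hdr " +
          "GROUP BY cust_ref ORDER BY total DESC LIMIT $1",
        params: ["limit"],
      },
    };

    // Tool definition in the common "function tool" JSON-schema shape chat APIs accept.
    const tools = [
      {
        type: "function",
        function: {
          name: "sales_by_month",
          description: "Monthly sales totals between two dates.",
          parameters: {
            type: "object",
            properties: {
              from: { type: "string", description: "ISO date, inclusive" },
              to: { type: "string", description: "ISO date, exclusive" },
            },
            required: ["from", "to"],
          },
        },
      },
    ];

    // When the model calls a tool, only the vetted SQL runs; the model just supplies values.
    async function runTool(
      name: keyof typeof reportQueries,
      args: Record<string, string>,
      execute: (sql: string, params: string[]) => Promise<unknown>,
    ) {
      const report = reportQueries[name];
      return execute(report.sql, report.params.map((p) => args[p]));
    }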

brulard a day ago | parent | prev | next [-]

I got better results with Claude Code + PostgreSQL MCP. I let Claude understand my Drizzle schema first, and I can instruct it to also look at how some entities are used in the code. Then it is smarter about understanding what the data represents.
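
The nice part is that a Drizzle schema is just TypeScript the agent can read, so descriptive property names and comments travel with it. A made-up example:

    import { bigint, integer, pgTable, timestamp } from "drizzle-orm/pg-core";

    // Hypothetical legacy "ord_hdr" table: "hdr" means header, amt_c is cents.
    export const orders = pgTable("ord_hdr", {
      id: bigint("ord_id", { mode: "number" }).primaryKey(),
      customerId: bigint("cust_ref", { mode: "number" }).notNull(), // FK to customers
      amountCents: integer("amt_c").notNull(), // order total in cents, not dollars
      cash: integer("cash"), // deprecated; kept only for old rows, never read
      createdAt: timestamp("created_at").notNull(),
    });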

nicktikhonov a day ago | parent | prev [-]

Might be possible to solve this with prompt configuration, e.g. you'd be able to explain to the LLM all the weird naming conventions and unintuitive mappings.

joshstrange a day ago | parent [-]

I did that the last time (again, only with 3.5, things have hopefully improved in this area).

And I could potentially see LLMs being useful to generate the “bones” of a query for me but I’d never expose it to end-users (which was what I was playing with). So instead of letting my users do something like “What were my sales for last month?” I could use LLMs to help build queries that were hardcoded for various reports.

The problem is that I know SQL, I’m pretty good at it, and I have a perfect understanding of my company’s schema. I might ask an LLM a generic SQL question, but trying to feed it my schema just leads to (or rather “led to”, in my trials before) prompt hell. I spent hours tweaking the prompts, feeding it more context, begging it to ignore the “cash” column that has been deprecated for 4+ years, etc. After all of that it would still make simple mistakes that I had specifically warned against.