ck2 10 hours ago

I still don't understand how "AI" is ready for serious use beyond entertainment purposes

Every time I ask ChatGPT to make a table for a subject I know well, I will find an error in one of the results and it is very confident about it until I question it in detail

Every time I ask ChatGPT for a nutritional breakdown of some dense food source, give it a quantity like 8 ounces, and ask for the weight of each ingredient, the weights will be wrong and add up to more than the original 8 ounces

These are variations of the old "how many Rs in strawberry" problem. It's still not solved: "AI" cannot reassemble a complex problem properly

A lot of what it tells me in detail about some subjects sounds suspiciously like Reddit posts reassembled out of order

llbbdd 4 hours ago | parent | next [-]

Two things that I would recommend trying out if you're interested in exploring this further:

1. If you're not paying for a model, the results will be worse. That sucks but the free access models are just not very good for anything where you need to trust the output, even for basic queries.

2. More important than #1 is access to tool use. If the LLM is just producing a nutritional breakdown from its weights, it's almost always going to be wrong. If the LLM is allowed to break the problem down into deterministic steps, it will do a lot better. In the nutritional breakdown case, an LLM with search + tool access can pretty easily break the problem down:

- Searching the web for a recipe or ingredient breakdown for the food

- Searching the web for nutritional qualities of each ingredient per unit of weight or volume

- Writing and running a script (e.g. in Python) that takes in the recipe's projected serving output, the desired serving size, the amount of each ingredient, etc., scales the ingredients to match the desired serving size, and sums the nutritional qualities of the scaled ingredients.
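
A rough sketch of what that last script might look like. Everything here is an assumption for illustration: the `Ingredient` and `scale_recipe` names are hypothetical, the ingredient weights and calorie figures are placeholders rather than real nutritional data, and the recipe's output weight is assumed to equal the sum of its ingredient weights (no water loss from cooking).

```python
from dataclasses import dataclass

OUNCE_IN_GRAMS = 28.349523125  # exact avoirdupois ounce


@dataclass
class Ingredient:
    name: str
    grams_in_recipe: float  # weight used in the full recipe
    kcal_per_100g: float    # placeholder number, not real data


def scale_recipe(ingredients, target_grams):
    """Scale every ingredient so the weights sum to target_grams.

    Assumes the finished dish weighs the same as its raw ingredients.
    Returns a list of (name, scaled_grams, scaled_kcal) tuples.
    """
    total = sum(i.grams_in_recipe for i in ingredients)
    factor = target_grams / total
    rows = []
    for i in ingredients:
        grams = i.grams_in_recipe * factor
        kcal = grams * i.kcal_per_100g / 100
        rows.append((i.name, grams, kcal))
    return rows


# Placeholder recipe; in the workflow above these numbers would come
# from the search steps, not from the model's weights.
recipe = [
    Ingredient("ingredient A", 300.0, 350.0),
    Ingredient("ingredient B", 150.0, 700.0),
    Ingredient("ingredient C", 50.0, 0.0),
]

serving = 8 * OUNCE_IN_GRAMS
rows = scale_recipe(recipe, serving)

for name, grams, kcal in rows:
    print(f"{name}: {grams:.1f} g, {kcal:.0f} kcal")

total_grams = sum(grams for _, grams, _ in rows)
total_kcal = sum(kcal for _, _, kcal in rows)
print(f"total: {total_grams:.1f} g, {total_kcal:.0f} kcal")
```

Because the scaling is plain arithmetic, the ingredient weights add up to the requested 8 ounces by construction, which is exactly the property the model tends to get wrong when it answers from memory.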

I've tried this specific case with Claude + Gemini for my own purposes and they both handle it very well. The challenge currently is that the models will not always arrive at this approach when provided with an ambiguous prompt; sometimes they will, but sometimes they'll just vomit up a fully autocompleted response from their weights. Being more specific in the prompt or defining a skill that details the intended approach lets you get more useful + deterministic results while still taking advantage of the fuzzy glue that LLMs can provide here between steps.

Same with the classic strawberry r-counting case. IIUC LLMs have trouble with this because they operate on tokens rather than individual letters, but any LLM with tool access will have no trouble farming it out to e.g.

> echo -n "strawberry" | grep -o "r" | wc -l

> 3

bo1024 3 hours ago | parent | prev [-]

There are basically two kinds of applications. One is where you want to correctly solve the problem at least 99 out of 100 times. LLMs generally don't hit that bar (and not everybody realizes that), so there is a lot of debate and research around how useful and reliable they are and how to make them so.

The other kind of application is where you can try 100 times and you only need to be right once. Solving a mathematical research problem is like that.
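
To put rough numbers on the difference between the two kinds, here is a small sketch. It assumes each attempt succeeds independently with the same probability, and that a correct answer can be recognized when it shows up (e.g. a proof that can be checked). The 5% per-attempt success rate is a made-up figure, not a measurement, and the function names are hypothetical.

```python
def p_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds."""
    return 1 - (1 - p) ** n


def p_all(p: float, n: int) -> float:
    """Chance that all n independent attempts succeed."""
    return p ** n


p = 0.05  # hypothetical per-attempt success rate

# Second kind of application: 100 tries, one success is enough.
print(f"at least one success in 100 tries: {p_at_least_one(p, 100):.3f}")

# First kind of application: every single answer has to be right.
print(f"100 correct answers in a row at 99% each: {p_all(0.99, 100):.3f}")
```

Under these assumptions, a model that is right only 1 time in 20 is very likely to succeed somewhere in 100 tries, while a model that is right 99 times in 100 still gets through 100 queries without a mistake only about a third of the time.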