lolinder 10 days ago

One of the consistent problems I'm seeing over and over again with LLMs is people forgetting that they're limited by the training data.

Software engineers get hyped when they see the progress in AI coding and immediately begin to extrapolate to other fields—if Copilot can reduce the burden of coding so much, think of all the money we can make selling a similar product to XYZ industries!

The problem with this extrapolation is that the software industry is pretty much unique in the amount of information about its inner workings that is publicly available for training on. We've spent the last 20+ years writing millions and millions of lines of code that we published on the internet, not to mention answering questions on Stack Overflow (which still has 3x as many answers as all other Stack Exchanges combined [0]), writing technical blogs, hundreds of thousands of emails in public mailing lists, and so on.

Nearly every other industry (with the possible exception of Law) produces publicly-visible output at a tiny fraction of the rate that we do. Ethics of the mass harvesting aside, it's simply not possible for an LLM to have the same skill level in ${insert industry here} as they do with software, so you can't extrapolate from Copilot to other domains.

[0] https://stackexchange.com/sites?view=list#answers

steveBK123 10 days ago | parent | next [-]

Yes this is EXACTLY it, and I was discussing this a bit at work (financial services).

In software, we've all self taught, improved, posted Q&A all over the web. Plus all the open source code out there. Just mountains and mountains of free training data.

However software is unique in being both well paying and something with freely available, complete information online.

A lot of the rest of the world remains far more closed, almost an apprenticeship system. In my domain, think of things like company fundamental analysis, algo/quant trading, etc. There are lots of books you can buy from the likes of Dalio, but no real (good) step-by-step information online about the research and investment process.

Likewise, I'd imagine heavily patented/regulated/IP industries like chip design, drug design, etc. are substantially as closed. Maybe companies using an LLM on their own data internally could make something of it, but it's also quite likely there is no 'data' so much as tacit knowledge handed down over time.

rm445 9 days ago | parent | prev | next [-]

Many other industries haven't yet been fully eaten by software. All kinds of data is locked away and in proprietary formats, and is generated by humans without much automation. I don't think we know where exactly the frontiers are, once someone puts in the work to build large datasets, and automates creation of synthetic training data. Whole industries could suddenly flip from 'impossible' to 'easy' for AI.

mountainriver 10 days ago | parent | prev | next [-]

Yep, this is also the reason LLMs could probably work well for a lot more things if we did have the data.

unoti 10 days ago | parent | prev [-]

>The problem with this extrapolation is that the software industry is pretty much unique in the amount of information about its inner workings that is publicly available for training on... millions of lines of code that we published on the internet...

> Nearly every other industry (with the possible exception of Law) produces publicly-visible output at a tiny fraction of the rate that we do.

You are correct! There's lots of information available publicly about certain things like code, and writing SQL queries. But other specialized domains don't have the same kind of information trained into the heart of the model.

But importantly, this doesn't mean the LLM can't provide significant value in these other more niche domains. They still can, and I provide this every day in my day job. But it's a lot of work. We (as AI engineers) have to deeply understand the special domain knowledge. The basic process is this:

1. Learn how the subject matter experts do the work.

2. Teach the LLM to do this, using examples, giving it procedures, walking it through the various steps and giving it the guidance and time and space to think. (Multiple prompts, recipes if you will, loops, external memory...)

3. Evaluation, iteration, improvement

4. Scale up to production
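A rough sketch of how steps 2 and 3 might look in code. Everything here is illustrative: `Recipe`, `run_recipe`, and `evaluate` are hypothetical names, and `llm` is just any function that takes a prompt string and returns text, standing in for whatever model client you actually use.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: a model call is any function from prompt text to response text.
LLM = Callable[[str], str]


@dataclass
class Recipe:
    """An expert procedure boiled down into ordered steps, plus worked examples."""
    name: str
    steps: list[str]
    examples: list[tuple[str, str]] = field(default_factory=list)


def run_recipe(recipe: Recipe, task: str, llm: LLM) -> str:
    """Walk the model through each step, one prompt per step.

    Earlier answers are carried forward in `memory`, a simple form of
    external memory, so each step can build on the previous ones.
    """
    memory: list[str] = []
    shots = "\n".join(f"Input: {q}\nOutput: {a}" for q, a in recipe.examples)
    for i, step in enumerate(recipe.steps, start=1):
        prompt = "\n".join([
            f"Task: {task}",
            "Examples:", shots,
            "Notes so far:", *memory,
            f"Step {i}: {step}",
        ])
        memory.append(f"[{i}] {llm(prompt)}")
    return memory[-1] if memory else ""


def evaluate(
    recipe: Recipe,
    cases: list[tuple[str, str]],
    llm: LLM,
    grade: Callable[[str, str], bool],
) -> float:
    """Step 3: fraction of expert-labelled cases the recipe gets right."""
    if not cases:
        return 0.0
    passed = sum(grade(run_recipe(recipe, task, llm), expected) for task, expected in cases)
    return passed / len(cases)
```

The iteration part of step 3 is then editing `steps` and `examples` with the subject matter experts and re-running `evaluate` until the score is good enough to scale up.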

In many domains I work in, it can be very challenging to get past step 1. If I don't know how to do it effectively, I can't guide the LLM through the steps. Consider an example question like "what are the top 5 ways to improve my business" -- the subject matter experts often have difficulty teaching me how to do that. If they don't know how to do it, they can't teach it to me, and I can't teach it to the agent. Another example that will resonate with nerds here is being an effective Dungeons and Dragons DM. But if I actually learn how to do it, and boil it down into repeatable steps, and use GraphRAG, then it becomes another thing entirely. I know this is possible, and expect to see great things in that space, but I estimate it'll take another year or so of development to get it done.

But in many domains, I get access to subject matter experts that can tell me pretty specifically how to succeed in an area. These are the top 5 situations you will see, how you can identify which situation type it is, and what you should do when you see that you are in that kind of situation. In domains like this I can in fact make the agent do awesome work and provide value, even when the information is not in the publicly available training data for the LLM.
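As a minimal sketch of what "the top 5 situations" can look like once the experts have spelled them out: a small playbook that gets rendered into the agent's instructions. `Situation` and `playbook_prompt` are made-up names for illustration, not from any library, and the wording of the generated prompt is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """One situation type as an expert would describe it."""
    name: str
    how_to_recognize: str
    what_to_do: str


def playbook_prompt(situations: list[Situation]) -> str:
    """Render the expert playbook as plain-text instructions for the agent."""
    lines = ["Classify the case as exactly one situation below, then follow its action."]
    for n, s in enumerate(situations, start=1):
        lines.append(f"{n}. {s.name}")
        lines.append(f"   Recognize it by: {s.how_to_recognize}")
        lines.append(f"   Then: {s.what_to_do}")
    lines.append("If none fits, say so instead of guessing.")
    return "\n".join(lines)
```

The value is in the contents, which have to come from the domain experts; the code only keeps "how to identify it" and "what to do" paired up for each situation type.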

There's this thing about knowing a domain area well enough to do the job, but not having enough mastery to teach others how to do the job. You need domain experts that understand the job well enough to teach you how to do it, and you as the AI engineer need enough mastery over the agent to teach it how to do the job as well. Then the magic happens.

When we get AGI, we can proceed past this limitation of needing to know how to do the job ourselves. Until then, this is how we provide impact using agents.

This is why I say that even if LLM technology does not improve any more beyond where it was a year ago, we still have many years' worth of untapped potential for AI. It just takes a lot of work, and most engineers today don't understand how to do that work, principally because they're too busy saying today's technology can't do that work rather than trying to learn how to do it.

akra 9 days ago | parent [-]

> 1. Learn how the subject matter experts do the work.

I think this will get harder over time as the low-hanging-fruit domains are picked: the barrier will be people, not technology. That is especially so if the moat for that domain/company is the knowledge you are trying to acquire (NOTE: for some industries that's not their moat, and using AI to shed more jobs is a win). Most industries that don't have public workings on the internet have a couple of characteristics that will make it extremely difficult to perform Task 1 on your list. The biggest is that every person on the street now knows, through the mainstream news etc., that it's not great to be a software engineer right now, and most media outlets point straight to "AI". "It sucks to be them," I've heard people say. What was once a respected profession is now "how long do you think you have? 5 years? What will you do instead?".

This creates massive resistance, and potentially outright lies, when AI developers ask for information: there is a precedent for what happens if you provide it, and it isn't good for the person/company with the knowledge. Doctors' associations, apprenticeship schemes, and industry bodies I've worked with are all now starting to care a lot more about information security and about protecting proprietary methods of working because of "AI", lest AI accidentally "train on them". As an example, it has definitely boosted the demand for cyber people again around here.

> You are correct! There's lots of information available publicly about certain things like code, and writing SQL queries. But other specialized domains don't have the same kind of information trained into the heart of the model.

That is the nightmare of anyone who studied and invested in a skill set, according to most people you would meet. I think most practitioners will be careful to ensure that the lack of data to train on stays that way for as long as possible. Even if the data eventually gets there, the slower that happens and the more out of date it is, the more useful the human skill and economic value of that person. How many people would have contributed to open source if they knew LLMs were coming, for example? Some may have, but I think there would have been fewer, all else being equal. Maybe quite a bit less code, to the point that AI would have been delayed further. TBH, if Google had known that LLMs could scale to be what they are, they wouldn't have let that "attention" paper be released either, IMO. Anecdotally, even the blue-collar workers I know are now hesitant to let anyone near their methods of working and their craft: survival, family, etc. come first. In the end, after all, work is a means to an end for most people.

Unlike us techies, who I find at times not to be "rational economic actors", many non-tech professionals don't see AI as an opportunity; they see it as a threat that they need to counter. At best they think they need to adopt AI before others have it, and make sure no one else has it. "No one wants this, but if you don't do it others will and you will be left behind" is a common statement from people I've chatted to. One person likened it to a nuclear arms race: not a good thing, but if you don't do it you will be under threat later.

aleph_minus_one 9 days ago | parent [-]

> I think this will get harder over time as the low-hanging-fruit domains are picked: the barrier will be people, not technology. That is especially so if the moat for that domain/company is the knowledge you are trying to acquire (NOTE: for some industries that's not their moat, and using AI to shed more jobs is a win).

Also consider that there are quite a lot of subject matter experts who simply are not AI fanboys, not because they fear losing their job to AI, but because they find the whole AI hype insanely annoying and infuriating. To get them to work with an AI startup, you will thus have to pay them quite a lot of money.

akra 9 days ago | parent [-]

Indeed. I'm already seeing it in software, at least anecdotally, where people's will to post open source code, answer Stack Overflow questions, etc. is drying up (i.e. am I working hard just to train someone else's AI?). It might be too little too late though; there's just too much code out there. This is especially true in niche domains where the advantage isn't the generic code itself but how it is applied (e.g. finance, power; the list goes on).

After all, in a capitalist economy the last to be disrupted generally gets "all the spoils": purchasing power (and hence prices/wages) moves from the least scarce, already disrupted skills to the scarcer ones, which gives the last to be disrupted more time to accumulate wealth/assets and shield themselves from AI even more.