mingtianzhang 3 days ago

We are always looking for representations that can capture the meaning of information. However, most representations that compress information for retrieval are lossy. For example, embeddings are a form of lossy compression. Similar to the no-free-lunch theorem, no lossy compression method is universally better than another, since downstream tasks may depend on the specific information that gets lost. Therefore, the question is not which representation is perfect, but which representation is best aligned with the AI system that consumes it. Because AI evolves rapidly, it is difficult to predict the limitations of the next generation of LLMs, so a good representation for information retrieval in future LLM systems should stay close to how humans represent knowledge.
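
As a minimal sketch of that lossiness (everything here is illustrative: a toy bag-of-words embed() stands in for a real embedding model, and the sample documents are made up):

    # Minimal sketch: a fixed-size embedding compresses a document, so
    # nearest-neighbour retrieval can blur details a downstream task needs.
    # `embed` is a toy bag-of-words hasher standing in for a real model.
    import numpy as np

    def embed(text: str, dim: int = 64) -> np.ndarray:
        v = np.zeros(dim)
        for tok in text.lower().split():
            v[hash(tok) % dim] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    docs = [
        "Q3 revenue was $4.1M, up 12% year over year.",
        "Q3 revenue was $4.7M, up 12% year over year.",  # differs in one figure
    ]
    query = "what was Q3 revenue"

    scores = [cosine(embed(query), embed(d)) for d in docs]
    # The two documents differ in a single token, so their vectors are nearly
    # identical and the ranking cannot reliably surface the one holding the
    # exact figure the question needs; that detail is what the compression lost.
    print(docs[int(np.argmax(scores))], scores)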

When a human tries to retrieve information in a library, they first locate a book by category or by using a metadata keyword search. Then, they open the table of contents (ToC) to find the relevant section, and repeat this process as needed. Therefore, I believe the future of AI retrieval systems should mimic this process. The recently popular PageIndex approach (see this discussion: https://news.ycombinator.com/item?id=45036944) also belongs to this category, as it generates a table-of-contents–like tree for LLMs to reason over. Again, it is a form of lossy compression, so its limitations can be discussed. However, this approach is the closest to how humans perform retrieval.
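
A rough sketch of that traversal (the Node layout and the keyword-overlap scorer are my own illustrative assumptions, not PageIndex's actual API; in a real system an LLM would make the choice at each node):

    # ToC-style retrieval: descend a table-of-contents tree, choosing the
    # most relevant child at each level, until a leaf section is reached.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        title: str
        summary: str = ""
        text: str = ""                          # non-empty only at leaf sections
        children: "list[Node]" = field(default_factory=list)

    def score(query: str, node: Node) -> int:
        """Placeholder relevance score: keyword overlap with title + summary."""
        words = set(query.lower().split())
        return len(words & set((node.title + " " + node.summary).lower().split()))

    def retrieve(query: str, node: Node) -> str:
        """Walk the ToC tree like a human: pick the best section, then recurse."""
        if not node.children:
            return node.text
        best = max(node.children, key=lambda c: score(query, c))
        return retrieve(query, best)

    book = Node("Employee handbook", children=[
        Node("Benefits", "health insurance, retirement", children=[
            Node("Retirement", "401k matching", text="The company matches 401k up to 5%."),
            Node("Health", "medical and dental plans", text="Two medical plans are offered."),
        ]),
        Node("Time off", "vacation and sick leave", text="Employees accrue 15 vacation days."),
    ])

    print(retrieve("how much is the 401k retirement matching", book))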

mingtianzhang 3 days ago | parent | next [-]

A follow-up question is: what is the lossless way to represent knowledge? That would mean reading all the knowledge at once, which is the most accurate but also the least efficient method. Therefore, for different applications, we need to find an appropriate trade-off between accuracy and efficiency. In systems like real-time recommendation, we prefer efficiency over accuracy, so vector-based search is suitable. In domain-specific QA, we prefer accuracy over efficiency, so a table-of-contents–based search may be the better choice.

jandrewrogers 3 days ago | parent | next [-]

This is the subject of the Hutter Prize and the algorithmic information theory that underpins it. There are some hard algorithm and data structure problems underlying lossless approximations of general learning even for relatively closed domains.

As an example, current AI is famously very poor at learning relationships between non-scalar types, like complex geometry, which humans learn with ease. That isn’t too surprising because the same representation problem exists in non-AI computer science.

mingtianzhang 3 days ago | parent | prev [-]

It is also worth mentioning that compression and generative AI are two sides of the same coin. I highly recommend the book "Information Theory, Inference, and Learning Algorithms" by David MacKay, which explores these deep connections.
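
For a concrete toy illustration of that connection (the two hand-written "models" and the alternating string below are made up for the example): an ideal entropy coder spends -log2(p) bits on a symbol that the model assigns probability p, so a better generative model is, by construction, a better compressor.

    # Sketch of the compression/generation link: total code length under an
    # ideal arithmetic coder is the sum of -log2(p) over predicted symbols.
    import math

    text = "abababababababab"

    def code_length(text: str, prob) -> float:
        """Total bits an ideal coder would spend on `text` under model `prob`."""
        return sum(-math.log2(prob(text[:i], ch)) for i, ch in enumerate(text))

    # Model 1: knows nothing, assigns uniform probability over {a, b}.
    uniform = lambda context, ch: 0.5

    # Model 2: has learned the alternating pattern, so it predicts confidently.
    def alternating(context, ch):
        if not context:
            return 0.5
        expected = "b" if context[-1] == "a" else "a"
        return 0.9 if ch == expected else 0.1

    print(code_length(text, uniform))      # 16.0 bits
    print(code_length(text, alternating))  # ~3.3 bits: better generator, better compressor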

anothernewdude 3 days ago | parent | prev | next [-]

> Similar to the no-free-lunch theorem, no lossy compression method is universally better than another,

The no-free-lunch theorem only works because it assumes you care about every single value of noise. Nobody does. There's a free lunch to be had, and it's order. You don't care about a single-pixel difference between two cat pictures; NFL does.

Lossy compression is precisely where NFL does not apply.

mingtianzhang 3 days ago | parent [-]

I only meant it is similar in spirit to that theorem: I am trying to emphasise that no lossy representation is universally better than another (i.e., for all downstream tasks).

simne 3 days ago | parent | prev | next [-]

It looks like people use a hybrid approach: all these ToCs, metadata, etc. are essentially semantic structures (which a neural network then reasons over), while only the recognition of text and of characters is purely neural.

mingtianzhang 3 days ago | parent [-]

Yeah, I guess another point is that with a ToC or metadata, the representation is transparent: people know what information is lost. On the other hand, a vector-based representation is a black box. Explainability and transparency are also important considerations in production-level AI systems.

quadhome 3 days ago | parent | prev [-]

Humans only retrieve information in a library in that way due to past limitations on retrieval and processing. The invention of technologies like tables of contents or even the Dewey Decimal Classification was strongly constrained by fundamental technologies like ... the alphabet! And remember, not all languages are alphabetic. And embeddings aren't alphabetic and don't share the same constraints.

I recommend Judith Flanders' "A Place for Everything" as both a history and a survey of the constraints in sorting and organising information in an alphabetic language. It's also a fun read!

tl;dr why would we want an LLM to do something as inefficiently as a human?

mingtianzhang 2 days ago | parent [-]

"why would we want an LLM do something as inefficiently as a human?" -- That is a good point. Maybe we should rename artificial intelligence (AI) to super-artificial intelligence (SAI).