Remix.run Logo
stillsut 4 days ago

I think the "magic" that we've found a common toolset of methods - embeddings and layers of neural networks - that seem to reveal useful patterns and relationships from a vast array of corpus of unstructured analog sensors (pictures, video, point clouds) and symbolic (text, music) and that we can combine these across modalities like CLIP.

It turns out we didn't need a specialist technique for each domain, there was a reliable method to architect a model that can learn itself, and we could already use the datasets we had, they didn't need to be generated in surveys or experiments. This might seem like magic to an AI researcher working in the 1990's.