Zee2 14 hours ago

Seems like the search is based only on the transcript/dialogue - not an image embedding. Would be super cool to actually use some CLIP/embedding search on these for a more effective fuzzy lookup.

petercooper 4 hours ago

Agreed. If you search for Barney, say, none of the top ten results picture him at all; they're mostly people speaking to or about him. Even running the frames through a vision LLM to get a list of keywords would beat the subtitles, I suspect.
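
Something like this, say, with the OpenAI Python SDK (a rough, untested sketch; the model name and prompt are just placeholders, and keywords_for_frame is a name I made up):

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def keywords_for_frame(path):
        # Send one frame to a vision-capable model and ask for keywords.
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; any vision-capable model works
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "List the characters and objects "
                        "visible in this frame as comma-separated keywords."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

Index those keywords alongside the subtitles and a search for Barney would at least match frames he's actually in.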

adzm 12 hours ago

How would someone go about doing this, just curious?

wincy 12 hours ago

You’d just run every picture through CLIP’s image encoder. It’s not quite running an image generator backwards; CLIP is the model things like Stable Diffusion use to tie text and images together (been awhile since I’ve done this). It maps both images and text into the same embedding space, so you embed every frame once, embed the search text the same way, and rank frames by how close their embeddings are (cosine similarity) to the query’s.
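
A minimal sketch with Hugging Face transformers (untested, and frame_paths is just a stand-in for however you've pulled stills out of the episodes):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # One-time, offline: embed every frame with the image encoder.
    images = [Image.open(p) for p in frame_paths]  # frame_paths: your stills
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        image_emb = model.get_image_features(**inputs)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

    # Per query: embed the text and rank frames by cosine similarity.
    text_inputs = processor(text=["Barney at Moe's"], return_tensors="pt",
                            padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    scores = (image_emb @ text_emb.T).squeeze(1)
    top10 = scores.topk(10).indices  # the ten closest frames to the query

At scale you'd put the image embeddings in a vector index (FAISS or similar) instead of the brute-force matmul, but the idea is the same.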

I’d guess famous characters like Bart and Marge and the rest of the Simpsons cast are all over CLIP’s training data, so queries for them should land pretty reliably.

Feel free to correct me on the small details if anyone has this fresher in their mind, but I think that’s roughly right.