falloutx 3 hours ago

I dispute 1 & 2 more than 4.

1) Is it actually watching the movie frame by frame, or just searching for information about it and then giving you the answer?

2) Again, can it really handle very long novels? Context windows are limited, and it can easily miss something. Where is the proof for this?

4 is probably solved

4) This one is more on the predictor, because it is easy to game: you can create 10k lines of gibberish code with an LLM today without issues. Even a non-technical user can do it.

CjHuber 3 hours ago

I think all of those are terrible indicators. 1 and 2, for example, only measure how well LLMs can handle long contexts.

If a movie or novel is famous, the training data is already full of commentary on and interpretations of it.

If it's something not in the training data: well, I don't know many movies or books that use only motifs no other piece of content before them has used, so interpreting based on what is similar in the training data still produces good results.

EDIT: With 1 I meant using a transcript of the movie's Audio Description. If he really meant watching the movie, I'd say that's even sillier, because of course we could get another agent to first generate the Audio Description, which is definitely possible currently.

zdragnar 3 hours ago

Just yesterday I saw an article about a police station's AI body-cam summarizer mistakenly claiming that a police officer turned into a frog during a call. What actually happened was that the cartoon "The Princess and the Frog" was playing in the background.

Sure, another model might have gotten it right, but I think the prediction was made less in the sense of "this will happen at least once" and more in the sense of "this will not be an uncommon capability".

When the quality is this low (or this variable from model to model), I'm not sure mere context size is the bigger issue.

CjHuber 3 hours ago

My point was not that those video-to-text models are good as they're used in that case; I was referring more generally to that list of indicators. Surely, when analysing a movie, it is alright if some things are misunderstood, especially since the amount of misunderstanding can be reduced a lot. That AI body camera is surely optimized for speed and inference cost.

But if you give an agent ten frames (one per second) along with the transcript for that period and the full prior transcript, and give it reasoning capabilities, it would take almost endlessly long to process a movie, but the result would surely be much better than the body camera's. After all, the indicator talks about "AI" in general, so don't judge it by a model optimized for something other than capability.
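
For what it's worth, a minimal sketch of that kind of pipeline could look something like this, assuming OpenCV for the frame sampling; describe_segment is a hypothetical stand-in for whatever multimodal reasoning model you'd actually call:

    import cv2  # pip install opencv-python

    def sample_frames(video_path, interval_s=1.0, batch=10):
        """Yield batches of `batch` frames, one frame per `interval_s` seconds."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        step = max(1, int(round(fps * interval_s)))
        frames, idx = [], 0
        ok, frame = cap.read()
        while ok:
            if idx % step == 0:
                frames.append(frame)
                if len(frames) == batch:
                    yield frames
                    frames = []
            ok, frame = cap.read()
            idx += 1
        cap.release()
        if frames:
            yield frames

    def describe_segment(frames, segment_transcript, prior_context):
        """Hypothetical model call: ten frames + transcript slice + prior context."""
        raise NotImplementedError("plug in your multimodal model API here")

    def analyse_movie(video_path, transcript_segments):
        # One reasoning call per ~10s of film, each seeing everything understood
        # so far -- the opposite trade-off from a summarizer tuned for speed
        # and inference cost.
        context = ""
        for frames, segment in zip(sample_frames(video_path), transcript_segments):
            context += describe_segment(frames, segment, context) + "\n"
        return context

Feeding the full accumulated context into every call is exactly what makes this slow, which is the capability-vs-cost trade-off I'm describing.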