adastra22 10 hours ago
What I mean is that all processing in an LLM occurs in state space. The next-token prediction is the very last step.
uniqueuid 10 hours ago | parent
There are many more weird and complex architectures in models for video understanding. For example, beyond video -> text -> LLM and video -> embedding -> LLM, you can also have an LLM controlling or guiding a separate video extractor. See this paper for a pretty thorough overview:

Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
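To make the three wiring patterns concrete, here is a minimal sketch in Python. Every function here (video_to_text, video_to_embedding, llm) is a hypothetical placeholder stub standing in for a real captioner, visual encoder, or language model, not an actual API from the survey:

```python
# Hypothetical stubs -- not a real API, just illustrating the data flow.

def video_to_text(video):
    # Stand-in for a captioner that describes the clip in text.
    return f"caption of {video}"

def video_to_embedding(video):
    # Stand-in for a visual encoder producing embedding vectors.
    return [0.1, 0.2, 0.3]

def llm(prompt, embeddings=None):
    # Stand-in for a language model call, optionally consuming embeddings.
    mode = "embeddings" if embeddings else "text only"
    return f"answer to {prompt!r} ({mode})"

# Pattern 1: video -> text -> LLM
def pipeline_text(video, question):
    caption = video_to_text(video)
    return llm(f"{question}\ncontext: {caption}")

# Pattern 2: video -> embeddings fed directly into the LLM
def pipeline_embedding(video, question):
    return llm(question, embeddings=video_to_embedding(video))

# Pattern 3: LLM as controller, guiding a separate video extractor
def pipeline_controller(video, question):
    plan = llm(f"which part of {video} is relevant to: {question}")
    extracted = video_to_text(f"{video} [segment chosen by: {plan}]")
    return llm(f"{question}\ncontext: {extracted}")
```

The point of the sketch is only the topology: in the first two patterns the extractor runs once before the LLM, while in the third the LLM sits in the loop and decides what the extractor processes.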