jeffjeffbear 3 days ago

> https://huggingface.co/google/t5gemma-2-1b-1b

From here it looks like it's still long-context and multimodal, though?

> Inputs and outputs
>
> Input:
>
> - Text string, such as a question, a prompt, or a document to be summarized
> - Images, normalized to 896 x 896 resolution and encoded to 256 tokens each
> - Total input context of 128K tokens
>
> Output:
>
> - Generated text in response to the input, such as an answer to a question, analysis of image content, or a summary of a document
> - Total output context up to 32K tokens

rhdunn 3 days ago | parent [-]

If you are finetuning the model you need to replicate the training conditions so you don't remove those capabilities. If you just finetune a multi-modal model on text, it will lose some of its vision capabilities because the text part of the model drifts away from the vision, audio, etc. components. A similar thing happens when finetuning reasoning models.
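
For illustration, a rough sketch of what "replicating the training conditions" can look like in practice. This is my own sketch, not anything from the model card: the auto class used to load the model, the "vision" parameter-name filter, and the replay interval are all assumptions that would need checking against the actual architecture.

    # Sketch only: two common ways to limit modality drift when finetuning a
    # multimodal model on mostly-text data. Which auto class loads this model
    # and how its vision modules are named are assumptions, not facts from
    # the model card.
    import torch
    from transformers import AutoProcessor, AutoModelForImageTextToText

    model_id = "google/t5gemma-2-1b-1b"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Option 1: freeze the vision encoder so text-only gradients can't pull it
    # out of alignment with the language backbone.
    for name, param in model.named_parameters():
        if "vision" in name:
            param.requires_grad = False

    # Option 2: interleave some image+text "replay" batches so the cross-modal
    # pathway keeps receiving gradient signal during the finetune.
    def mix_batches(text_batches, image_text_batches, replay_every=5):
        for i, batch in enumerate(text_batches):
            yield batch
            if i % replay_every == 0:
                yield next(image_text_batches)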

Even if you did finetune the model on text and images, you could run into issues if your image descriptions differ in style from the ones it was trained on. You could probably work around that by getting the model to describe the images itself, but you'd still need to audit the results to correct any errors or add whatever you're actually training for.
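
Roughly, as a sketch (the helper and prompt are mine; the processor/generate calls are just the standard transformers multimodal pattern, and every draft still needs a human pass):

    # Sketch only: draft captions with the model itself so the finetuning text
    # stays close to the description style it already produces.
    from PIL import Image

    def draft_caption(model, processor, image_path, prompt="Describe this image."):
        image = Image.open(image_path)
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=128)
        return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

    # Audit each draft by hand to fix mistakes and add the task-specific
    # detail the finetune is actually for.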

You can also run into overfitting if your data doesn't cover enough of the variation that the original model's training data had.

Using different training parameters could also affect the model's capabilities. Just knowing things like the input context length isn't enough.
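
To make "training parameters" concrete, this is the kind of thing that has to roughly match. The numeric values below are placeholders I made up, not the actual T5Gemma 2 recipe; only the context lengths and image size come from the model card quoted above.

    # Placeholder values only -- the point is that these need to roughly match
    # the original training setup, not just the input context length.
    finetune_config = {
        "max_input_tokens": 128_000,    # from the model card quoted above
        "max_output_tokens": 32_000,    # from the model card quoted above
        "image_resolution": (896, 896), # from the model card quoted above
        "learning_rate": 1e-5,          # placeholder
        "warmup_ratio": 0.03,           # placeholder
        "weight_decay": 0.0,            # placeholder
        "dropout_rate": 0.0,            # placeholder
    }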

CuriouslyC 3 days ago | parent | next [-]

This is the thing that kills me about SFT. It made sense when most of the compute in a model went into pretraining and RL was mostly for question answering. Now that RL is driving model capabilities it doesn't make much sense.

On the other hand, RL on deployed systems looks promising for essentially JIT-optimizing models. Experiments with model routers and agentic RAG have shown good results.

navvyeanand 3 days ago | parent | prev [-]

This is very true. However, I wonder how much of this could be mitigated by using training data from other open-source models, like Olmo3 for textual data and Emu3.5 for vision?