Can this model be used with image or text-only inputs, in addition to video?

#4
by lunahr - opened

The video capability is cool, but can the model also run inference on images, or on text alone?

Also, would freezing the vision encoder let me fine-tune only the LLM part of the model while keeping its ability to use the vision encoder?
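
To make the freezing question concrete, here is a minimal PyTorch sketch of what I have in mind. It uses a toy stand-in model, not this repository's code: the class `ToyVisionLanguageModel` and the submodule names `vision_encoder` and `language_model` are hypothetical, and the real attribute names may differ. It only shows the mechanics of freezing; whether the vision capability holds up after real fine-tuning is exactly what I am asking.

```python
import torch
from torch import nn


class ToyVisionLanguageModel(nn.Module):
    """Hypothetical stand-in for a vision-language model; not the real architecture."""

    def __init__(self, image_dim=16, hidden_dim=8, vocab_size=32):
        super().__init__()
        self.vision_encoder = nn.Linear(image_dim, hidden_dim)   # assumed submodule name
        self.language_model = nn.Linear(hidden_dim, vocab_size)  # assumed submodule name

    def forward(self, image_features):
        return self.language_model(self.vision_encoder(image_features))


def freeze_vision_encoder(model):
    """Disable gradients for the vision encoder so only the other parts are trained."""
    for param in model.vision_encoder.parameters():
        param.requires_grad = False
    model.vision_encoder.eval()  # keeps dropout/norm layers (if any) in inference mode
    return model


model = freeze_vision_encoder(ToyVisionLanguageModel())

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

vision_before = model.vision_encoder.weight.detach().clone()
llm_before = model.language_model.weight.detach().clone()

# One dummy training step on random data.
inputs = torch.randn(4, 16)
targets = torch.randint(0, 32, (4,))
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()

vision_unchanged = torch.equal(vision_before, model.vision_encoder.weight)
llm_changed = not torch.equal(llm_before, model.language_model.weight)
```

In this toy setup the encoder weights stay fixed while the language-model weights are updated. Note that calling `model.train()` afterwards would put the encoder back into training mode, so the `eval()` call would need to be repeated.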
