Can this model be used with image or text-only inputs, in addition to video?

by lunahr - opened 1 day ago

The video capability is cool, but can the model also run inference on images, or on text alone?

Also, would freezing the vision encoder work for fine-tuning only the LLM part of the model, while keeping its ability to use the vision encoder?
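
To make the freezing question concrete, here is a rough PyTorch sketch of what I have in mind. ToyVLM and its attribute names (vision_encoder, language_model) are hypothetical stand-ins, not this model's real module names, so the actual submodules would need to be looked up in the checkpoint.

```python
import torch
from torch import nn


class ToyVLM(nn.Module):
    """Hypothetical stand-in for a vision-language model.

    The attribute names below are assumptions for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 4)   # placeholder for the vision tower
        self.language_model = nn.Linear(4, 2)   # placeholder for the LLM part

    def forward(self, pixels):
        return self.language_model(self.vision_encoder(pixels))


def freeze(module: nn.Module) -> None:
    # Stop gradients for every parameter in the submodule.
    for p in module.parameters():
        p.requires_grad_(False)


model = ToyVLM()
freeze(model.vision_encoder)

# Hand only the still-trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One dummy training step; the frozen encoder weights should not change.
before = model.vision_encoder.weight.detach().clone()
loss = model(torch.randn(3, 8)).sum()
loss.backward()
optimizer.step()
```

In this toy setup the encoder still runs in the forward pass, so its features reach the LLM part, but only the LLM parameters are updated. Whether that preserves the real model's vision capabilities after fine-tuning is exactly what I am asking.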
