Multimodal Models
Models that can see images, hear audio, and talk back — and how they change the VRAM math.
Beyond Text: What "Multimodal" Means
A standard LLM works with text in, text out. A multimodal model can process — and sometimes produce — more than one type of data. The three main modality combinations you'll encounter locally are:
- Vision-Language Models (VLMs): accept images as input alongside text. The most common and widely available category.
- Audio models: transcribe speech (Whisper), synthesize speech (Kokoro, Coqui), or understand spoken language end-to-end.
- Omni models: handle multiple modalities simultaneously — text, image, audio, and sometimes video — in one model. NVIDIA's Nemotron Nano Omni 8B is a local-friendly example.
How Vision-Language Models Work
A VLM adds a vision encoder to a standard text LLM. When you upload an image, here is what happens under the hood:
- The image is passed through a vision encoder — typically a CLIP or SigLIP-based model — that converts it into a grid of feature vectors.
- A small projection layer (sometimes called an adapter) maps those visual features into the same embedding space as text tokens.
- The resulting "image tokens" are inserted into the prompt sequence alongside your text question and fed to the LLM backbone.
- The LLM generates a text response conditioned on both the image and text context.
Image → Vision Encoder → Projection → LLM Backbone → Text Response
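The flow above can be sketched as a toy pipeline. This is a minimal illustration and not any particular model's implementation: the encoder returns random placeholder vectors, the projection is a single linear layer, and the widths (`VISION_DIM`, `LLM_DIM`) are deliberately tiny, hypothetical values. The 336-pixel input and 14-pixel patch size are assumptions chosen to match a common CLIP ViT-L configuration; models that tile images or merge patches will produce a different number of image tokens.

```python
import random

IMAGE_SIZE = 336   # pixels per side (assumed square input)
PATCH_SIZE = 14    # pixels per patch side
VISION_DIM = 8     # toy width; real encoders are far wider
LLM_DIM = 16       # toy width; real LLM embeddings are in the thousands


def vision_encoder(image_size: int, patch_size: int, dim: int) -> list[list[float]]:
    """Stand-in for a ViT: one feature vector per image patch (random values)."""
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side * patches_per_side
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_patches)]


def project(features: list[list[float]], weight: list[list[float]]) -> list[list[float]]:
    """Linear projection: map each vision vector into the LLM embedding space."""
    out_dim = len(weight[0])
    return [
        [sum(vec[i] * weight[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


def build_llm_input(image_embeds: list[list[float]], text_embeds: list[list[float]]) -> list[list[float]]:
    """Place the image tokens ahead of the text tokens in one sequence."""
    return image_embeds + text_embeds


random.seed(0)
features = vision_encoder(IMAGE_SIZE, PATCH_SIZE, VISION_DIM)
weight = [[random.gauss(0.0, 0.1) for _ in range(LLM_DIM)] for _ in range(VISION_DIM)]
image_embeds = project(features, weight)
text_embeds = [[0.0] * LLM_DIM for _ in range(12)]  # pretend 12-token question
sequence = build_llm_input(image_embeds, text_embeds)

# (336 / 14)^2 = 576 image tokens, each LLM_DIM wide, plus 12 text tokens
print(len(features), len(image_embeds[0]), len(sequence))  # 576 16 588
```

The point to take away is the sequence length: a single image can occupy hundreds of positions in the context window before the question is even counted.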
Local VLM Model Examples
| Model | Size | VRAM (Q4) | Vision Encoder | Strengths |
|---|---|---|---|---|
| Llama 3.2 Vision 11B | 11B | ~9–10 GB | Custom ViT | Strong instruction-following, Meta's flagship small VLM |
| Llama 3.2 Vision 90B | 90B | ~55+ GB | Custom ViT | Near frontier accuracy, requires DGX Spark or multi-GPU |
| Qwen2-VL 7B | 7B | ~6–7 GB | NaViT | Excellent OCR, document understanding, dynamic resolution |
| Qwen2-VL 72B | 72B | ~45+ GB | NaViT | Top-tier vision benchmark scores, close to GPT-4V level |
| Nemotron Nano Omni 8B | 8B | ~7–8 GB | Multi-modal (vision + audio) | NVIDIA omni model: text, image, and audio in one; RTX-optimized |
| LLaVA 1.6 13B | 13B | ~10–11 GB | CLIP ViT-L | Well-established VLM, widely supported by Ollama and LM Studio |
| MiniCPM-V 2.6 | 8B | ~6–7 GB | SigLIP | Exceptionally high resolution support, video frame understanding |
VRAM Math: VLMs vs Text-Only Models
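A VLM loads two things into VRAM: the LLM backbone and the vision encoder. The vision encoder is typically a ViT (Vision Transformer) with 300M–1B parameters. At 16-bit precision that is roughly 0.6–2 GB of weights, and once the projection layer and image-processing buffers are included the total overhead comes to roughly 1–3 GB on top of the text-only equivalent.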
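As a rough back-of-the-envelope check, the arithmetic can be written out as below. This is a sketch under stated assumptions, not a measurement: it assumes the LLM backbone is stored at about 4.5 bits per weight (Q4 formats carry scale factors on top of the 4-bit values), that the vision encoder stays at 16-bit precision, and that a flat `overhead_gb` placeholder covers the KV cache and runtime buffers. The 7B backbone and 0.6B encoder are illustrative sizes, not the specification of a particular model.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def vlm_vram_gb(
    llm_params_b: float,
    encoder_params_b: float,
    llm_bits: float = 4.5,       # assumed effective bits per weight for Q4
    encoder_bits: float = 16.0,  # assumed: encoder kept at 16-bit precision
    overhead_gb: float = 1.0,    # assumed placeholder for KV cache and buffers
) -> float:
    """Estimated VRAM for a quantized LLM backbone plus a vision encoder."""
    return (
        weights_gb(llm_params_b, llm_bits)
        + weights_gb(encoder_params_b, encoder_bits)
        + overhead_gb
    )


text_only = vlm_vram_gb(7, 0.0)   # same backbone with no vision encoder
with_vision = vlm_vram_gb(7, 0.6)

print(round(text_only, 1), round(with_vision, 1))  # 4.9 6.1
```

Because the encoder is not quantized in this sketch, its share of memory is larger than its parameter count alone would suggest, which is why a sub-1B encoder can still add more than a gigabyte.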
Practical Use Cases
Analyze Photos
Describe scenes, identify objects, count items, read text in images, or answer specific questions about what the camera captured. Useful for cataloguing photos, checking product labels, or understanding diagrams.
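As one possible way to try this, a photo question can be sent to a VLM served by Ollama over its local REST API. The sketch below assumes Ollama is running on its default port and that a vision-capable model has already been pulled; the `llava:13b` tag, the function names, and the file name in the usage comment are illustrative assumptions.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(image_bytes: bytes, question: str, model: str = "llava:13b") -> dict:
    """Build a generate request; images are passed as base64-encoded strings."""
    return {
        "model": model,
        "prompt": question,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one JSON object instead of a token stream
    }


def ask_about_image(image_path: str, question: str, model: str = "llava:13b") -> str:
    """Send an image and a question to the local server and return the answer text."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), question, model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Usage (requires a running Ollama server and a local image file):
# print(ask_about_image("label.jpg", "What ingredients are listed on this label?"))
```

The same call pattern applies to the document, chart, and screenshot use cases below; only the image and the question change.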
Document & PDF Understanding
Screenshot a PDF page or scanned document and ask the model to extract key data, summarize sections, or answer questions. VLMs like Qwen2-VL excel at complex multi-column layouts that confuse simple OCR tools.
Chart & Graph Interpretation
Upload a bar chart, line graph, or infographic and ask the model to describe trends, identify anomalies, or extract specific data points in structured form.
OCR Replacement
Read handwritten notes, receipts, whiteboards, or printed forms. Modern VLMs handle varied fonts, angles, and mixed languages better than many dedicated OCR tools — and they can reason about the content simultaneously.
Vision-Based Coding Help
Screenshot a UI, error dialog, or code snippet and ask the model to explain it or suggest fixes. Useful when you can't easily copy-paste code from a screenshot or locked-down application.
Speech Transcription
OpenAI Whisper runs locally (via faster-whisper or whisper.cpp) and transcribes audio with accuracy close to that of cloud transcription services. Combine it with an LLM for meeting summaries or voice-controlled local AI assistants.
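A minimal sketch of local transcription with faster-whisper might look like the following. It assumes the `faster-whisper` package is installed and a CUDA GPU is available; the `small` model size, the helper names, and the timestamp format are illustrative choices. The library import sits inside `transcribe_file` so that the formatting helpers can be tried without the package, and the demo at the bottom uses hand-written stand-in segments rather than real audio.

```python
from types import SimpleNamespace


def format_timestamp(seconds: float) -> str:
    """Render a time offset as MM:SS, dropping fractional seconds."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def segments_to_transcript(segments) -> str:
    """Join segments (objects with .start, .end, .text) into timestamped lines."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg.start)
        end = format_timestamp(seg.end)
        lines.append(f"[{start} - {end}] {seg.text.strip()}")
    return "\n".join(lines)


def transcribe_file(audio_path: str, model_size: str = "small") -> str:
    """Transcribe an audio file on the GPU and return a timestamped transcript."""
    # Imported lazily so the helpers above work without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return segments_to_transcript(segments)


# Demo with stand-in segments (no audio file or GPU needed).
fake_segments = [
    SimpleNamespace(start=0.0, end=4.2, text=" Hello there."),
    SimpleNamespace(start=65.0, end=70.9, text=" Next item."),
]
print(segments_to_transcript(fake_segments))
```

The resulting transcript string can then be passed to a local LLM as ordinary prompt text, for example to request a meeting summary.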
Llama 3.2 Vision 11B and Qwen2-VL 7B both fit comfortably in 12–16 GB of VRAM, the sweet spot covering cards from the RTX 4070 (12 GB) through the RTX 4080 Super (16 GB). The 72B and 90B models need roughly 45–55 GB or more at Q4 and belong in DGX Spark territory, where 128 GB of unified memory handles them without breaking a sweat. Nemotron Nano Omni 8B is specifically tuned for RTX GPUs with CUDA acceleration, making it a natural first choice for NVIDIA hardware owners who want to experiment with omni-modal capabilities.