Multimodal Models — Local AI Guide

Beyond Text: What "Multimodal" Means

A standard LLM works with text in, text out. A multimodal model can process, and sometimes produce, more than one type of data. The three main categories you'll encounter locally are:

Vision-Language Models (VLMs): accept images as input alongside text. The most common and widely available category.
Audio models: transcribe speech (Whisper), synthesize speech (Kokoro, Coqui), or understand spoken language end-to-end.
Omni models: handle multiple modalities simultaneously — text, image, audio, and sometimes video — in one model. NVIDIA's Nemotron Nano Omni 8B is a local-friendly example.

How Vision-Language Models Work

A VLM adds a vision encoder to a standard text LLM. When you upload an image, here is what happens under the hood:

The image is passed through a vision encoder — typically a CLIP or SigLIP-based model — that converts it into a grid of feature vectors.
A small projection layer (sometimes called an adapter) maps those visual features into the same embedding space as text tokens.
The resulting "image tokens" are inserted into the prompt sequence alongside your text question and fed to the LLM backbone.
The LLM generates a text response conditioned on both the image and text context.

Image → Vision Encoder → Projection → LLM Backbone → Text Response
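The pipeline above can be sketched as a toy shape walkthrough. This is a minimal illustration, not any real model's implementation: random numbers stand in for a trained encoder, the dimensions (VISION_DIM, MODEL_DIM) are deliberately tiny, and all function names are hypothetical. The 336-pixel input with 14-pixel patches mirrors a common CLIP ViT-L/14 configuration, which yields a 24 × 24 grid of 576 image tokens.

```python
import random

VISION_DIM = 8    # hypothetical encoder feature size (real encoders are far larger)
MODEL_DIM = 16    # hypothetical LLM embedding size (real LLMs are far larger)


def encode_image(width, height, patch, rng):
    """Stand-in vision encoder: one feature vector per image patch."""
    rows, cols = height // patch, width // patch
    return [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
            for _ in range(rows * cols)]


def project(features, weight):
    """Linear projection from VISION_DIM to MODEL_DIM (the adapter)."""
    return [[sum(f[i] * weight[i][j] for i in range(VISION_DIM))
             for j in range(MODEL_DIM)]
            for f in features]


def build_sequence(text_before, image_tokens, text_after):
    """Splice image tokens into the prompt between text embeddings."""
    return text_before + image_tokens + text_after


rng = random.Random(0)
weight = [[rng.gauss(0.0, 0.1) for _ in range(MODEL_DIM)]
          for _ in range(VISION_DIM)]

features = encode_image(336, 336, patch=14, rng=rng)   # 24 x 24 patches
image_tokens = project(features, weight)

# Pretend the text prompt embeds to 5 tokens before and 7 after the image.
text_before = [[0.0] * MODEL_DIM for _ in range(5)]
text_after = [[0.0] * MODEL_DIM for _ in range(7)]
sequence = build_sequence(text_before, image_tokens, text_after)

print(len(image_tokens), len(sequence))  # prints: 576 588
```

The point to notice is that a single image can contribute far more tokens to the sequence than the text question does, which is why image inputs consume context length and memory.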

Local VLM Model Examples

Model	Size	VRAM (Q4)	Vision Encoder	Strengths
Llama 3.2 Vision 11B	11B	~9–10 GB	Custom ViT	Strong instruction-following, Meta's flagship small VLM
Llama 3.2 Vision 90B	90B	~55+ GB	Custom ViT	Near frontier accuracy, requires Spark or multi-GPU
Qwen2-VL 7B	7B	~6–7 GB	NaViT	Excellent OCR, document understanding, dynamic resolution
Qwen2-VL 72B	72B	~45+ GB	NaViT	Top-tier vision benchmark scores, close to GPT-4V level
Nemotron Nano Omni 8B	8B	~7–8 GB	Multi-modal (vision + audio)	NVIDIA omni model: text, image, and audio in one; RTX-optimized
LLaVA 1.6 13B	13B	~10–11 GB	CLIP ViT-L	Well-established VLM, widely supported by Ollama and LM Studio
MiniCPM-V 2.6	8B	~6–7 GB	SigLIP	Exceptionally high resolution support, video frame understanding

VRAM Math: VLMs vs Text-Only Models

A VLM loads two things into VRAM: the LLM backbone and the vision encoder. The vision encoder is typically a ViT (Vision Transformer) model with 300M–1B parameters — adding roughly 1–3 GB of VRAM overhead on top of the text-only equivalent.

Rule of thumb: Budget an extra 1–3 GB of VRAM compared to the same-sized text model. Comparisons across model sizes show a larger gap: a Llama 3.1 8B text model at Q4 uses ~5 GB, while Llama 3.2 Vision 11B at Q4 uses ~9–10 GB. That extra 4–5 GB covers both the parameter increase from 8B to 11B and the vision encoder. Image resolution also matters: higher-resolution images generate more visual tokens and temporarily spike VRAM usage during inference.
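As a rough back-of-the-envelope check on this rule of thumb, the sketch below estimates VRAM from parameter counts. Everything in it is an assumption for illustration: the function name, the ~4.5 effective bits per weight for Q4 quantization, keeping the vision encoder at 16-bit precision, and the flat runtime overhead. Real usage depends on the quantization format, context length, and runtime.

```python
def estimate_vram_gb(llm_params_b, bits_per_weight=4.5,
                     vision_params_b=0.0, vision_bits=16,
                     overhead_gb=0.5):
    """Rough VRAM estimate in GB.

    llm_params_b and vision_params_b are parameter counts in billions.
    bits_per_weight=4.5 is an assumed effective size for Q4 quantization.
    overhead_gb is an assumed flat allowance for KV cache and buffers.
    """
    llm_gb = llm_params_b * bits_per_weight / 8
    vision_gb = vision_params_b * vision_bits / 8
    return llm_gb + vision_gb + overhead_gb


text_only = estimate_vram_gb(8)                          # 8B text model
with_vision = estimate_vram_gb(8, vision_params_b=0.5)   # same LLM + 0.5B encoder

print(f"text-only: {text_only:.1f} GB")                       # text-only: 5.0 GB
print(f"with vision encoder: {with_vision:.1f} GB")           # with vision encoder: 6.0 GB
print(f"vision overhead: {with_vision - text_only:.1f} GB")   # vision overhead: 1.0 GB
```

Under these assumptions an 8B text model lands at about 5 GB, and a 0.5B encoder adds about 1 GB, the low end of the 1–3 GB range.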

Practical Use Cases

Analyze Photos

Describe scenes, identify objects, count items, read text in images, or answer specific questions about what the camera captured. Useful for cataloguing photos, checking product labels, or understanding diagrams.
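One way to try this locally is through Ollama's REST API, which accepts base64-encoded images on its /api/generate endpoint. The sketch below is a minimal example under stated assumptions: an Ollama server is running on its default port, a vision model has already been pulled, and the model tag llava:13b and the file name shelf.jpg are placeholders rather than recommendations.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port


def build_payload(image_bytes, question, model="llava:13b"):
    """Build a non-streaming /api/generate request body with one image."""
    return {
        "model": model,  # assumed tag; use whichever vision model you pulled
        "prompt": question,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_about_image(image_path, question, model="llava:13b"):
    """Send an image and a question to a running Ollama server."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), question, model)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Example call (requires a running Ollama server and a local image file):
# print(ask_about_image("shelf.jpg", "How many bottles are on the shelf?"))
```

Keeping the payload builder separate from the network call makes it easy to inspect exactly what is sent before pointing it at a real server.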

Document & PDF Understanding

Screenshot a PDF page or scanned document and ask the model to extract key data, summarize sections, or answer questions. VLMs like Qwen2-VL excel at complex multi-column layouts that confuse simple OCR tools.

Chart & Graph Interpretation

Upload a bar chart, line graph, or infographic and ask the model to describe trends, identify anomalies, or extract specific data points in structured form.
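Getting data points "in structured form" usually means asking for JSON and parsing the reply defensively, since models often wrap the JSON in extra prose. The sketch below shows one possible approach; the prompt wording, the helper names, and the sample reply are all made up for illustration.

```python
import json


def chart_prompt(fields):
    """Build a prompt asking for chart data as a JSON array of objects."""
    keys = ", ".join(f'"{name}"' for name in fields)
    return (
        "Extract every data point from this chart. "
        f"Reply with only a JSON array of objects with the keys {keys}."
    )


def parse_json_array(reply):
    """Pull the outermost JSON array out of a reply that may include extra prose."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in model reply")
    return json.loads(reply[start:end + 1])


# A made-up reply of the kind a model might return, wrapped in extra text.
reply = 'Here is the data:\n[{"month": "Jan", "sales": 120}, {"month": "Feb", "sales": 95}]'
points = parse_json_array(reply)
print(points[1]["sales"])  # prints: 95
```

Values read off a chart by a model are estimates, so it is worth spot-checking a few points against the original image.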

OCR Replacement

Read handwritten notes, receipts, whiteboards, or printed forms. Modern VLMs handle varied fonts, angles, and mixed languages better than many dedicated OCR tools — and they can reason about the content simultaneously.

Vision-Based Coding Help

Screenshot a UI, error dialog, or code snippet and ask the model to explain it or suggest fixes. Useful when you can't easily copy-paste code from a screenshot or locked-down application.

Speech Transcription

OpenAI Whisper runs locally (via faster-whisper or whisper.cpp) and transcribes audio with accuracy close to that of cloud transcription services. Combine it with an LLM for meeting summaries or voice-controlled local AI assistants.
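A minimal transcription sketch using faster-whisper might look like the following. The model size, the CPU/int8 settings, and the helper names are assumptions chosen as conservative defaults; the formatting helper is demonstrated on made-up segments so the example runs without the library or a model download.

```python
from types import SimpleNamespace


def format_timestamp(seconds):
    """Render seconds as MM:SS for a readable transcript."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_transcript(segments):
    """Turn segments with .start and .text attributes into timestamped lines."""
    return "\n".join(
        f"[{format_timestamp(seg.start)}] {seg.text.strip()}" for seg in segments
    )


def transcribe(audio_path, model_size="small"):
    """Transcribe a file with faster-whisper (pip install faster-whisper)."""
    from faster_whisper import WhisperModel  # imported lazily: optional dependency

    # CPU with int8 is an assumed conservative default; try device="cuda" on an RTX card.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return format_transcript(segments)


# Demo with made-up segments, so this runs without the model installed.
demo = [SimpleNamespace(start=0.0, text=" Welcome everyone."),
        SimpleNamespace(start=65.4, text=" First agenda item.")]
print(format_transcript(demo))
```

The timestamped text returned by transcribe can then be passed to a local LLM as the input for a meeting summary.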

How this maps to your RTX / Spark

Llama 3.2 Vision 11B and Qwen2-VL 7B both fit comfortably in 12–16 GB VRAM, the sweet spot for RTX 4070 through RTX 4080 Super cards. Llama 3.2 Vision 90B needs 55+ GB and belongs to DGX Spark territory, where 128 GB of unified memory holds it with room to spare. Nemotron Nano Omni 8B is specifically tuned for RTX GPUs with CUDA acceleration, making it a natural first choice for NVIDIA hardware owners wanting to experiment with omni-modal capabilities.
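To check which models suit a given card, the VRAM column of the model table can be turned into a small lookup. This is an illustrative sketch: the figures are the approximate Q4 values from that table (upper end of each range, or the stated minimum for the "+" entries), and the 2 GB headroom for context and image tokens is an assumed safety margin, not a measured requirement.

```python
# Approximate Q4 VRAM needs in GB, taken from the model table in this guide.
MODELS = [
    ("Llama 3.2 Vision 11B", 10),
    ("Llama 3.2 Vision 90B", 55),
    ("Qwen2-VL 7B", 7),
    ("Qwen2-VL 72B", 45),
    ("Nemotron Nano Omni 8B", 8),
    ("LLaVA 1.6 13B", 11),
    ("MiniCPM-V 2.6", 7),
]


def models_that_fit(vram_gb, headroom_gb=2):
    """Return model names whose estimate leaves headroom_gb free."""
    budget = vram_gb - headroom_gb
    return [name for name, need in MODELS if need <= budget]


print(models_that_fit(12))        # models for a 12 GB card
print(len(models_that_fit(128)))  # prints: 7
```

With the assumed headroom, a 12 GB card covers the 7B to 11B models but not LLaVA 1.6 13B, while 128 GB of unified memory covers every entry.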