What is Local AI?
AI used to live in the cloud. Now it can live on your machine — privately, offline, and fast.
The old way: everything in the cloud
When you type a question into ChatGPT or generate an image on Midjourney, your request travels to a data center thousands of miles away. A powerful server runs the AI model, sends the answer back, and logs your interaction along the way. You never see the model itself — you're renting access to it.
This works well, but it has real trade-offs: your prompts leave your machine, you need an internet connection, and you're paying per query indefinitely.
The new way: AI on your own hardware
Local AI means running the model yourself — on your laptop, desktop, or workstation. The model weights (the file that encodes the AI's knowledge) live on your drive. Your GPU does the computation. Nothing leaves your machine.
Why it matters
Privacy
Your prompts never leave your machine. Ideal for sensitive documents, personal notes, or proprietary code.
Low latency
No network round-trip to a server. When the model fits in VRAM, the first token can appear within a fraction of a second.
Offline
Works on a plane, at a remote site, or anywhere without reliable internet.
Cost over time
No per-token or per-image fees. After the hardware purchase, the main ongoing cost is electricity.
Control
Fine-tune, swap models, change settings — your hardware, your rules.
What does "running a model" actually mean?
An AI model is a very large file, often 4–80 GB, containing billions of numbers (called parameters). When you run inference, your GPU reads those numbers from memory and performs billions of mathematical operations to produce each word or pixel.
The key constraint is VRAM — the memory on your graphics card. The entire model needs to fit there to run at full speed. That's why GPU memory size is the first question in any "can I run this?" conversation.
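As a back-of-the-envelope sketch of this sizing question, the weights alone take roughly (number of parameters × bits per parameter ÷ 8) bytes. The function names and the 20% headroom figure below are illustrative assumptions, not measured values; real memory use also depends on context length and on the runtime you choose.

```python
def weights_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def fits_in_vram(num_params: float, bits_per_param: float,
                 vram_gb: float, headroom: float = 0.20) -> bool:
    """Return True if the weights fit while leaving `headroom` of VRAM free.

    The headroom stands in for the KV cache, activations, and runtime
    overhead. 20% is an assumed placeholder, not a measured figure.
    """
    return weights_gb(num_params, bits_per_param) <= vram_gb * (1 - headroom)


print(weights_gb(7e9, 4))          # 7B model at 4-bit
print(weights_gb(70e9, 16))        # 70B model at 16-bit
print(fits_in_vram(7e9, 4, 8))     # 7B at 4-bit on an 8 GB card
print(fits_in_vram(70e9, 16, 24))  # 70B at 16-bit on a 24 GB card
```

By this estimate, a 7B model quantized to 4 bits needs about 3.5 GB for its weights, while a 70B model at 16-bit precision needs about 140 GB, far more than any consumer card offers.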
What can you actually run?
Consumer GPUs are now genuinely capable of running impressive AI. Here's a rough guide to what's possible today:
| VRAM | What runs comfortably |
|---|---|
| 8 GB | 7–8B language models (4-bit quantized) and Stable Diffusion image generation (SD 1.5, SDXL with optimizations). Not enough headroom for video. |
| 12 GB | 8–13B LLMs (quantized) and Flux image generation. Comfortable for most image workflows. |
| 16 GB | 13B LLMs at higher quality, Flux comfortably, plus short, low-resolution video clips (e.g. LTX-Video). |
| 24 GB | Up to ~32B LLMs (quantized), high-quality image work, and mainstream local video (HunyuanVideo, Wan 2.x). |
| 32 GB | 32B LLMs comfortably; 70B only with aggressive quantization. Smooth video generation and longer context windows. |
| 128 GB (unified memory) | DGX Spark territory: 70B models at FP8/INT8, up to ~200B quantized, fine-tuning, and very long context. |
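The language-model rows of the table can be sanity-checked with simple arithmetic. The sketch below inverts the size calculation: given a memory budget, it estimates the largest parameter count whose weights would fit. The helper name and the 20% headroom are assumptions made for illustration. The result is an upper bound on weights only, so the table's figures are deliberately more conservative.

```python
def max_params_billions(vram_gb: float, bits_per_param: float = 4,
                        headroom: float = 0.20) -> float:
    """Rough upper bound on model size (billions of parameters) for a memory budget.

    `headroom` is the fraction of memory assumed to be reserved for the
    KV cache, activations, and runtime overhead (an assumed placeholder).
    """
    usable_bytes = vram_gb * 1e9 * (1 - headroom)
    bytes_per_param = bits_per_param / 8
    return usable_bytes / bytes_per_param / 1e9


for vram in (8, 12, 16, 24, 32, 128):
    print(f"{vram:>3} GB -> ~{max_params_billions(vram):.0f}B parameters at 4-bit")
```

Under these assumptions a 24 GB card tops out near 38B parameters at 4-bit and 128 GB lands near 200B, which is in line with the "~32B" and "~200B quantized" entries above once extra room for long contexts is taken into account.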
A PC with a modern GeForce RTX GPU is a capable local AI workstation. An RTX 4060 with 8 GB VRAM can already run a 7B language model and generate images with Stable Diffusion, all offline and all private.
For researchers and power users who need to run 70B+ models at FP8/INT8 precision rather than aggressive 4-bit quantization, or who want to fine-tune their own models, the DGX Spark's 128 GB of unified memory changes the equation entirely. It runs models no single consumer GPU can hold.