The Layered Stack

Local AI software follows a clear layered architecture. At the bottom are the raw model files. Runners load and serve those models. UIs and developer tools sit on top of the runners and provide the interface you actually use. NVIDIA-specific tools integrate at the runner layer to maximize hardware utilization.

Chat UIs Open WebUI AnythingLLM ยท Jan Dev / Agent Tools Continue ยท Cursor Claude Code NVIDIA Stack TensorRT-LLM ยท NIM G-Assist Model Runners Ollama ยท LM Studio ยท llama.cpp ยท vLLM ยท ComfyUI ยท A1111 / Forge Model Files GGUF ยท SafeTensors ยท Diffusers weights (Hugging Face / Civitai) NVIDIA GPU (CUDA / Tensor Cores)

The local AI stack, from GPU hardware up to user-facing tools

LLM Runners

Ollama

CLI / API Server

The dominant one-command model runner. ollama pull llama3.1 downloads; ollama run llama3.1 starts a session. Exposes an OpenAI-compatible REST API at localhost:11434. Manages model storage and automatically uses CUDA on NVIDIA GPUs. Best starting point for most users.

LM Studio

GUI Runner

A polished desktop app with a built-in model browser, in-app chat, and local server mode. Handles GGUF model management with a visual interface. Ideal for users who prefer clicking over typing. Shows GPU VRAM usage live.

llama.cpp

Low-Level Engine

The foundational C++ inference engine that powers Ollama and LM Studio under the hood. Run it directly for maximum control โ€” custom quantization schemes, CPU offloading, and embedding generation. Essential for developers building their own applications.

vLLM

High-Throughput Server

Designed for serving models at scale with continuous batching and PagedAttention for efficient KV cache management. Dramatically higher throughput than llama.cpp for multi-user workloads. Best for self-hosted team deployments or production endpoints.

Image & Video Generation

ComfyUI

Node-Based GUI

A node-graph workflow editor for Stable Diffusion and related diffusion models. Extraordinarily flexible โ€” you can build custom pipelines, chain models, add ControlNet, LoRA, and custom nodes. The preferred tool for power users and automation. Steep but rewarding learning curve.

Automatic1111 (A1111)

Web UI

The original Stable Diffusion web interface. Massive extension ecosystem, familiar to millions of users, and packed with features: inpainting, outpainting, img2img, scripts, and more. Somewhat slower than ComfyUI for complex workflows but much easier to get started with.

Forge

A1111 Fork

A performance-optimized fork of A1111 with significantly improved memory efficiency and speed on NVIDIA GPUs, especially for SDXL and newer architectures. Drop-in replacement for A1111 extensions; a good upgrade for RTX users who find A1111 slow.

Chat UIs

Open WebUI

Browser Chat

A self-hosted, feature-rich web interface in the style of ChatGPT. Connects to Ollama and any OpenAI-compatible backend. Supports conversation history, system prompts, multi-user with accounts, image uploads, and model switching. Deploy via Docker in minutes.

AnythingLLM

RAG + Chat App

A desktop application for building private RAG pipelines. Drag in documents to create workspaces, ask questions across your files, and get cited answers. Supports agents, web search, and multiple LLM backends including Ollama and LM Studio.

Jan

Desktop App

A minimal, privacy-first desktop LLM client. Manages model downloads, runs everything locally via llama.cpp, and optionally exposes a local API server. Clean UI with no cloud dependencies โ€” models and conversations stay on your machine.

Developer & Agent Tools

Continue

VS Code / JetBrains

An open-source IDE extension that brings AI code completion, chat, and editing to VS Code and JetBrains IDEs. Points to any OpenAI-compatible backend โ€” connect it to your local Ollama server for a fully private GitHub Copilot alternative with zero data leaving your machine.

Cursor

AI Code Editor

A VS Code fork with deep AI integration built in. Supports custom model endpoints including local Ollama servers via its API settings. Offers multi-file context, codebase indexing, and agent-style edits โ€” with local models, your code never touches a cloud provider.

Claude Code

CLI Agent

Anthropic's terminal-native agentic coding tool. Can be configured to route through a local Ollama-compatible endpoint for privacy-sensitive projects or offline environments, while retaining its powerful multi-step coding and file-editing capabilities.

NVIDIA-Specific Stack

TensorRT-LLM

Inference Optimizer

NVIDIA's open-source library for compiling and optimizing LLMs specifically for NVIDIA GPUs. Applies techniques like FP8 quantization, in-flight batching, and kernel fusion to squeeze maximum tokens/sec from RTX and data center hardware. Delivers 2โ€“5ร— throughput improvement over naive PyTorch inference.

NIM Microservices

Optimized Model Containers

NVIDIA Inference Microservices are pre-packaged, GPU-optimized containers for popular models. Download a NIM and get an OpenAI-compatible API endpoint with TensorRT-LLM baked in. Fastest path to production-quality inference on NVIDIA hardware โ€” no manual optimization required.

G-Assist

GeForce AI Assistant

NVIDIA's AI assistant for GeForce PCs. Runs locally on RTX 30/40/50 series GPUs and provides in-game help, system optimization suggestions, and general AI chat โ€” all without a cloud subscription. A consumer-facing showcase of on-device AI capability.

How this maps to your RTX / Spark

Every tool in this list supports NVIDIA CUDA automatically โ€” no configuration needed. Ollama, LM Studio, and ComfyUI all detect your RTX GPU and use it by default. The NVIDIA-specific tools (TensorRT-LLM, NIM) layer on top to unlock the full potential of Tensor Cores and FP8 compute on RTX 50 series. On DGX Spark, vLLM and TensorRT-LLM are the natural choices for maximum throughput when serving 70B+ models to multiple users simultaneously.