The Tools Landscape
A map of the ecosystem โ from model runners to chat UIs to NVIDIA-specific accelerators.
The Layered Stack
Local AI software follows a clear layered architecture. At the bottom are the raw model files. Runners load and serve those models. UIs and developer tools sit on top of the runners and provide the interface you actually use. NVIDIA-specific tools integrate at the runner layer to maximize hardware utilization.
The local AI stack, from GPU hardware up to user-facing tools
LLM Runners
Ollama
CLI / API ServerThe dominant one-command model runner. ollama pull llama3.1 downloads; ollama run llama3.1 starts a session. Exposes an OpenAI-compatible REST API at localhost:11434. Manages model storage and automatically uses CUDA on NVIDIA GPUs. Best starting point for most users.
LM Studio
GUI RunnerA polished desktop app with a built-in model browser, in-app chat, and local server mode. Handles GGUF model management with a visual interface. Ideal for users who prefer clicking over typing. Shows GPU VRAM usage live.
llama.cpp
Low-Level EngineThe foundational C++ inference engine that powers Ollama and LM Studio under the hood. Run it directly for maximum control โ custom quantization schemes, CPU offloading, and embedding generation. Essential for developers building their own applications.
vLLM
High-Throughput ServerDesigned for serving models at scale with continuous batching and PagedAttention for efficient KV cache management. Dramatically higher throughput than llama.cpp for multi-user workloads. Best for self-hosted team deployments or production endpoints.
Image & Video Generation
ComfyUI
Node-Based GUIA node-graph workflow editor for Stable Diffusion and related diffusion models. Extraordinarily flexible โ you can build custom pipelines, chain models, add ControlNet, LoRA, and custom nodes. The preferred tool for power users and automation. Steep but rewarding learning curve.
Automatic1111 (A1111)
Web UIThe original Stable Diffusion web interface. Massive extension ecosystem, familiar to millions of users, and packed with features: inpainting, outpainting, img2img, scripts, and more. Somewhat slower than ComfyUI for complex workflows but much easier to get started with.
Forge
A1111 ForkA performance-optimized fork of A1111 with significantly improved memory efficiency and speed on NVIDIA GPUs, especially for SDXL and newer architectures. Drop-in replacement for A1111 extensions; a good upgrade for RTX users who find A1111 slow.
Chat UIs
Open WebUI
Browser ChatA self-hosted, feature-rich web interface in the style of ChatGPT. Connects to Ollama and any OpenAI-compatible backend. Supports conversation history, system prompts, multi-user with accounts, image uploads, and model switching. Deploy via Docker in minutes.
AnythingLLM
RAG + Chat AppA desktop application for building private RAG pipelines. Drag in documents to create workspaces, ask questions across your files, and get cited answers. Supports agents, web search, and multiple LLM backends including Ollama and LM Studio.
Jan
Desktop AppA minimal, privacy-first desktop LLM client. Manages model downloads, runs everything locally via llama.cpp, and optionally exposes a local API server. Clean UI with no cloud dependencies โ models and conversations stay on your machine.
Developer & Agent Tools
Continue
VS Code / JetBrainsAn open-source IDE extension that brings AI code completion, chat, and editing to VS Code and JetBrains IDEs. Points to any OpenAI-compatible backend โ connect it to your local Ollama server for a fully private GitHub Copilot alternative with zero data leaving your machine.
Cursor
AI Code EditorA VS Code fork with deep AI integration built in. Supports custom model endpoints including local Ollama servers via its API settings. Offers multi-file context, codebase indexing, and agent-style edits โ with local models, your code never touches a cloud provider.
Claude Code
CLI AgentAnthropic's terminal-native agentic coding tool. Can be configured to route through a local Ollama-compatible endpoint for privacy-sensitive projects or offline environments, while retaining its powerful multi-step coding and file-editing capabilities.
NVIDIA-Specific Stack
TensorRT-LLM
Inference OptimizerNVIDIA's open-source library for compiling and optimizing LLMs specifically for NVIDIA GPUs. Applies techniques like FP8 quantization, in-flight batching, and kernel fusion to squeeze maximum tokens/sec from RTX and data center hardware. Delivers 2โ5ร throughput improvement over naive PyTorch inference.
NIM Microservices
Optimized Model ContainersNVIDIA Inference Microservices are pre-packaged, GPU-optimized containers for popular models. Download a NIM and get an OpenAI-compatible API endpoint with TensorRT-LLM baked in. Fastest path to production-quality inference on NVIDIA hardware โ no manual optimization required.
G-Assist
GeForce AI AssistantNVIDIA's AI assistant for GeForce PCs. Runs locally on RTX 30/40/50 series GPUs and provides in-game help, system optimization suggestions, and general AI chat โ all without a cloud subscription. A consumer-facing showcase of on-device AI capability.
Every tool in this list supports NVIDIA CUDA automatically โ no configuration needed. Ollama, LM Studio, and ComfyUI all detect your RTX GPU and use it by default. The NVIDIA-specific tools (TensorRT-LLM, NIM) layer on top to unlock the full potential of Tensor Cores and FP8 compute on RTX 50 series. On DGX Spark, vLLM and TensorRT-LLM are the natural choices for maximum throughput when serving 70B+ models to multiple users simultaneously.