LLMs for Everyday Use — Local AI Guide

What You Can Actually Do

Running an LLM locally is not just a hobby project — it replaces real daily workflows. Once a model is downloaded, every prompt runs on your hardware, stays on your machine, and costs nothing per query. Here are the most practical everyday uses.

Core Use Cases

General Chat & Q&A

Open a chat interface, ask a question, get an answer. Local LLMs like Llama 3.1 8B or Gemma 2 9B are competent conversationalists for research, brainstorming, explaining concepts, and drafting text. They run at 40–80 tokens/sec on a mid-range RTX GPU — fast enough that responses feel instant.

Coding Assistance

Code-specialized models like Qwen2.5-Coder 32B, DeepSeek-Coder V2, and StarCoder2 are trained specifically on code. They can complete functions, explain error messages, refactor logic, and generate boilerplate. Larger coding models (32B+) beat smaller general models significantly on benchmarks like HumanEval and SWE-bench.

Tools like Continue (VS Code / JetBrains extension) and Cursor (AI-native editor) connect directly to your local Ollama server via the OpenAI-compatible API endpoint, so you get IDE-integrated completions with zero data leaving your machine. Claude Code can also be pointed at a local Ollama backend for privacy-sensitive projects.

Writing & Summarization

Paste a long article, meeting transcript, or PDF text and ask for a summary, a list of action items, or a rewritten version in a different tone. A 7B model handles most summarization tasks well; 13B+ models produce noticeably more nuanced rewrites. Long-context models (with 32K–128K context windows) can process entire reports in a single prompt.

RAG Over Personal Files

Retrieval-Augmented Generation lets you "chat with" your own documents. A RAG system indexes your files as vector embeddings, retrieves the most relevant chunks when you ask a question, and feeds them to the LLM as context. The result is a private search engine that answers in natural language — useful for research notes, codebases, legal docs, or knowledge bases. Tools like AnythingLLM package this into a point-and-click workflow.

Tools for Running Local LLMs

Ollama

CLI Runner

The easiest way to download and run models from the command line. One command — ollama run llama3.1 — downloads and starts a model. Exposes an OpenAI-compatible REST API on port 11434 for integration with UIs and dev tools. Works on macOS, Linux, and Windows.

LM Studio

GUI + Runner

A desktop application with a model browser (backed by Hugging Face), in-app chat, and a built-in local server. Ideal for users who prefer a GUI over the command line. Handles quantized GGUF models automatically and shows VRAM usage in real time.

Open WebUI

Browser UI

A self-hosted ChatGPT-style web interface that connects to Ollama or any OpenAI-compatible backend. Supports conversation history, system prompts, model switching, image uploads (for multimodal models), and multi-user setups. Run it locally via Docker or pip.

AnythingLLM

RAG + Chat

A full-featured desktop app for building private RAG pipelines. Drag in PDFs, Word docs, or web URLs, and it indexes them into a local vector database. Connect to Ollama, LM Studio, or cloud APIs. Supports workspaces, agents, and multi-document retrieval.

Jan

Desktop App

A privacy-first desktop LLM client that downloads and manages models locally. Minimalist interface with a focus on keeping everything on-device. Supports GGUF models via llama.cpp under the hood and offers an OpenAI-compatible local API server mode.

Model Recommendations by Task

Task	Recommended Size	Example Models	Why
General chat / Q&A	7–8B	Llama 3.1 8B, Gemma 2 9B, Mistral 7B	Fast, fits in 6–8 GB VRAM, good quality for everyday tasks
Coding assistance	13–32B	Qwen2.5-Coder 32B, DeepSeek-Coder V2 16B	Larger context and code-specific training pay off significantly for complex code
RAG / document Q&A	7–13B	Llama 3.1 8B, Phi-3 Medium 14B	Retrieval provides the facts; model just needs good instruction-following
Long document summarization	70B or high-context model	Llama 3.1 70B, Mistral Large 2 (123B)	Large context window + strong reasoning handles multi-chapter documents
Writing & editing	13–32B	Command R+ 104B (quantized), Qwen2.5 32B	Nuanced tone, style awareness, and coherence improve with scale
Quick offline lookup	3–4B	Phi-3 Mini 3.8B, Gemma 2 2B	Sub-second response; fits in CPU RAM if needed

Tip: Start with 7–8B, move up as needed. A Q4-quantized 8B model uses ~5 GB of VRAM and answers most everyday questions well. Only upsize when you notice quality problems — larger models are slower and need more VRAM, so there is a real tradeoff.

Setting Up Your First Local Chat

The fastest path to a working local LLM in three steps:

Install Ollama from ollama.com. It installs a background service and CLI tool.
Pull a model: run ollama pull llama3.1:8b in a terminal. The model downloads to your local cache (~5 GB).
Chat: run ollama run llama3.1:8b and start typing. Or install Open WebUI for a browser-based chat interface at localhost:3000.

For coding integration, install the Continue VS Code extension and point it at http://localhost:11434 (Ollama's default API address). You now have a private GitHub Copilot alternative.

How this maps to your RTX / Spark

RTX 30 and 40 series cards with 8–16 GB VRAM run 7B–13B models smoothly at 30–80 tokens/sec — enough for real-time chat. The RTX 4090 (24 GB) and RTX 5090 (32 GB) push into 32B territory comfortably, handling Qwen2.5-Coder 32B Q4 with headroom to spare. For 70B models at usable speed, the DGX Spark's 128 GB unified memory pool makes them feel like 8B models do on an RTX — always fast, never cramped.