LLMs for Everyday Use
Chat with your own AI, get coding help, summarize documents, and search your files — all offline.
What You Can Actually Do
Running an LLM locally is not just a hobby project — it replaces real daily workflows. Once a model is downloaded, every prompt runs on your hardware, stays on your machine, and costs nothing per query. Here are the most practical everyday uses.
Core Use Cases
General Chat & Q&A
Open a chat interface, ask a question, get an answer. Local LLMs like Llama 3.1 8B or Gemma 2 9B are competent conversationalists for research, brainstorming, explaining concepts, and drafting text. They run at 40–80 tokens/sec on a mid-range RTX GPU — fast enough that responses feel instant.
Coding Assistance
Code-specialized models like Qwen2.5-Coder 32B, DeepSeek-Coder V2, and StarCoder2 are trained specifically on code. They can complete functions, explain error messages, refactor logic, and generate boilerplate. Larger coding models (32B+) beat smaller general models significantly on benchmarks like HumanEval and SWE-bench.
Tools like Continue (VS Code / JetBrains extension) and Cursor (AI-native editor) connect directly to your local Ollama server via the OpenAI-compatible API endpoint, so you get IDE-integrated completions with zero data leaving your machine. Claude Code can also be pointed at a local Ollama backend for privacy-sensitive projects.
Writing & Summarization
Paste a long article, meeting transcript, or PDF text and ask for a summary, a list of action items, or a rewritten version in a different tone. A 7B model handles most summarization tasks well; 13B+ models produce noticeably more nuanced rewrites. Long-context models (with 32K–128K context windows) can process entire reports in a single prompt.
RAG Over Personal Files
Retrieval-Augmented Generation lets you "chat with" your own documents. A RAG system indexes your files as vector embeddings, retrieves the most relevant chunks when you ask a question, and feeds them to the LLM as context. The result is a private search engine that answers in natural language — useful for research notes, codebases, legal docs, or knowledge bases. Tools like AnythingLLM package this into a point-and-click workflow.
Tools for Running Local LLMs
Ollama
CLI RunnerThe easiest way to download and run models from the command line. One command — ollama run llama3.1 — downloads and starts a model. Exposes an OpenAI-compatible REST API on port 11434 for integration with UIs and dev tools. Works on macOS, Linux, and Windows.
LM Studio
GUI + RunnerA desktop application with a model browser (backed by Hugging Face), in-app chat, and a built-in local server. Ideal for users who prefer a GUI over the command line. Handles quantized GGUF models automatically and shows VRAM usage in real time.
Open WebUI
Browser UIA self-hosted ChatGPT-style web interface that connects to Ollama or any OpenAI-compatible backend. Supports conversation history, system prompts, model switching, image uploads (for multimodal models), and multi-user setups. Run it locally via Docker or pip.
AnythingLLM
RAG + ChatA full-featured desktop app for building private RAG pipelines. Drag in PDFs, Word docs, or web URLs, and it indexes them into a local vector database. Connect to Ollama, LM Studio, or cloud APIs. Supports workspaces, agents, and multi-document retrieval.
Jan
Desktop AppA privacy-first desktop LLM client that downloads and manages models locally. Minimalist interface with a focus on keeping everything on-device. Supports GGUF models via llama.cpp under the hood and offers an OpenAI-compatible local API server mode.
Model Recommendations by Task
| Task | Recommended Size | Example Models | Why |
|---|---|---|---|
| General chat / Q&A | 7–8B | Llama 3.1 8B, Gemma 2 9B, Mistral 7B | Fast, fits in 6–8 GB VRAM, good quality for everyday tasks |
| Coding assistance | 13–32B | Qwen2.5-Coder 32B, DeepSeek-Coder V2 16B | Larger context and code-specific training pay off significantly for complex code |
| RAG / document Q&A | 7–13B | Llama 3.1 8B, Phi-3 Medium 14B | Retrieval provides the facts; model just needs good instruction-following |
| Long document summarization | 70B or high-context model | Llama 3.1 70B, Mistral Large 2 (123B) | Large context window + strong reasoning handles multi-chapter documents |
| Writing & editing | 13–32B | Command R+ 104B (quantized), Qwen2.5 32B | Nuanced tone, style awareness, and coherence improve with scale |
| Quick offline lookup | 3–4B | Phi-3 Mini 3.8B, Gemma 2 2B | Sub-second response; fits in CPU RAM if needed |
Setting Up Your First Local Chat
The fastest path to a working local LLM in three steps:
- Install Ollama from
ollama.com. It installs a background service and CLI tool. - Pull a model: run
ollama pull llama3.1:8bin a terminal. The model downloads to your local cache (~5 GB). - Chat: run
ollama run llama3.1:8band start typing. Or install Open WebUI for a browser-based chat interface atlocalhost:3000.
For coding integration, install the Continue VS Code extension and point it at http://localhost:11434 (Ollama's default API address). You now have a private GitHub Copilot alternative.
RTX 30 and 40 series cards with 8–16 GB VRAM run 7B–13B models smoothly at 30–80 tokens/sec — enough for real-time chat. The RTX 4090 (24 GB) and RTX 5090 (32 GB) push into 32B territory comfortably, handling Qwen2.5-Coder 32B Q4 with headroom to spare. For 70B models at usable speed, the DGX Spark's 128 GB unified memory pool makes them feel like 8B models do on an RTX — always fast, never cramped.