Six modules — from "why local" to a fully tuned local chatbot
1Why Run AI Locally
Understanding the tradeoffs before you install anything.
Local AI vs. Cloud AI
Same idea — run inference — in two very different places. Neither is universally "better"; they trade off differently depending on what you need.
- Runs on someone else's servers (OpenAI, Anthropic, Google)
- Access via an API call or web app over the internet
- Scales to huge models (100B+ params) with no local hardware
- Usually metered — you pay per token or per subscription
- Your prompts and data pass through a third party's servers
- Runs entirely on your own laptop or desktop
- Access via a CLI (Ollama) or desktop app (LM Studio)
- Limited to smaller models your hardware can actually hold
- Free to run once downloaded — no per-token billing
- Nothing leaves your machine — fully private by default
When Local Models Make Sense
Local isn't "better" or "worse" than cloud — it's the right tool for specific situations.
🔒 Privacy-Sensitive Work
Legal, medical, or proprietary business data that should never leave your device.
📡 Offline / Unreliable Internet
Fieldwork, travel, or environments where you can't depend on a connection.
⚡ Low-Latency Needs
No network round-trip — responses start generating immediately.
💰 Cost at Scale
Running thousands of requests without per-token API charges.
🧪 Experimentation & Learning
Freely test prompts, fine-tuning, and parameters without burning credits.
🎯 Custom / Fine-Tuned Models
Running a model you've fine-tuned on your own data or task.
🧩 Puzzle: Local or Cloud?
Select a scenario chip below, then click the bucket where it belongs.
2Choosing the Right Model
Matching parameter count and quantization to the machine you actually have.
What Does "1.2B" Actually Mean?
The number refers to parameters — the learned weights inside the model. A parameter is one learned number (a weight or bias) inside the network. More parameters generally means more capacity to capture patterns — but also more memory and compute required to run.
| Params | Tier | Typical Hardware | Examples |
|---|---|---|---|
| ~1–3B | Tiny | Runs on almost any laptop, even CPU-only | LFM2-1.2B, Qwen2.5-1.5B |
| ~7–9B | Small | Comfortable on a mid-range laptop with 16GB RAM | Llama 3.1 8B, Gemma 2 9B |
| ~13–34B | Medium | Needs a dedicated GPU with 16GB+ VRAM for good speed | Mixtral, Qwen2.5 32B |
| 70B+ | Large | Multi-GPU workstation or server territory | Llama 3.1 70B, DeepSeek 67B |
Quantization: Shrinking Models to Fit
Quantization stores each weight with fewer bits — smaller file, less memory, faster inference, slight quality loss.
| Format | Bits / Weight | Relative Size | Notes |
|---|---|---|---|
| FP16 | 16 bits | 100% | Full precision. Highest quality, largest size. |
| Q8_0 | 8 bits | ~50% | Near-lossless quality at half the size. |
| Q4_K_M | ~4.5 bits | ~28% | The sweet spot most people use locally. |
| Q2_K | ~2.5 bits | ~16% | Very small, but noticeable quality loss. |
CPU vs. GPU Inference
- Works on any machine — no special hardware needed
- Uses system RAM, usually more plentiful than VRAM
- Slower token generation, especially on larger models
- The default fallback in both Ollama and LM Studio
Best for: small models (1–4B) or machines without a dedicated GPU.
- Needs a compatible GPU (NVIDIA CUDA, Apple Metal, AMD ROCm)
- Model must fit in VRAM — the GPU's dedicated fast memory
- Dramatically faster token generation via massive parallelism
- Both tools auto-detect and offload layers to GPU when available
Best for: 7B+ models, or fast, snappy responses for daily use.
🛠️ Interactive Tool: Hardware Matchmaker
Drag the sliders to describe your machine — we'll recommend a model tier and quantization to start with.
3Ollama & LM Studio
The two most popular ways to run models locally — one CLI, one GUI.
A lightweight, command-line-first runtime
- Runs as a background service; interact via terminal or its API
- Ships models as versioned "model names" — pull like Docker images
- Uses a Modelfile system to configure prompts & parameters
- Exposes a local REST API (default port 11434)
- Extremely fast to get started: install → one command → chatting
| Interface | Command line + local API |
| Platforms | macOS, Windows, Linux |
| Best for | Developers, scripting, automation |
| License | Free & open-source |
A polished desktop app for local models
- A full graphical app — browse, download, chat visually
- Built-in model search with hardware compatibility indicators
- Chat UI with sliders for temperature, top-K, top-P, context
- Can run a local server mimicking the OpenAI API format
- Great for beginners who want to see and tune everything
| Interface | Desktop GUI + optional local API |
| Platforms | macOS, Windows, Linux |
| Best for | Beginners, visual experimentation |
| License | Free to use |
Side by Side
| Ollama | LM Studio | |
|---|---|---|
| Interface | Command line | Graphical desktop app |
| Learning curve | A little steeper — needs terminal comfort | Very beginner-friendly |
| Parameter tuning | Via Modelfile or API request | Live sliders in the chat UI |
| Scripting / automation | Excellent — built for it | Possible via its local server |
| Model discovery | ollama.com library, pull by name | Built-in search with compatibility hints |
| Resource overhead | Minimal, runs as a lean service | Slightly heavier (full desktop app) |
▶️ Try It: Installing & Setting Up Ollama
4Hands-On: LFM2-1.2B
Downloading and talking to our test model in both tools.
LFM2 is LiquidAI's "Liquid Foundation Model" family, built around a hybrid architecture designed for fast, efficient inference on everyday hardware — not just data-center GPUs.
| Parameters | 1.2 billion |
| Typical size (Q4) | ~700 MB – 1 GB |
| Minimum RAM | ~2 GB free |
| Runs well on | CPU-only laptops |
| Good for | Chat, summarizing, quick drafting |
✔️ Interactive Checklist — Running It in Ollama
What Just Happened? (Puzzle)
You just sent your first local message. Put these five steps back in the correct order using the arrows.
5Generation Parameters Explained
The dials that shape how a model picks its next word.
At every step, the model outputs a probability for every possible next token. Temperature reshapes that distribution, top-K throws away everything except the K most likely tokens, and top-P keeps adding tokens until their probabilities add up to P. Most tools apply all three, in that order, every single token.
🎛️ Live Simulator: Temperature × Top-K × Top-P
The Full Pipeline (Puzzle)
Every token goes through these five stages, in order. Reorder them correctly.
Context Window: The Model's Working Memory
The maximum number of tokens (prompt + conversation + response) the model can "see" at once. Once this window fills up, the oldest tokens must be dropped to make room for new ones — longer conversations get "forgotten."
💬 Interactive: Context Window Explorer
Recommended Settings Cheat Sheet
| Task | Temperature | Top-K | Top-P |
|---|---|---|---|
| Factual Q&A / Coding | 0.1 – 0.3 | 20 – 40 | 0.8 – 0.9 |
| General Chat / Assistant | 0.5 – 0.7 | 40 – 60 | 0.9 – 0.95 |
| Creative Writing | 0.8 – 1.1 | 60 – 100 | 0.92 – 0.98 |
| Brainstorming / Ideation | 1.0 – 1.3 | 80 – 120 | 0.95 – 1.0 |
6Tips & Troubleshooting
The handful of issues almost everyone hits in their first week. Click a symptom to reveal the fix.
Course Recap
- Local AI trades cloud scale for privacy, offline access, and zero per-token cost.
- Model size (parameters) and quantization determine what fits on your machine.
- Ollama gives you a fast CLI workflow; LM Studio gives you a visual, beginner-friendly one.
- LFM2-1.2B is small enough to run comfortably on almost any laptop, CPU-only included.
- Temperature, top-K, and top-P are applied in sequence to shape how tokens get sampled.
- The context window is finite working memory — it fills up and older tokens get dropped.
Master Cheat Sheet
Every core reference table from the course, in one place.
Model Size → Hardware
| 1–2B | 4 GB RAM, any laptop |
| 7–9B | 16 GB RAM / RTX 3060+ |
| 13–14B | 24 GB, 12GB+ VRAM GPU |
| 30–34B | 32–64 GB, RTX 4090 / Apple Max |
| 70B+ | 64–128 GB, multi-GPU |
Quantization Ladder
| FP16 | 100% size · best quality |
| Q8_0 | ~50% size · near-lossless |
| Q4_K_M | ~28% size · sweet spot ✅ |
| Q2_K | ~16% size · visible quality loss |
Sampling Pipeline Order
Raw Probabilities → Apply Temperature → Apply Top-K → Apply Top-P → Sample
Ollama Essentials
ollama pull <model>
ollama
run <model>
ollama list
Glossary — Flip to Reveal
Click any card to flip it and see the definition.
Final Comprehensive Assessment
15 mixed questions spanning all six modules. Your live score is tracked at the top.
Resources & Links
Official docs and libraries mentioned in this course.