Running Small AI Models Locally — Ollama & LM Studio

Course Map

Six modules — from "why local" to a fully tuned local chatbot

01 Why Run AI Locally 02 Choosing the Right Model 03 Ollama & LM Studio 04 Hands-On: LFM2-1.2B 05 Generation Parameters 06 Tips & Troubleshooting

Module 1 · Why Local

1Why Run AI Locally

Understanding the tradeoffs before you install anything.

Local AI vs. Cloud AI

Same idea — run inference — in two very different places. Neither is universally "better"; they trade off differently depending on what you need.

CLOUD AI

Runs on someone else's servers (OpenAI, Anthropic, Google)
Access via an API call or web app over the internet
Scales to huge models (100B+ params) with no local hardware
Usually metered — you pay per token or per subscription
Your prompts and data pass through a third party's servers

LOCAL AI

Runs entirely on your own laptop or desktop
Access via a CLI (Ollama) or desktop app (LM Studio)
Limited to smaller models your hardware can actually hold
Free to run once downloaded — no per-token billing
Nothing leaves your machine — fully private by default

When Local Models Make Sense

Local isn't "better" or "worse" than cloud — it's the right tool for specific situations.

🔒 Privacy-Sensitive Work

Legal, medical, or proprietary business data that should never leave your device.

📡 Offline / Unreliable Internet

Fieldwork, travel, or environments where you can't depend on a connection.

⚡ Low-Latency Needs

No network round-trip — responses start generating immediately.

💰 Cost at Scale

Running thousands of requests without per-token API charges.

🧪 Experimentation & Learning

Freely test prompts, fine-tuning, and parameters without burning credits.

🎯 Custom / Fine-Tuned Models

Running a model you've fine-tuned on your own data or task.

KEY IDEA Local and cloud AI aren't rivals — many practitioners use both, reaching for whichever fits the moment.

🧩 Puzzle: Local or Cloud?

Select a scenario chip below, then click the bucket where it belongs.

Module 2 · Choosing a Model

2Choosing the Right Model

Matching parameter count and quantization to the machine you actually have.

What Does "1.2B" Actually Mean?

The number refers to parameters — the learned weights inside the model. A parameter is one learned number (a weight or bias) inside the network. More parameters generally means more capacity to capture patterns — but also more memory and compute required to run.

Params	Tier	Typical Hardware	Examples
~1–3B	Tiny	Runs on almost any laptop, even CPU-only	LFM2-1.2B, Qwen2.5-1.5B
~7–9B	Small	Comfortable on a mid-range laptop with 16GB RAM	Llama 3.1 8B, Gemma 2 9B
~13–34B	Medium	Needs a dedicated GPU with 16GB+ VRAM for good speed	Mixtral, Qwen2.5 32B
70B+	Large	Multi-GPU workstation or server territory	Llama 3.1 70B, DeepSeek 67B

Quantization: Shrinking Models to Fit

Quantization stores each weight with fewer bits — smaller file, less memory, faster inference, slight quality loss.

Format	Bits / Weight	Relative Size	Notes
FP16	16 bits	100%	Full precision. Highest quality, largest size.
Q8_0	8 bits	~50%	Near-lossless quality at half the size.
Q4_K_M	~4.5 bits	~28%	The sweet spot most people use locally.
Q2_K	~2.5 bits	~16%	Very small, but noticeable quality loss.

RULE OF THUMB Q4_K_M (or Q4_0) is the default most people should reach for — it keeps quality high while roughly quartering the memory footprint of the full model.

CPU vs. GPU Inference

CPU INFERENCE

Works on any machine — no special hardware needed
Uses system RAM, usually more plentiful than VRAM
Slower token generation, especially on larger models
The default fallback in both Ollama and LM Studio

Best for: small models (1–4B) or machines without a dedicated GPU.

GPU INFERENCE

Needs a compatible GPU (NVIDIA CUDA, Apple Metal, AMD ROCm)
Model must fit in VRAM — the GPU's dedicated fast memory
Dramatically faster token generation via massive parallelism
Both tools auto-detect and offload layers to GPU when available

Best for: 7B+ models, or fast, snappy responses for daily use.

🛠️ Interactive Tool: Hardware Matchmaker

Drag the sliders to describe your machine — we'll recommend a model tier and quantization to start with.

System RAM 16 GB

Dedicated GPU?

Module 3 · Tools

3Ollama & LM Studio

The two most popular ways to run models locally — one CLI, one GUI.

OLLAMA

A lightweight, command-line-first runtime

Runs as a background service; interact via terminal or its API
Ships models as versioned "model names" — pull like Docker images
Uses a Modelfile system to configure prompts & parameters
Exposes a local REST API (default port 11434)
Extremely fast to get started: install → one command → chatting

Interface	Command line + local API
Platforms	macOS, Windows, Linux
Best for	Developers, scripting, automation
License	Free & open-source

LM STUDIO

A polished desktop app for local models

A full graphical app — browse, download, chat visually
Built-in model search with hardware compatibility indicators
Chat UI with sliders for temperature, top-K, top-P, context
Can run a local server mimicking the OpenAI API format
Great for beginners who want to see and tune everything

Interface	Desktop GUI + optional local API
Platforms	macOS, Windows, Linux
Best for	Beginners, visual experimentation
License	Free to use

Side by Side

	Ollama	LM Studio
Interface	Command line	Graphical desktop app
Learning curve	A little steeper — needs terminal comfort	Very beginner-friendly
Parameter tuning	Via Modelfile or API request	Live sliders in the chat UI
Scripting / automation	Excellent — built for it	Possible via its local server
Model discovery	ollama.com library, pull by name	Built-in search with compatibility hints
Resource overhead	Minimal, runs as a lean service	Slightly heavier (full desktop app)

▶️ Try It: Installing & Setting Up Ollama

bash — setup

SETTING UP LM STUDIO (NO TERMINAL NEEDED) 1. Download from lmstudio.ai and install like any desktop app · 2. Search for a model by name — LM Studio flags what fits your hardware · 3. Pick Q4_K_M and click Download · 4. Open the Chat tab, load the model, and start typing.

Module 4 · Hands-On

4Hands-On: LFM2-1.2B

Downloading and talking to our test model in both tools.

MODEL CARD — liquid/lfm2.5-1.2b

LFM2 is LiquidAI's "Liquid Foundation Model" family, built around a hybrid architecture designed for fast, efficient inference on everyday hardware — not just data-center GPUs.

Parameters	1.2 billion
Typical size (Q4)	~700 MB – 1 GB
Minimum RAM	~2 GB free
Runs well on	CPU-only laptops
Good for	Chat, summarizing, quick drafting

✔️ Interactive Checklist — Running It in Ollama

IN LM STUDIO INSTEAD Search "lfm2.5-1.2b" → pick the Q4_K_M version (green fit indicator) → Download → open the Chat tab → select the model to load it → type your message. Sliders for temperature, top-K, top-P, and context length sit on the right.

What Just Happened? (Puzzle)

You just sent your first local message. Put these five steps back in the correct order using the arrows.

Module 5 · Parameters

5Generation Parameters Explained

The dials that shape how a model picks its next word.

At every step, the model outputs a probability for every possible next token. Temperature reshapes that distribution, top-K throws away everything except the K most likely tokens, and top-P keeps adding tokens until their probabilities add up to P. Most tools apply all three, in that order, every single token.

🎛️ Live Simulator: Temperature × Top-K × Top-P

Prompt: "The cat sat on the ___" — drag the sliders and watch the candidate pool reshape in real time.

Temperature 0.7

Top-K 5

Top-P 0.90

NUCLEUS SAMPLING Top-P is called "nucleus" sampling because the candidate pool shrinks automatically when the model is confident (one word dominates) and grows automatically when it's uncertain — top-K can't adapt like that.

The Full Pipeline (Puzzle)

Every token goes through these five stages, in order. Reorder them correctly.

FULLY DETERMINISTIC MODE Set temperature → 0, top-K → 1, or top-P → very low, and generation becomes fully deterministic — the same prompt always gives the same output.

Context Window: The Model's Working Memory

The maximum number of tokens (prompt + conversation + response) the model can "see" at once. Once this window fills up, the oldest tokens must be dropped to make room for new ones — longer conversations get "forgotten."

💬 Interactive: Context Window Explorer

Window size fixed at 4,096 tokens · ~250 tokens per exchange. Drag to simulate a growing conversation.

Messages exchanged so far 6

Recommended Settings Cheat Sheet

Task	Temperature	Top-K	Top-P
Factual Q&A / Coding	0.1 – 0.3	20 – 40	0.8 – 0.9
General Chat / Assistant	0.5 – 0.7	40 – 60	0.9 – 0.95
Creative Writing	0.8 – 1.1	60 – 100	0.92 – 0.98
Brainstorming / Ideation	1.0 – 1.3	80 – 120	0.95 – 1.0

FOR LFM2-1.2B SPECIFICALLY Start with temperature 0.6–0.7, top-K 40, top-P 0.9 for general use. Being a smaller model, it benefits from staying a bit more conservative than you might with a larger 7B+ model.

Module 6 · Troubleshooting

6Tips & Troubleshooting

The handful of issues almost everyone hits in their first week. Click a symptom to reveal the fix.

Course Recap

Local AI trades cloud scale for privacy, offline access, and zero per-token cost.
Model size (parameters) and quantization determine what fits on your machine.
Ollama gives you a fast CLI workflow; LM Studio gives you a visual, beginner-friendly one.
LFM2-1.2B is small enough to run comfortably on almost any laptop, CPU-only included.
Temperature, top-K, and top-P are applied in sequence to shape how tokens get sampled.
The context window is finite working memory — it fills up and older tokens get dropped.

NEXT STEP Install Ollama or LM Studio today, pull liquid/lfm2.5-1.2b, and spend 15 minutes just changing one slider at a time to feel the difference.

Quick Revision

Master Cheat Sheet

Every core reference table from the course, in one place.

Model Size → Hardware

1–2B	4 GB RAM, any laptop
7–9B	16 GB RAM / RTX 3060+
13–14B	24 GB, 12GB+ VRAM GPU
30–34B	32–64 GB, RTX 4090 / Apple Max
70B+	64–128 GB, multi-GPU

Quantization Ladder

FP16	100% size · best quality
Q8_0	~50% size · near-lossless
Q4_K_M	~28% size · sweet spot ✅
Q2_K	~16% size · visible quality loss

Sampling Pipeline Order

Raw Probabilities → Apply Temperature → Apply Top-K → Apply Top-P → Sample

Ollama Essentials

ollama pull <model>
ollama run <model>
ollama list

A–Z Reference

Glossary — Flip to Reveal

Click any card to flip it and see the definition.

Final Assessment

Final Comprehensive Assessment

15 mixed questions spanning all six modules. Your live score is tracked at the top.

Go Further

Resources & Links

Official docs and libraries mentioned in this course.

🦙

ollama.com

Download Ollama & browse the model library

🖥️

lmstudio.ai

Download LM Studio's desktop app

🤗

huggingface.co/models

Browse thousands of open models & quantizations

💻

github.com/ollama/ollama

Ollama source, issues, and Modelfile docs

⚙️

llama.cpp

The inference engine behind many local-AI tools

📚

Large Language Models — overview

Background reading on how these models work