FLOWLYTIX AI BOOTCAMP SERIES

Running Small AI Models Locally with Ollama & LM Studio

Choose a model that fits your machine, set it up correctly, and understand temperature, top-K, top-P, and context window — hands-on with LiquidAI's LFM2-1.2B.

6
Modules
1.2B
Model Params
2
Tools Covered
10+
Interactive Tools
bash — local-ai
Module 1 · Why Local

1Why Run AI Locally

Understanding the tradeoffs before you install anything.

Local AI vs. Cloud AI

Same idea — run inference — in two very different places. Neither is universally "better"; they trade off differently depending on what you need.

CLOUD AI
  • Runs on someone else's servers (OpenAI, Anthropic, Google)
  • Access via an API call or web app over the internet
  • Scales to huge models (100B+ params) with no local hardware
  • Usually metered — you pay per token or per subscription
  • Your prompts and data pass through a third party's servers
LOCAL AI
  • Runs entirely on your own laptop or desktop
  • Access via a CLI (Ollama) or desktop app (LM Studio)
  • Limited to smaller models your hardware can actually hold
  • Free to run once downloaded — no per-token billing
  • Nothing leaves your machine — fully private by default

When Local Models Make Sense

Local isn't "better" or "worse" than cloud — it's the right tool for specific situations.

🔒 Privacy-Sensitive Work

Legal, medical, or proprietary business data that should never leave your device.

📡 Offline / Unreliable Internet

Fieldwork, travel, or environments where you can't depend on a connection.

⚡ Low-Latency Needs

No network round-trip — responses start generating immediately.

💰 Cost at Scale

Running thousands of requests without per-token API charges.

🧪 Experimentation & Learning

Freely test prompts, fine-tuning, and parameters without burning credits.

🎯 Custom / Fine-Tuned Models

Running a model you've fine-tuned on your own data or task.

KEY IDEA Local and cloud AI aren't rivals — many practitioners use both, reaching for whichever fits the moment.

🧩 Puzzle: Local or Cloud?

Select a scenario chip below, then click the bucket where it belongs.

Module 2 · Choosing a Model

2Choosing the Right Model

Matching parameter count and quantization to the machine you actually have.

What Does "1.2B" Actually Mean?

The number refers to parameters — the learned weights inside the model. A parameter is one learned number (a weight or bias) inside the network. More parameters generally means more capacity to capture patterns — but also more memory and compute required to run.

Params Tier Typical Hardware Examples
~1–3B Tiny Runs on almost any laptop, even CPU-only LFM2-1.2B, Qwen2.5-1.5B
~7–9B Small Comfortable on a mid-range laptop with 16GB RAM Llama 3.1 8B, Gemma 2 9B
~13–34B Medium Needs a dedicated GPU with 16GB+ VRAM for good speed Mixtral, Qwen2.5 32B
70B+ Large Multi-GPU workstation or server territory Llama 3.1 70B, DeepSeek 67B

Quantization: Shrinking Models to Fit

Quantization stores each weight with fewer bits — smaller file, less memory, faster inference, slight quality loss.

Format Bits / Weight Relative Size Notes
FP16 16 bits 100% Full precision. Highest quality, largest size.
Q8_0 8 bits ~50% Near-lossless quality at half the size.
Q4_K_M ~4.5 bits ~28% The sweet spot most people use locally.
Q2_K ~2.5 bits ~16% Very small, but noticeable quality loss.
RULE OF THUMB Q4_K_M (or Q4_0) is the default most people should reach for — it keeps quality high while roughly quartering the memory footprint of the full model.

CPU vs. GPU Inference

CPU INFERENCE
  • Works on any machine — no special hardware needed
  • Uses system RAM, usually more plentiful than VRAM
  • Slower token generation, especially on larger models
  • The default fallback in both Ollama and LM Studio

Best for: small models (1–4B) or machines without a dedicated GPU.

GPU INFERENCE
  • Needs a compatible GPU (NVIDIA CUDA, Apple Metal, AMD ROCm)
  • Model must fit in VRAM — the GPU's dedicated fast memory
  • Dramatically faster token generation via massive parallelism
  • Both tools auto-detect and offload layers to GPU when available

Best for: 7B+ models, or fast, snappy responses for daily use.

🛠️ Interactive Tool: Hardware Matchmaker

Drag the sliders to describe your machine — we'll recommend a model tier and quantization to start with.

Module 3 · Tools

3Ollama & LM Studio

The two most popular ways to run models locally — one CLI, one GUI.

OLLAMA

A lightweight, command-line-first runtime

  • Runs as a background service; interact via terminal or its API
  • Ships models as versioned "model names" — pull like Docker images
  • Uses a Modelfile system to configure prompts & parameters
  • Exposes a local REST API (default port 11434)
  • Extremely fast to get started: install → one command → chatting
Interface Command line + local API
Platforms macOS, Windows, Linux
Best for Developers, scripting, automation
License Free & open-source
LM STUDIO

A polished desktop app for local models

  • A full graphical app — browse, download, chat visually
  • Built-in model search with hardware compatibility indicators
  • Chat UI with sliders for temperature, top-K, top-P, context
  • Can run a local server mimicking the OpenAI API format
  • Great for beginners who want to see and tune everything
Interface Desktop GUI + optional local API
Platforms macOS, Windows, Linux
Best for Beginners, visual experimentation
License Free to use

Side by Side

Ollama LM Studio
Interface Command line Graphical desktop app
Learning curve A little steeper — needs terminal comfort Very beginner-friendly
Parameter tuning Via Modelfile or API request Live sliders in the chat UI
Scripting / automation Excellent — built for it Possible via its local server
Model discovery ollama.com library, pull by name Built-in search with compatibility hints
Resource overhead Minimal, runs as a lean service Slightly heavier (full desktop app)

▶️ Try It: Installing & Setting Up Ollama

bash — setup
SETTING UP LM STUDIO (NO TERMINAL NEEDED) 1. Download from lmstudio.ai and install like any desktop app  ·  2. Search for a model by name — LM Studio flags what fits your hardware  ·  3. Pick Q4_K_M and click Download  ·  4. Open the Chat tab, load the model, and start typing.
Module 4 · Hands-On

4Hands-On: LFM2-1.2B

Downloading and talking to our test model in both tools.

MODEL CARD — liquid/lfm2.5-1.2b

LFM2 is LiquidAI's "Liquid Foundation Model" family, built around a hybrid architecture designed for fast, efficient inference on everyday hardware — not just data-center GPUs.

Parameters 1.2 billion
Typical size (Q4) ~700 MB – 1 GB
Minimum RAM ~2 GB free
Runs well on CPU-only laptops
Good for Chat, summarizing, quick drafting

✔️ Interactive Checklist — Running It in Ollama

IN LM STUDIO INSTEAD Search "lfm2.5-1.2b" → pick the Q4_K_M version (green fit indicator) → Download → open the Chat tab → select the model to load it → type your message. Sliders for temperature, top-K, top-P, and context length sit on the right.

What Just Happened? (Puzzle)

You just sent your first local message. Put these five steps back in the correct order using the arrows.

Module 5 · Parameters

5Generation Parameters Explained

The dials that shape how a model picks its next word.

At every step, the model outputs a probability for every possible next token. Temperature reshapes that distribution, top-K throws away everything except the K most likely tokens, and top-P keeps adding tokens until their probabilities add up to P. Most tools apply all three, in that order, every single token.

🎛️ Live Simulator: Temperature × Top-K × Top-P

Prompt: "The cat sat on the ___" — drag the sliders and watch the candidate pool reshape in real time.
NUCLEUS SAMPLING Top-P is called "nucleus" sampling because the candidate pool shrinks automatically when the model is confident (one word dominates) and grows automatically when it's uncertain — top-K can't adapt like that.

The Full Pipeline (Puzzle)

Every token goes through these five stages, in order. Reorder them correctly.

FULLY DETERMINISTIC MODE Set temperature → 0, top-K → 1, or top-P → very low, and generation becomes fully deterministic — the same prompt always gives the same output.

Context Window: The Model's Working Memory

The maximum number of tokens (prompt + conversation + response) the model can "see" at once. Once this window fills up, the oldest tokens must be dropped to make room for new ones — longer conversations get "forgotten."

💬 Interactive: Context Window Explorer

Window size fixed at 4,096 tokens · ~250 tokens per exchange. Drag to simulate a growing conversation.

Recommended Settings Cheat Sheet

Task Temperature Top-K Top-P
Factual Q&A / Coding 0.1 – 0.3 20 – 40 0.8 – 0.9
General Chat / Assistant 0.5 – 0.7 40 – 60 0.9 – 0.95
Creative Writing 0.8 – 1.1 60 – 100 0.92 – 0.98
Brainstorming / Ideation 1.0 – 1.3 80 – 120 0.95 – 1.0
FOR LFM2-1.2B SPECIFICALLY Start with temperature 0.6–0.7, top-K 40, top-P 0.9 for general use. Being a smaller model, it benefits from staying a bit more conservative than you might with a larger 7B+ model.
Module 6 · Troubleshooting

6Tips & Troubleshooting

The handful of issues almost everyone hits in their first week. Click a symptom to reveal the fix.

Course Recap

  • Local AI trades cloud scale for privacy, offline access, and zero per-token cost.
  • Model size (parameters) and quantization determine what fits on your machine.
  • Ollama gives you a fast CLI workflow; LM Studio gives you a visual, beginner-friendly one.
  • LFM2-1.2B is small enough to run comfortably on almost any laptop, CPU-only included.
  • Temperature, top-K, and top-P are applied in sequence to shape how tokens get sampled.
  • The context window is finite working memory — it fills up and older tokens get dropped.
NEXT STEP Install Ollama or LM Studio today, pull liquid/lfm2.5-1.2b, and spend 15 minutes just changing one slider at a time to feel the difference.
Quick Revision

Master Cheat Sheet

Every core reference table from the course, in one place.

Model Size → Hardware

1–2B 4 GB RAM, any laptop
7–9B 16 GB RAM / RTX 3060+
13–14B 24 GB, 12GB+ VRAM GPU
30–34B 32–64 GB, RTX 4090 / Apple Max
70B+ 64–128 GB, multi-GPU

Quantization Ladder

FP16 100% size · best quality
Q8_0 ~50% size · near-lossless
Q4_K_M ~28% size · sweet spot ✅
Q2_K ~16% size · visible quality loss

Sampling Pipeline Order

Raw Probabilities → Apply Temperature → Apply Top-K → Apply Top-P → Sample

Ollama Essentials

ollama pull <model>
ollama run <model>
ollama list

A–Z Reference

Glossary — Flip to Reveal

Click any card to flip it and see the definition.

Final Assessment

Final Comprehensive Assessment

15 mixed questions spanning all six modules. Your live score is tracked at the top.