Right … AI acronyms. I started off looking at LLMs (Large Language Models), but now I also look at SLMs, LAMs, VLAs, and MoEs.
Are these just marketing buzzwords? Sometimes. But usually, they describe a fundamental shift in what the AI is actually doing. We are moving from models that just “chat” to models that can “see,” “hear,” “act,” and—crucially—remember.
Here is the definitive taxonomy—the technical breakdown of the models and systems actually running the streets in 2026.
1. The Core Models (The Brains)
| Acronym | Full Name | Input(s) | Output(s) | The Technical Distinction |
| --- | --- | --- | --- | --- |
| LLM | Large Language Model | Text, Code | Text, Code | The Baseline. Uses the Transformer architecture to predict the next token. Purely text-based; it has no eyes or ears, only text data. |
| SLM | Small Language Model | Text | Text | The Edge Runner. Optimised for phones and laptops (usually under 7B parameters). Sacrifices some reasoning depth for privacy, speed, and offline capability. |
| VLM | Vision-Language Model | Images + Text | Text | The Eyeball. An LLM with a “vision encoder” (like CLIP) bolted on. It “tokenises” images into a representation the LLM understands, allowing you to chat about pixels (sketched just after this table). |
| MLLM | Multimodal Large Language Model | Text, Audio, Video, Images | Text, Audio, Images | The Godzilla. The evolution of the VLM. It doesn’t just “see” static images; it processes time-series data (video/audio) natively, effectively “watching” and “listening” in real time. |
| LAM | Large Action Model | User intent (Text/Voice) | Actions (JSON, API calls, clicks) | The Agent. Designed for doing. Instead of writing a poem, it outputs executable actions (e.g., navigating a UI, clicking buttons, calling an API) to perform work (see the output-contract sketch below). |
| VLA | Vision-Language-Action model | Visual feed + Text instruction | Robot actuation | The Robot Brain. It takes a camera feed and a command (“Pick up the apple”) and outputs direct motor controls (joint angles/velocities) rather than text. |
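To make the VLM row concrete: the “vision encoder bolted on” essentially chops the image into patches, projects each patch into the same embedding space as the text tokens, and hands the whole lot to the LLM as one sequence. The patch size, dimensions, and random weights below are placeholders rather than a real trained encoder; this is a minimal Python sketch of the idea.

```python
import random

random.seed(0)
IMG, PATCH, DIM = 64, 16, 64   # toy sizes; real ViT-style encoders use e.g. 224x224 images

# A fake greyscale image as a 2D list of pixel intensities.
image = [[random.random() for _ in range(IMG)] for _ in range(IMG)]

def patchify(img, patch=PATCH):
    """Split the image into flat patch vectors, ViT-style."""
    patches = []
    for r in range(0, IMG, patch):
        for c in range(0, IMG, patch):
            patches.append([img[r + i][c + j] for i in range(patch) for j in range(patch)])
    return patches

# A learned linear projection maps each patch into the LLM's token-embedding space.
# Here it is just random numbers standing in for trained weights.
projection = [[random.gauss(0, 0.01) for _ in range(DIM)] for _ in range(PATCH * PATCH)]

def project(patch_vec):
    return [sum(p * w for p, w in zip(patch_vec, col)) for col in zip(*projection)]

image_tokens = [project(p) for p in patchify(image)]
text_tokens = ["<user>", "What", "is", "in", "this", "picture", "?"]  # stand-in text tokens

# The LLM then attends over one long sequence: image tokens first, text tokens after.
print(f"{len(image_tokens)} image tokens + {len(text_tokens)} text tokens -> one sequence")
```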
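And the cleanest way to see the LLM vs LAM vs VLA split is to compare what each one actually returns. Here is a minimal sketch of the three output contracts; the function names, the calendar tool, and the joint values are all hypothetical, chosen purely for illustration.

```python
import json
from dataclasses import dataclass, asdict

# LLM contract: text in, text out.
def llm_respond(prompt: str) -> str:
    return "Here is a short poem about apples..."   # plain prose

# LAM contract: intent in, executable action out (JSON that a UI/API layer can run).
def lam_respond(intent: str) -> str:
    action = {
        "tool": "calendar.create_event",            # hypothetical tool name
        "args": {"title": "Standup", "time": "2026-01-05T09:00"},
    }
    return json.dumps(action)

# VLA contract: camera frame + instruction in, motor commands out.
@dataclass
class MotorCommand:
    joint_velocities: list[float]   # one value per joint, e.g. rad/s
    gripper_closed: bool

def vla_respond(camera_frame: bytes, instruction: str) -> MotorCommand:
    # A real VLA regresses these numbers from pixels + text; these are placeholders.
    return MotorCommand(joint_velocities=[0.1, -0.05, 0.0, 0.2], gripper_closed=False)

print(llm_respond("Write a poem about apples"))
print(lam_respond("Book my standup for Monday 9am"))
print(asdict(vla_respond(b"<jpeg bytes>", "Pick up the apple")))
```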
2. The Architecture & Memory (The Systems)
This is where the magic happens for businesses. These aren’t just “models”; they are strategies for making models smarter and faster.
| Acronym | Full Name | The Technical Distinction |
| --- | --- | --- |
| RAG | Retrieval-Augmented Generation | The Librarian. Not a set of model weights, but a system. It connects a frozen LLM to your private data via a vector database. Before answering, it “looks up” the relevant facts. This is the standard way to cut down hallucinations (see the sketch after this table). |
| CAG | Cache-Augmented Generation | The RAM Upgrade. The evolution of RAG. Instead of searching a database at query time, you load the entire dataset (books, codebases, videos) into the model’s massive context window up front and precompute its KV cache. It eliminates the search step entirely. |
| MoE | Mixture of Experts | The Efficiency Hack. Instead of one giant dense network, the model is built from many smaller “expert” sub-networks. A learned “router” activates only a few experts per token, so only a fraction of the parameters run for any given input, saving massive compute (see the routing sketch below). |
| LCM | Latent Consistency Model | The Speed Demon. A hyper-fast variant of diffusion models. It creates high-quality images in huge leaps (1-4 steps) rather than the slow 50+ steps of traditional Stable Diffusion. |
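To see where RAG and CAG actually differ, here is a toy sketch. The `embed` and `generate` functions are crude stand-ins for a real embedding model and a real LLM call (both assumptions, not any particular library); the point is where the documents enter the prompt, not the quality of the retrieval.

```python
import math
from collections import Counter

DOCS = {
    "handbook": "Staff may work remotely up to three days per week.",
    "expenses": "Claims over 50 GBP require a line manager's approval.",
    "security": "Laptops must be encrypted and locked when unattended.",
}

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words vector. A real system uses a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate(prompt: str) -> str:
    """Stand-in for the frozen LLM call."""
    return f"[LLM answer based on a prompt of {len(prompt)} chars]"

# RAG: search first, then stuff only the best match into the prompt.
def rag_answer(question: str) -> str:
    q = embed(question)
    best = max(DOCS, key=lambda name: cosine(q, embed(DOCS[name])))
    return generate(f"Context:\n{DOCS[best]}\n\nQuestion: {question}")

# CAG: skip the search; load everything into the (assumed huge) context window up front.
def cag_answer(question: str) -> str:
    full_context = "\n".join(DOCS.values())
    return generate(f"Context:\n{full_context}\n\nQuestion: {question}")

print(rag_answer("How many days can I work from home?"))
print(cag_answer("How many days can I work from home?"))
```

The trade-off: CAG removes the risk of a bad retrieval, but pays for it in context length and compute per query, which is why it only became practical once very long context windows arrived.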
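And here is what the MoE “router” does mechanically, in toy form: a learned gate scores every expert for each token, and only the top-k (here 2 of 8) are actually executed. The weights and dimensions are random placeholders rather than a trained model, and real MoE layers do this inside every Transformer block, not once per prompt.

```python
import math
import random

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

# Each "expert" here is a tiny element-wise stand-in; real experts are full MLP blocks.
experts = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# The router is a learned linear layer that scores every expert for a given token.
router = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token_vec):
    # 1. The router scores every expert for this token.
    scores = [sum(w * x for w, x in zip(router[e], token_vec)) for e in range(NUM_EXPERTS)]
    gate = softmax(scores)
    # 2. Only the top-k experts are kept; the other six never run -- that is the compute saving.
    #    (Many implementations renormalise the gate over just the top-k; skipped for brevity.)
    top = sorted(range(NUM_EXPERTS), key=lambda e: gate[e], reverse=True)[:TOP_K]
    # 3. The output is the gate-weighted sum of the chosen experts' outputs.
    out = [0.0] * DIM
    for e in top:
        expert_out = [w * x for w, x in zip(experts[e], token_vec)]   # stand-in expert computation
        out = [o + gate[e] * y for o, y in zip(out, expert_out)]
    return top, out

token = [random.gauss(0, 1) for _ in range(DIM)]
chosen, _ = moe_forward(token)
print(f"This token was routed to experts {chosen} out of {NUM_EXPERTS}")
```

This is how an MoE model can carry a huge total parameter count while costing roughly what a much smaller dense model costs per token.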
Why This Matters
The most important shift in this table is the move from Language to Action and Memory.
- LLMs are thinkers. They live in a box and dream up text.
- RAG/CAG gives them memory. It lets them read your company handbook.
- LAMs/VLAs give them hands. They let them actually do the work.
We aren’t just building chatbots anymore; we are building digital employees and physical labourers.
Learn More
For a deeper visual breakdown of how these architectures stack up against each other, check out this excellent explainer:

