Right … AI acronyms. I started off looking at LLMs (Large Language Models), but now I also look at SLMs, LAMs, VLAs, and MoEs.
Are these just marketing buzzwords? Sometimes. But usually, they describe a fundamental shift in what the AI is actually doing. We are moving from models that just “chat” to models that can “see,” “hear,” “act,” and—crucially—remember.
Here is the definitive taxonomy—the technical breakdown of the models and systems actually running the streets in 2026.
1. The Core Models (The Brains)
| Acronym | Full Name | Input(s) | Output(s) | The Technical Distinction |
| --- | --- | --- | --- | --- |
| LLM | Large Language Model | Text, Code | Text, Code | The Baseline. Uses the Transformer architecture to predict the next token. Purely text-based; it has no eyes or ears, only text data. |
| SLM | Small Language Model | Text | Text | The Edge Runner. Optimised for phones and laptops (usually under 7B parameters). Sacrifices some reasoning depth for privacy, speed, and offline capability. |
| VLM | Vision-Language Model | Images + Text | Text | The Eyeball. An LLM with a “vision encoder” (like CLIP) bolted on. It “tokenises” images into a representation the LLM understands, allowing you to chat about pixels (sketched just after this table). |
| MLLM | Multimodal Large Language Model | Text, Audio, Video, Images | Text, Audio, Images | The Godzilla. The evolution of the VLM. It doesn’t just “see” static images; it processes time-series data (video/audio) natively, effectively “watching” and “listening” in real time. |
| LAM | Large Action Model | User intent (Text/Voice) | Actions (JSON, API calls, clicks) | The Agent. Designed for doing. Instead of writing a poem, it outputs executable actions (e.g., navigating a UI, clicking buttons, calling an API) to perform work (see the output-contract sketch below). |
| VLA | Vision-Language-Action model | Visual feed + Text instruction | Robot actuation | The Robot Brain. It takes a camera feed and a command (“Pick up the apple”) and outputs direct motor controls (joint angles/velocities) rather than text. |
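To make the VLM row concrete: the “vision encoder bolted on” essentially chops the image into patches, projects each patch into the same embedding space as the text tokens, and hands the whole lot to the LLM as one sequence. The patch size, dimensions, and random weights below are placeholders rather than a real trained encoder; this is a minimal Python sketch of the idea.

```python
import random

random.seed(0)
IMG, PATCH, DIM = 64, 16, 64   # toy sizes; real ViT-style encoders use e.g. 224x224 images

# A fake greyscale image as a 2D list of pixel intensities.
image = [[random.random() for _ in range(IMG)] for _ in range(IMG)]

def patchify(img, patch=PATCH):
    """Split the image into flat patch vectors, ViT-style."""
    patches = []
    for r in range(0, IMG, patch):
        for c in range(0, IMG, patch):
            patches.append([img[r + i][c + j] for i in range(patch) for j in range(patch)])
    return patches

# A learned linear projection maps each patch into the LLM's token-embedding space.
# Here it is just random numbers standing in for trained weights.
projection = [[random.gauss(0, 0.01) for _ in range(DIM)] for _ in range(PATCH * PATCH)]

def project(patch_vec):
    return [sum(p * w for p, w in zip(patch_vec, col)) for col in zip(*projection)]

image_tokens = [project(p) for p in patchify(image)]
text_tokens = ["<user>", "What", "is", "in", "this", "picture", "?"]  # stand-in text tokens

# The LLM then attends over one long sequence: image tokens first, text tokens after.
print(f"{len(image_tokens)} image tokens + {len(text_tokens)} text tokens -> one sequence")
```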
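And the cleanest way to see the LLM vs LAM vs VLA split is to compare what each one actually returns. Here is a minimal sketch of the three output contracts; the function names, the calendar tool, and the joint values are all hypothetical, chosen purely for illustration.

```python
import json
from dataclasses import dataclass, asdict

# LLM contract: text in, text out.
def llm_respond(prompt: str) -> str:
    return "Here is a short poem about apples..."   # plain prose

# LAM contract: intent in, executable action out (JSON that a UI/API layer can run).
def lam_respond(intent: str) -> str:
    action = {
        "tool": "calendar.create_event",            # hypothetical tool name
        "args": {"title": "Standup", "time": "2026-01-05T09:00"},
    }
    return json.dumps(action)

# VLA contract: camera frame + instruction in, motor commands out.
@dataclass
class MotorCommand:
    joint_velocities: list[float]   # one value per joint, e.g. rad/s
    gripper_closed: bool

def vla_respond(camera_frame: bytes, instruction: str) -> MotorCommand:
    # A real VLA regresses these numbers from pixels + text; these are placeholders.
    return MotorCommand(joint_velocities=[0.1, -0.05, 0.0, 0.2], gripper_closed=False)

print(llm_respond("Write a poem about apples"))
print(lam_respond("Book my standup for Monday 9am"))
print(asdict(vla_respond(b"<jpeg bytes>", "Pick up the apple")))
```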
2. The Architecture & Memory (The Systems)
This is where the magic happens for businesses. These aren’t just “models”; they are strategies for making models smarter and faster.
| Acronym | Full Name | The Technical Distinction |
| --- | --- | --- |
| RAG | Retrieval-Augmented Generation | The Librarian. Not a set of model weights, but a system. It connects a frozen LLM to your private data via a vector database. Before answering, it “looks up” the relevant facts. This is the standard way to cut down hallucinations (see the sketch after this table). |
| CAG | Cache-Augmented Generation | The RAM Upgrade. The evolution of RAG. Instead of searching a database at query time, you load the entire dataset (books, codebases, videos) into the model’s massive context window up front and precompute its KV cache. It eliminates the search step entirely. |
| MoE | Mixture of Experts | The Efficiency Hack. Instead of one giant dense network, the model is built from many smaller “expert” sub-networks. A learned “router” activates only a few experts per token, so only a fraction of the parameters run for any given input, saving massive compute (see the routing sketch below). |
| LCM | Latent Consistency Model | The Speed Demon. A hyper-fast variant of diffusion models. It creates high-quality images in huge leaps (1-4 steps) rather than the slow 50+ steps of traditional Stable Diffusion. |
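To see where RAG and CAG actually differ, here is a toy sketch. The `embed` and `generate` functions are crude stand-ins for a real embedding model and a real LLM call (both assumptions, not any particular library); the point is where the documents enter the prompt, not the quality of the retrieval.

```python
import math
from collections import Counter

DOCS = {
    "handbook": "Staff may work remotely up to three days per week.",
    "expenses": "Claims over 50 GBP require a line manager's approval.",
    "security": "Laptops must be encrypted and locked when unattended.",
}

def embed(text: str) -> Counter:
    """Stand-in embedding: a bag-of-words vector. A real system uses a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate(prompt: str) -> str:
    """Stand-in for the frozen LLM call."""
    return f"[LLM answer based on a prompt of {len(prompt)} chars]"

# RAG: search first, then stuff only the best match into the prompt.
def rag_answer(question: str) -> str:
    q = embed(question)
    best = max(DOCS, key=lambda name: cosine(q, embed(DOCS[name])))
    return generate(f"Context:\n{DOCS[best]}\n\nQuestion: {question}")

# CAG: skip the search; load everything into the (assumed huge) context window up front.
def cag_answer(question: str) -> str:
    full_context = "\n".join(DOCS.values())
    return generate(f"Context:\n{full_context}\n\nQuestion: {question}")

print(rag_answer("How many days can I work from home?"))
print(cag_answer("How many days can I work from home?"))
```

The trade-off: CAG removes the risk of a bad retrieval, but pays for it in context length and compute per query, which is why it only became practical once very long context windows arrived.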
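And here is what the MoE “router” does mechanically, in toy form: a learned gate scores every expert for each token, and only the top-k (here 2 of 8) are actually executed. The weights and dimensions are random placeholders rather than a trained model, and real MoE layers do this inside every Transformer block, not once per prompt.

```python
import math
import random

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 8, 2, 16

# Each "expert" here is a tiny element-wise stand-in; real experts are full MLP blocks.
experts = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# The router is a learned linear layer that scores every expert for a given token.
router = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token_vec):
    # 1. The router scores every expert for this token.
    scores = [sum(w * x for w, x in zip(router[e], token_vec)) for e in range(NUM_EXPERTS)]
    gate = softmax(scores)
    # 2. Only the top-k experts are kept; the other six never run -- that is the compute saving.
    #    (Many implementations renormalise the gate over just the top-k; skipped for brevity.)
    top = sorted(range(NUM_EXPERTS), key=lambda e: gate[e], reverse=True)[:TOP_K]
    # 3. The output is the gate-weighted sum of the chosen experts' outputs.
    out = [0.0] * DIM
    for e in top:
        expert_out = [w * x for w, x in zip(experts[e], token_vec)]   # stand-in expert computation
        out = [o + gate[e] * y for o, y in zip(out, expert_out)]
    return top, out

token = [random.gauss(0, 1) for _ in range(DIM)]
chosen, _ = moe_forward(token)
print(f"This token was routed to experts {chosen} out of {NUM_EXPERTS}")
```

This is how an MoE model can carry a huge total parameter count while costing roughly what a much smaller dense model costs per token.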
Why This Matters
The most important shift in this table is the move from Language to Action and Memory.
- LLMs are thinkers. They live in a box and dream up text.
- RAG/CAG gives them memory. It lets them read your company handbook.
- LAMs/VLAs give them hands. They let them actually do the work.
We aren’t just building chatbots anymore; we are building digital employees and physical labourers.
Learn More
For a deeper visual breakdown of how these architectures stack up against each other, check out this excellent explainer:

