Ollama Models Directory

Model Analysis for Tokenization Method Classification - With Accurate VRAM Calculations

Note: Due to CORS restrictions, this page displays a snapshot of Ollama models. To refresh data, use a backend proxy or browser extension to bypass CORS.

What This Tool Provides:

Understanding tokenization and VRAM requirements is crucial for deploying LLMs effectively. Use this tool to plan your infrastructure and optimize your model selection.

About Tokenization Classification

Different models use various tokenization methods that affect context length calculations:

Verified Information: This classification is based on official documentation and technical reports from each model provider.


Confirmed Tokenization Methods by Model Family

SentencePiece:

  • LLaMA 1 & 2 (LLaMA 3 moved to a tiktoken-style BPE tokenizer)
  • Mistral (pre-NeMo)
  • Gemma (Google)
  • T5 models

BPE (Byte-Pair Encoding):

  • Qwen (byte-level BPE)
  • DeepSeek (byte-level BPE)
  • Granite/IBM (StarCoder BPE)
  • Cohere (Command, Aya)
  • Phi (earlier versions)
  • StarCoder
  • Falcon

Tiktoken (BPE variant):

  • Mistral NeMo & newer (Tekken tokenizer)
  • LLaMA 3 and later
  • Phi-4
  • GPT models

WordPiece:

  • BERT-based models
  • Some embedding models
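
To double-check how a given family tokenizes text, one quick way is to load its tokenizer and inspect the class and output. A minimal sketch using the Hugging Face transformers library (repository IDs are examples only, some repositories are gated, and network access is assumed):

    # Minimal sketch, assuming the `transformers` package and network access.
    # Repository IDs are illustrative; some (e.g. Gemma, LLaMA) are gated.
    from transformers import AutoTokenizer

    for repo_id in ["Qwen/Qwen2-7B", "google/gemma-2b"]:
        tok = AutoTokenizer.from_pretrained(repo_id)
        # The tokenizer class name hints at the method (e.g. GemmaTokenizerFast
        # wraps a SentencePiece model, Qwen2TokenizerFast is byte-level BPE).
        print(repo_id, type(tok).__name__, "vocab:", tok.vocab_size)
        print(tok.tokenize("Tokenization affects context length."))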

VRAM Usage Calculator for Context Length

Accurate VRAM Calculation Formulas:

Model Weights VRAM:
Model VRAM (GB) = num_parameters × bytes_per_param / 1024³
(for example, a 7B-parameter model at FP16: 7×10⁹ × 2 / 1024³ ≈ 13 GB)

• FP16: 2 bytes per parameter
• INT8: 1 byte per parameter
• INT4: 0.5 bytes per parameter
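
A minimal sketch of the weights formula above (function and constant names are my own, not part of the tool):

    # Sketch of the model-weights formula; names are illustrative.
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    def model_weights_gb(num_params: float, precision: str = "fp16") -> float:
        """VRAM needed for the weights alone, in GB."""
        return num_params * BYTES_PER_PARAM[precision] / 1024**3

    print(round(model_weights_gb(7e9, "fp16"), 2))  # ~13.04
    print(round(model_weights_gb(7e9, "int4"), 2))  # ~3.26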

KV Cache (Context) VRAM:
KV Cache VRAM (GB) = 2 × num_layers × hidden_dim × context_length × bytes_per_value / 1024³

Simplified approximation based on model size:

KV Cache VRAM ≈ context_length × model_size_B × kv_multiplier / 1024

Where kv_multiplier varies by architecture:
• Standard MHA: ~0.125 MB per token per B
• GQA (8 groups): ~0.016 MB per token per B
• MQA: ~0.008 MB per token per B

Technical Note: Modern models often use optimizations like Grouped Query Attention (GQA) or Multi-Query Attention (MQA) which significantly reduce KV cache size. This calculator uses conservative estimates for standard Multi-Head Attention unless specified otherwise.
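
Both KV-cache formulas can be sketched the same way. The multipliers are the ones listed above; the example configuration (32 layers, hidden dimension 4096 for a 7B-class model) is an assumption for illustration:

    # Sketch of both KV-cache formulas above (MHA layout); names are illustrative.
    def kv_cache_gb_exact(num_layers: int, hidden_dim: int, context_length: int,
                          bytes_per_value: float = 2.0) -> float:
        """Exact formula: 2 (K and V) x layers x hidden_dim x tokens x bytes."""
        return 2 * num_layers * hidden_dim * context_length * bytes_per_value / 1024**3

    KV_MULTIPLIER_MB = {"mha": 0.125, "gqa8": 0.016, "mqa": 0.008}

    def kv_cache_gb_approx(context_length: int, model_size_b: float,
                           attention: str = "mha") -> float:
        """Simplified approximation: MB per token per billion parameters."""
        return context_length * model_size_b * KV_MULTIPLIER_MB[attention] / 1024

    # A 7B-class model (assumed 32 layers, hidden 4096) at 8K context, FP16 cache:
    print(round(kv_cache_gb_exact(32, 4096, 8192), 2))    # ~4.0
    print(round(kv_cache_gb_approx(8192, 7, "mha"), 2))   # ~7.0 (conservative)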

Tokenization Efficiency Factors:

Different tokenization methods require different numbers of tokens for the same text:

  • Efficient (1.0x): Tiktoken, Modern BPE (Qwen, DeepSeek) - baseline
  • Standard (1.1x): SentencePiece (LLaMA, Gemma) - 10% more tokens
  • Less Efficient (1.2x): Older BPE, multilingual models - 20% more tokens
  • Least Efficient (1.3x): Character-heavy languages, older models - 30% more tokens

Note: Higher factors mean MORE tokens needed for the same text, thus MORE VRAM usage.
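
A sketch of how these factors change the token (and therefore KV-cache) budget for the same text; the ~0.75 words-per-token baseline is an assumption, not something the tool specifies:

    # Sketch: applying the efficiency factors above to a word budget.
    EFFICIENCY_FACTOR = {"tiktoken": 1.0, "sentencepiece": 1.1,
                         "older_bpe": 1.2, "char_heavy": 1.3}

    def estimated_tokens(word_count: int, method: str) -> int:
        baseline_tokens = word_count / 0.75     # rough English baseline (assumed)
        return int(baseline_tokens * EFFICIENCY_FACTOR[method])

    # The same 6,000-word document needs more tokens (and more VRAM)
    # under a less efficient tokenizer:
    for method in EFFICIENCY_FACTOR:
        print(method, estimated_tokens(6000, method))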


Quick Reference Table (7B Model, Standard Tokenization, MHA):

Context Length | KV Cache VRAM | Total @ FP16 | Total @ INT4

* Total includes: Model weights + KV Cache + Overhead. INT4 quantization only affects model weights.
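
The rows of this table can be reproduced from the formulas above. A sketch, using the conservative MHA multiplier and an assumed 15% overhead allowance (the exact overhead figure and the function name are my own):

    # Sketch reproducing the 7B quick-reference rows; 15% overhead is assumed.
    def table_row(context_length: int) -> str:
        weights_fp16 = 7e9 * 2.0 / 1024**3
        weights_int4 = 7e9 * 0.5 / 1024**3
        kv = context_length * 7 * 0.125 / 1024       # MHA, FP16 KV cache
        total_fp16 = (weights_fp16 + kv) * 1.15
        total_int4 = (weights_int4 + kv) * 1.15      # KV cache stays FP16
        return (f"{context_length:>7} | {kv:5.1f} GB | "
                f"{total_fp16:5.1f} GB | {total_int4:5.1f} GB")

    print("Context | KV Cache | Total @ FP16 | Total @ INT4")
    for ctx in (2048, 4096, 8192, 16384, 32768):
        print(table_row(ctx))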

GPU Recommendations by Use Case:

RTX 4060 Ti (16GB): 7B models @ 4K context (FP16), 13B models @ 4K context (INT4)
RTX 4070 Ti Super (16GB): 7B models @ 8K context (FP16), 13B models @ 8K context (INT4)
RTX 4090 (24GB): 13B models @ 8K context (INT8), 30B models @ 4K context (INT4)
A100 (40GB): 30B models @ 8K context (INT8), 70B models @ 4K context (INT4)
A100 (80GB) / H100: 70B models @ 8K context (INT8), 70B models @ 32K context (INT4)

FP16 weights alone exceed these cards above the 7B class (13B ≈ 26 GB, 30B ≈ 60 GB, 70B ≈ 140 GB), so the larger entries assume INT8 or INT4 quantization.

Note: These recommendations include overhead for inference. Actual performance may vary based on batch size and specific optimizations.
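
A sketch of how the recommendations above can be sanity-checked against the formulas; the 15% overhead reserve is an assumption, and the default multiplier is the conservative MHA figure:

    # Sketch: does a model/context/precision combination fit a given GPU?
    def fits(vram_gb: float, params_b: float, context: int,
             weight_bytes: float = 2.0, kv_multiplier: float = 0.125) -> bool:
        weights = params_b * 1e9 * weight_bytes / 1024**3
        kv_cache = context * params_b * kv_multiplier / 1024
        return (weights + kv_cache) * 1.15 <= vram_gb   # 15% overhead (assumed)

    # 7B @ 4K, FP16 weights, conservative MHA cache on a 16 GB card:
    print(fits(16, 7, 4096))                        # False (tight under MHA)
    # The same model assuming a GQA-8 architecture, as most modern 7B models are:
    print(fits(16, 7, 4096, kv_multiplier=0.016))   # True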

Important Considerations:

  • Model weights quantization: Reduces model size by 50-87.5% (INT8 to INT2)
  • KV cache is NOT quantized by default - remains at FP16 precision
  • Batch processing: Each concurrent request needs its own KV cache (see the sketch after this list)
  • Flash Attention 2: Can reduce memory by up to 20% on supported GPUs
  • Paged Attention: vLLM and similar can optimize KV cache memory usage
  • CPU offloading: Can use system RAM but 10-100x slower
  • Always reserve 10-20% VRAM for PyTorch overhead and activations
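
As a sketch of the batch-size point above, total VRAM grows by one KV cache per concurrent request; the 15% overhead reserve is again an assumed figure:

    # Sketch: total VRAM for a batch of concurrent requests; overhead assumed at 15%.
    def total_vram_gb(params_b: float, context: int, batch_size: int = 1,
                      weight_bytes: float = 2.0, kv_multiplier: float = 0.125,
                      overhead: float = 0.15) -> float:
        weights = params_b * 1e9 * weight_bytes / 1024**3
        kv_per_request = context * params_b * kv_multiplier / 1024  # stays FP16
        return (weights + batch_size * kv_per_request) * (1 + overhead)

    # Four concurrent 4K-context requests against a 7B model with INT4 weights:
    print(round(total_vram_gb(7, 4096, batch_size=4, weight_bytes=0.5), 1))  # ~19.8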

Architecture-Specific Memory Optimizations:

Modern architectures use various techniques to reduce memory usage:

Architecture   | Optimization           | KV Reduction
Standard MHA   | None                   | 1x (baseline)
GQA-8          | 8 key-value groups     | ~8x reduction
MQA            | Single key-value head  | ~32x reduction
Sliding Window | Limited attention span | Caps at window size

Example: Llama 3 70B uses GQA with 8 key-value groups, cutting its 32K-context KV cache from roughly 80 GB (hypothetical full MHA) to about 10 GB!
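
A sketch of where those numbers come from, using the per-token KV formula with the published Llama 3 70B configuration (80 layers, 64 attention heads, 8 key-value heads, head dimension 128):

    # Sketch: per-token KV cache = 2 x layers x kv_heads x head_dim x bytes.
    def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                    context: int, bytes_per_value: float = 2.0) -> float:
        per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
        return per_token * context / 1024**3

    mha = kv_cache_gb(80, 64, 128, 32768)   # hypothetical full MHA: ~80 GB
    gqa = kv_cache_gb(80, 8, 128, 32768)    # actual GQA-8 layout:   ~10 GB
    print(round(mha, 1), round(gqa, 1), round(mha / gqa, 1))  # 80.0 10.0 8.0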

Legend:

  • Confirmed: Tokenization method verified through official documentation
  • Likely: Tokenization method inferred from architecture patterns