Ollama Models Directory

Model Analysis for Tokenization Method Classification - With Accurate VRAM Calculations

Note: Due to CORS restrictions, this page displays a snapshot of Ollama models. To refresh data, use a backend proxy or browser extension to bypass CORS.

What This Tool Provides:

Understanding tokenization and VRAM requirements is crucial for deploying LLMs effectively. Use this tool to plan your infrastructure and optimize your model selection.

About Tokenization Classification

Different models use various tokenization methods that affect context length calculations:

Verified Information: This classification is based on official documentation and technical reports from each model provider.


Confirmed Tokenization Methods by Model Family

SentencePiece:

  • LLaMA 1 & 2 (LLaMA 3 moved to a tiktoken-style BPE tokenizer)
  • Mistral (pre-NeMo)
  • Gemma (Google)
  • T5 models

BPE (Byte-Pair Encoding):

  • Qwen (byte-level BPE)
  • DeepSeek (byte-level BPE)
  • Granite/IBM (StarCoder BPE)
  • Cohere (Command, Aya)
  • Phi (earlier versions)
  • StarCoder
  • Falcon

Tiktoken (BPE variant):

  • Mistral NeMo & newer (Tekken tokenizer)
  • LLaMA 3 and later
  • Phi-4
  • GPT models

WordPiece:

  • BERT-based models
  • Some embedding models
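
To double-check how a given family tokenizes text, one quick way is to load its tokenizer and inspect the class and output. A minimal sketch using the Hugging Face transformers library (repository IDs are examples only, some repositories are gated, and network access is assumed):

    # Minimal sketch, assuming the `transformers` package and network access.
    # Repository IDs are illustrative; some (e.g. Gemma, LLaMA) are gated.
    from transformers import AutoTokenizer

    for repo_id in ["Qwen/Qwen2-7B", "google/gemma-2b"]:
        tok = AutoTokenizer.from_pretrained(repo_id)
        # The tokenizer class name hints at the method (e.g. GemmaTokenizerFast
        # wraps a SentencePiece model, Qwen2TokenizerFast is byte-level BPE).
        print(repo_id, type(tok).__name__, "vocab:", tok.vocab_size)
        print(tok.tokenize("Tokenization affects context length."))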

VRAM Usage Calculator for Context Length

Accurate VRAM Calculation Formulas:

Model Weights VRAM:
Model VRAM (GB) = num_parameters × bytes_per_param / 1024³
(for example, a 7B-parameter model at FP16: 7×10⁹ × 2 / 1024³ ≈ 13 GB)

• FP16: 2 bytes per parameter
• INT8: 1 byte per parameter
• INT4: 0.5 bytes per parameter
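
A minimal sketch of the weights formula above (function and constant names are my own, not part of the tool):

    # Sketch of the model-weights formula; names are illustrative.
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    def model_weights_gb(num_params: float, precision: str = "fp16") -> float:
        """VRAM needed for the weights alone, in GB."""
        return num_params * BYTES_PER_PARAM[precision] / 1024**3

    print(round(model_weights_gb(7e9, "fp16"), 2))  # ~13.04
    print(round(model_weights_gb(7e9, "int4"), 2))  # ~3.26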

KV Cache (Context) VRAM:
KV Cache VRAM (GB) = 2 × num_layers × hidden_dim × context_length × bytes_per_value / 1024³

Simplified approximation based on model size:

KV Cache VRAM ≈ context_length × model_size_B × kv_multiplier / 1024

Where kv_multiplier varies by architecture:
• Standard MHA: ~0.125 MB per token per B
• GQA (8 groups): ~0.016 MB per token per B
• MQA: ~0.008 MB per token per B

Technical Note: Modern models often use optimizations like Grouped Query Attention (GQA) or Multi-Query Attention (MQA) which significantly reduce KV cache size. This calculator uses conservative estimates for standard Multi-Head Attention unless specified otherwise.
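
Both KV-cache formulas can be sketched the same way. The multipliers are the ones listed above; the example configuration (32 layers, hidden dimension 4096 for a 7B-class model) is an assumption for illustration:

    # Sketch of both KV-cache formulas above (MHA layout); names are illustrative.
    def kv_cache_gb_exact(num_layers: int, hidden_dim: int, context_length: int,
                          bytes_per_value: float = 2.0) -> float:
        """Exact formula: 2 (K and V) x layers x hidden_dim x tokens x bytes."""
        return 2 * num_layers * hidden_dim * context_length * bytes_per_value / 1024**3

    KV_MULTIPLIER_MB = {"mha": 0.125, "gqa8": 0.016, "mqa": 0.008}

    def kv_cache_gb_approx(context_length: int, model_size_b: float,
                           attention: str = "mha") -> float:
        """Simplified approximation: MB per token per billion parameters."""
        return context_length * model_size_b * KV_MULTIPLIER_MB[attention] / 1024

    # A 7B-class model (assumed 32 layers, hidden 4096) at 8K context, FP16 cache:
    print(round(kv_cache_gb_exact(32, 4096, 8192), 2))    # ~4.0
    print(round(kv_cache_gb_approx(8192, 7, "mha"), 2))   # ~7.0 (conservative)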

Tokenization Efficiency Factors:

Different tokenization methods require different numbers of tokens for the same text:

  • Efficient (1.0x): Tiktoken, Modern BPE (Qwen, DeepSeek) - baseline
  • Standard (1.1x): SentencePiece (LLaMA, Gemma) - 10% more tokens
  • Less Efficient (1.2x): Older BPE, multilingual models - 20% more tokens
  • Least Efficient (1.3x): Character-heavy languages, older models - 30% more tokens

Note: Higher factors mean MORE tokens needed for the same text, thus MORE VRAM usage.
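
A sketch of how these factors change the token (and therefore KV-cache) budget for the same text; the ~0.75 words-per-token baseline is an assumption, not something the tool specifies:

    # Sketch: applying the efficiency factors above to a word budget.
    EFFICIENCY_FACTOR = {"tiktoken": 1.0, "sentencepiece": 1.1,
                         "older_bpe": 1.2, "char_heavy": 1.3}

    def estimated_tokens(word_count: int, method: str) -> int:
        baseline_tokens = word_count / 0.75     # rough English baseline (assumed)
        return int(baseline_tokens * EFFICIENCY_FACTOR[method])

    # The same 6,000-word document needs more tokens (and more VRAM)
    # under a less efficient tokenizer:
    for method in EFFICIENCY_FACTOR:
        print(method, estimated_tokens(6000, method))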


Quick Reference Table (7B Model, Standard Tokenization, MHA):

Context Length | KV Cache VRAM | Total @ FP16 | Total @ INT4

* Total includes: Model weights + KV Cache + Overhead. INT4 quantization only affects model weights.
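
The rows of this table can be reproduced from the formulas above. A sketch, using the conservative MHA multiplier and an assumed 15% overhead allowance (the exact overhead figure and the function name are my own):

    # Sketch reproducing the 7B quick-reference rows; 15% overhead is assumed.
    def table_row(context_length: int) -> str:
        weights_fp16 = 7e9 * 2.0 / 1024**3
        weights_int4 = 7e9 * 0.5 / 1024**3
        kv = context_length * 7 * 0.125 / 1024       # MHA, FP16 KV cache
        total_fp16 = (weights_fp16 + kv) * 1.15
        total_int4 = (weights_int4 + kv) * 1.15      # KV cache stays FP16
        return (f"{context_length:>7} | {kv:5.1f} GB | "
                f"{total_fp16:5.1f} GB | {total_int4:5.1f} GB")

    print("Context | KV Cache | Total @ FP16 | Total @ INT4")
    for ctx in (2048, 4096, 8192, 16384, 32768):
        print(table_row(ctx))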

GPU Recommendations by Use Case:

RTX 4060 Ti (16GB): 7B models @ 4K context (FP16), 13B models @ 4K context (INT4)
RTX 4070 Ti Super (16GB): 7B models @ 8K context (FP16), 13B models @ 8K context (INT4)
RTX 4090 (24GB): 13B models @ 8K context (INT8), 30B models @ 4K context (INT4)
A100 (40GB): 30B models @ 8K context (INT8), 70B models @ 4K context (INT4)
A100 (80GB) / H100: 70B models @ 8K context (INT8), 70B models @ 32K context (INT4)

FP16 weights alone exceed these cards above the 7B class (13B ≈ 26 GB, 30B ≈ 60 GB, 70B ≈ 140 GB), so the larger entries assume INT8 or INT4 quantization.

Note: These recommendations include overhead for inference. Actual performance may vary based on batch size and specific optimizations.
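
A sketch of how the recommendations above can be sanity-checked against the formulas; the 15% overhead reserve is an assumption, and the default multiplier is the conservative MHA figure:

    # Sketch: does a model/context/precision combination fit a given GPU?
    def fits(vram_gb: float, params_b: float, context: int,
             weight_bytes: float = 2.0, kv_multiplier: float = 0.125) -> bool:
        weights = params_b * 1e9 * weight_bytes / 1024**3
        kv_cache = context * params_b * kv_multiplier / 1024
        return (weights + kv_cache) * 1.15 <= vram_gb   # 15% overhead (assumed)

    # 7B @ 4K, FP16 weights, conservative MHA cache on a 16 GB card:
    print(fits(16, 7, 4096))                        # False (tight under MHA)
    # The same model assuming a GQA-8 architecture, as most modern 7B models are:
    print(fits(16, 7, 4096, kv_multiplier=0.016))   # True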

Important Considerations:

  • Model weights quantization: Reduces model size by 50-87.5% (INT8 to INT2)
  • KV cache is NOT quantized by default - remains at FP16 precision
  • Batch processing: Each concurrent request needs its own KV cache (see the sketch after this list)
  • Flash Attention 2: Can reduce memory by up to 20% on supported GPUs
  • Paged Attention: vLLM and similar can optimize KV cache memory usage
  • CPU offloading: Can use system RAM but 10-100x slower
  • Always reserve 10-20% VRAM for PyTorch overhead and activations
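
As a sketch of the batch-size point above, total VRAM grows by one KV cache per concurrent request; the 15% overhead reserve is again an assumed figure:

    # Sketch: total VRAM for a batch of concurrent requests; overhead assumed at 15%.
    def total_vram_gb(params_b: float, context: int, batch_size: int = 1,
                      weight_bytes: float = 2.0, kv_multiplier: float = 0.125,
                      overhead: float = 0.15) -> float:
        weights = params_b * 1e9 * weight_bytes / 1024**3
        kv_per_request = context * params_b * kv_multiplier / 1024  # stays FP16
        return (weights + batch_size * kv_per_request) * (1 + overhead)

    # Four concurrent 4K-context requests against a 7B model with INT4 weights:
    print(round(total_vram_gb(7, 4096, batch_size=4, weight_bytes=0.5), 1))  # ~19.8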

Architecture-Specific Memory Optimizations:

Modern architectures use various techniques to reduce memory usage:

Architecture   | Optimization           | KV Reduction
Standard MHA   | None                   | 1x (baseline)
GQA-8          | 8 key-value groups     | ~8x reduction
MQA            | Single key-value head  | ~32x reduction
Sliding Window | Limited attention span | Caps at window size

Example: Llama 3 70B uses GQA with 8 key-value groups, cutting its 32K-context KV cache from roughly 80 GB (hypothetical full MHA) to about 10 GB!
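
A sketch of where those numbers come from, using the per-token KV formula with the published Llama 3 70B configuration (80 layers, 64 attention heads, 8 key-value heads, head dimension 128):

    # Sketch: per-token KV cache = 2 x layers x kv_heads x head_dim x bytes.
    def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                    context: int, bytes_per_value: float = 2.0) -> float:
        per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
        return per_token * context / 1024**3

    mha = kv_cache_gb(80, 64, 128, 32768)   # hypothetical full MHA: ~80 GB
    gqa = kv_cache_gb(80, 8, 128, 32768)    # actual GQA-8 layout:   ~10 GB
    print(round(mha, 1), round(gqa, 1), round(mha / gqa, 1))  # 80.0 10.0 8.0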

Legend:

  • Confirmed: Tokenization method verified through official documentation
  • Likely: Tokenization method inferred from architecture patterns