DeepSeek R1

Internals

Posted by Peter Lau on February 9, 2025

Model Architecture

[Figure: DeepSeek R1 architecture diagram]

MoE Architecture

A mixture-of-experts (MoE) layer routes each token to only a small subset of expert networks, so only a fraction of the model's weight parameters is active on any forward pass. This keeps per-token compute low, and thus improves GPU resource utilization, while preserving a very large total parameter count.
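The routing idea can be sketched as follows. This is a minimal, hypothetical illustration in NumPy, not DeepSeek's actual implementation: the dimensions are toy-sized, and the router here is a plain top-k softmax gate.

```python
import numpy as np

# Toy sizes for illustration only; the real model is far larger.
D_MODEL, N_EXPERTS, TOP_K = 8, 4, 2

rng = np.random.default_rng(0)
router_w = rng.standard_normal((D_MODEL, N_EXPERTS))
# Each "expert" here is just a single feed-forward weight matrix.
experts = rng.standard_normal((N_EXPERTS, D_MODEL, D_MODEL))

def moe_forward(x):
    """Route one token vector through only TOP_K of the N_EXPERTS."""
    logits = x @ router_w                        # router scores, (N_EXPERTS,)
    top = np.argsort(logits)[-TOP_K:]            # indices of selected experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                           # softmax over selected experts
    # Only the selected experts' weights participate in the computation.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

token = rng.standard_normal(D_MODEL)
out = moe_forward(token)
```

The key point is in the last line of `moe_forward`: the unselected experts are never multiplied, which is why activated parameters (and FLOPs) stay a small fraction of total parameters.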

MLA (Multi-Head Latent Attention)

Latent vector compression applies a low-rank decomposition to the K and V projections: hidden states are projected down to a small shared latent vector, which is what gets cached, and the full keys and values are reconstructed from it when attention is computed. This significantly reduces KV cache size compared to storing the full K and V matrices.
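A minimal sketch of the caching trade-off, with hypothetical dimensions and single-head projections (the real scheme operates per attention head and interacts with positional encoding):

```python
import numpy as np

# Hypothetical dimensions for illustration; D_LATENT << D_MODEL.
D_MODEL, D_LATENT, SEQ_LEN = 64, 8, 16

rng = np.random.default_rng(0)
W_down = rng.standard_normal((D_MODEL, D_LATENT)) / np.sqrt(D_MODEL)
W_up_k = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)
W_up_v = rng.standard_normal((D_LATENT, D_MODEL)) / np.sqrt(D_LATENT)

hidden = rng.standard_normal((SEQ_LEN, D_MODEL))

# Cache only the low-rank latent, not the full K and V matrices.
latent_cache = hidden @ W_down               # (SEQ_LEN, D_LATENT)

# K and V are re-expanded from the cached latent at attention time.
K = latent_cache @ W_up_k                    # (SEQ_LEN, D_MODEL)
V = latent_cache @ W_up_v                    # (SEQ_LEN, D_MODEL)

full_cache = 2 * SEQ_LEN * D_MODEL           # floats to store K and V separately
mla_cache = SEQ_LEN * D_LATENT               # floats to store the shared latent
print(f"cache reduction: {full_cache / mla_cache:.0f}x")  # prints "cache reduction: 16x"
```

With these toy numbers the latent cache is 16x smaller; the actual ratio depends on the chosen latent rank relative to the per-head K/V width.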

Transformer Layer Optimization

Details on optimization techniques for transformer layers.

R1 Production Workflow

[Figure: DeepSeek R1 production workflow diagram]
