Model Architecture
MoE Architecture
A Mixture-of-Experts (MoE) layer replaces the dense feed-forward block with many expert networks; a learned router activates only a small subset of experts, and thus only a fraction of the weight parameters, for each token. This keeps total model capacity large while cutting the compute required per token.
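As an illustration, here is a minimal NumPy sketch of top-k expert routing. All names and sizes (`n_experts`, `top_k`, the random expert weights) are hypothetical, and the per-token loop stands in for the batched dispatch a real implementation would use; the point is only that each token touches the weights of just `top_k` of the `n_experts` experts.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 16, 32, 8, 2

# Hypothetical expert FFNs: each expert is a pair of weight matrices.
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
router_w = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):
    """Route each token to its top-k experts; only those experts run."""
    logits = x @ router_w                            # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen expert ids
    # Softmax over the selected logits only (renormalized gates).
    sel = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # per token
        for j in range(top_k):
            w1, w2 = experts[topk[t, j]]             # only top_k experts used
            out[t] += gates[t, j] * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

x = rng.standard_normal((4, d_model))
y = moe_forward(x)
print(y.shape)  # (4, 16); only 2 of 8 experts' weights are touched per token
```

With `top_k = 2` of 8 experts, each token's forward pass uses roughly a quarter of the expert parameters, which is the source of the compute savings described above.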
MLA (Multi-head Latent Attention)
Latent-vector compression applies a low-rank decomposition to the K and V projections: only a small shared latent vector per token is cached, and K and V are reconstructed from it at attention time. This significantly reduces KV-cache size compared to storing the full K and V matrices.
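The cache saving can be sketched in a few lines of NumPy. This is a simplified, hypothetical setup (the dimensions and projection names `w_down`, `w_up_k`, `w_up_v` are illustrative, and details such as RoPE handling are omitted); it only shows the core idea of caching the latent vectors instead of full K and V.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, d_head, d_latent = 128, 256, 8, 32, 64

x = rng.standard_normal((seq_len, d_model))

# Hypothetical low-rank factorization: one shared down-projection into the
# latent space, and separate up-projections to recover K and V.
w_down = rng.standard_normal((d_model, d_latent)) * 0.02
w_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
w_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

# Cache only the latent vectors instead of full K and V.
latent_cache = x @ w_down                  # (seq_len, d_latent)

# At attention time, K and V are reconstructed from the latent cache.
k = latent_cache @ w_up_k                  # (seq_len, n_heads * d_head)
v = latent_cache @ w_up_v

full_kv_floats = 2 * seq_len * n_heads * d_head   # standard KV cache
latent_floats = seq_len * d_latent                # latent cache
print(full_kv_floats, latent_floats, full_kv_floats / latent_floats)
```

With these toy dimensions the cache shrinks by a factor of `2 * n_heads * d_head / d_latent` = 8; the real ratio depends on the chosen latent dimension.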
Transformer Layer Optimization
Details on optimization techniques for transformer layers.
R1 Production Workflow