Deploying Deepseek models on Intel® Gaudi® accelerators using vLLM

Deepseek is a family of models built on the Deepseek Mixture of Experts (MoE) architecture with Multi-Head Latent Attention (MLA). Its weights are natively stored in FP8 with block quantization scales. It comes in two forms: V3, a standard model, and R1, a reasoning model with the same architecture and memory footprint. Both can be run on Intel Gaudi2 and Intel Gaudi3 accelerators.
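As a rough sketch of what such a deployment can look like, the snippet below uses vLLM's offline Python API to load a Deepseek checkpoint and generate from a single prompt. The checkpoint name, the tensor-parallel degree, and the assumption that a Gaudi-enabled vLLM build (HPU backend) is installed are illustrative choices, not details taken from this post; adjust them to match your environment.

```python
# Minimal sketch, assuming a vLLM build with Intel Gaudi (HPU) support is installed
# and that the checkpoint name below points to a Deepseek FP8 model you have access to.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # assumed checkpoint; substitute your own
    tensor_parallel_size=8,           # illustrative: shard the model across 8 Gaudi cards
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain Mixture of Experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```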
