DeepSeek’s Hardware Innovations
Overcoming H800 Constraints at Scale

This article examines DeepSeek’s performance optimizations, comparing H800 vs. H100 GPUs, CUDA vs. PTX programming, and the model-level techniques used to overcome hardware constraints.
Recent discussions around DeepSeek have centered on the remarkable cost efficiency of their large language model (LLM) training runs and their ability to match, or even exceed, state-of-the-art model performance. While most of the headlines emphasize DeepSeek’s architectural breakthroughs (Mixture of Experts, Multi-Head Latent Attention, etc.), the hidden story is their astonishing level of hardware and systems optimization. DeepSeek trained a highly capable, multi-hundred-billion-parameter model (DeepSeek-V3) on a cluster of Nvidia H800 GPUs, a variant of the Hopper family whose chip-to-chip interconnect bandwidth is cut back to comply with U.S. export controls.
This article takes a technical look at how DeepSeek managed to push H800 hardware well beyond what most AI practitioners assumed possible. We’ll walk through three interlocking pieces of DeepSeek’s approach, then put them together:
Cluster Architecture & GPU Interconnect Optimizations
Low-Level PTX Customizations
Model-Level Techniques to Amortize the H800’s Bandwidth Constraints
Putting It All Together
1. Cluster Architecture & GPU Interconnect Optimizations
The H800 Bottleneck
By design, Nvidia’s H800 has significantly reduced chip-to-chip interconnect (NVLink) bandwidth compared to the top-tier H100. The same overall GPU architecture (Hopper) underpins both models, but the cut-down interconnect is intended to comply with U.S. government export controls, making the H800 far less attractive for large-scale training. Lower interconnect bandwidth means slower data movement between GPUs, which translates into poor scaling when you try to spin up extremely large training clusters.
DeepSeek, however, not only used H800s in a large cluster but also achieved a training time that rivaled (and in some respects beat) typical large-scale H100-based runs. According to DeepSeek’s own paper, they trained their DeepSeek-V3 model—14.8 trillion tokens processed—at a total cost of 2.8 million H800 GPU hours. That is stunningly low, especially on hardware that was presumed suboptimal for HPC workloads.
Table 1: Performance and Cost Comparison of H800 and H100 GPUs
Dedicated Communication Units
The first trick: DeepSeek reprogrammed a subset of Streaming Multiprocessors (SMs) on each H800—20 out of the GPU’s 132 units—to handle inter-GPU communications. Typically, HPC clusters rely on tight coupling between GPUs (e.g., NVLink topologies, InfiniBand networks) with minimal user-side tinkering beyond vendor-provided frameworks like NCCL. DeepSeek’s approach instead carved out specific GPU SMs exclusively for cross-GPU data routing. This is not the normal HPC pattern.
By assigning 20 SMs to communication tasks, DeepSeek lets data shards and model “experts” rapidly sync the gradient or activation information they need between GPUs. Communication overhead is offloaded and parallelized so that the remaining compute SMs keep doing useful work rather than idling.
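DeepSeek has not published this scheduling code, but the shape of the idea can be sketched in plain CUDA: launch one persistent block per SM and split the grid by role, with a fixed slice of blocks acting as communication workers while the rest do compute. Everything below (the COMM_BLOCKS constant, the staging buffers, the dummy compute loop) is an illustrative assumption, not DeepSeek’s implementation.

```cpp
// Minimal sketch (not DeepSeek's actual code): a persistent kernel whose grid
// is split by role. Blocks 0..COMM_BLOCKS-1 stream data between staging
// buffers, standing in for cross-GPU transfers, while the remaining blocks run
// compute. With one resident block per SM, this roughly approximates reserving
// SMs for communication.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TOTAL_BLOCKS = 132;  // SM count on an H800/H100 SXM part
constexpr int COMM_BLOCKS  = 20;   // blocks (~SMs) reserved for data movement
constexpr int THREADS      = 256;
constexpr int N            = 1 << 20;

__global__ void rolePartitionedKernel(const float* __restrict__ sendBuf,
                                      float* __restrict__ recvBuf,
                                      float* __restrict__ work, int n) {
    if (blockIdx.x < COMM_BLOCKS) {
        // "Communication" role: a grid-stride copy standing in for an
        // inter-GPU transfer (the real thing would drive NVLink/RDMA traffic).
        int stride = COMM_BLOCKS * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            recvBuf[i] = sendBuf[i];
    } else {
        // Compute role: a stand-in for GEMM/attention work on the other SMs.
        int cid    = blockIdx.x - COMM_BLOCKS;
        int stride = (gridDim.x - COMM_BLOCKS) * blockDim.x;
        for (int i = cid * blockDim.x + threadIdx.x; i < n; i += stride)
            work[i] = fmaf(work[i], 1.0001f, 0.5f);
    }
}

int main() {
    float *sendBuf, *recvBuf, *work;
    cudaMalloc(&sendBuf, N * sizeof(float));
    cudaMalloc(&recvBuf, N * sizeof(float));
    cudaMalloc(&work,    N * sizeof(float));
    rolePartitionedKernel<<<TOTAL_BLOCKS, THREADS>>>(sendBuf, recvBuf, work, N);
    printf("kernel finished: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(sendBuf); cudaFree(recvBuf); cudaFree(work);
    return 0;
}
```

In the real system, the communication blocks would drive NVLink/InfiniBand transfers through low-level primitives rather than a device-memory copy, and the partition is enforced at the PTX level rather than by block index alone.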
Going Beyond CUDA
Under the hood, this custom partitioning required dropping below CUDA into PTX (Parallel Thread Execution), Nvidia’s low-level GPU intermediate assembly language. CUDA proper typically does not let developers reserve entire SMs for specialized tasks or directly orchestrate the cross-SM dispatch in “bare metal” fashion. DeepSeek’s PTX-level modifications let them:
Manage cross-chip data flows with finer granularity.
Minimize communication stalls by scheduling data transfers in parallel with forward/backward passes on the rest of the chip (a coarse stream-level sketch of this overlap follows below).
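Stock CUDA can already express a coarse version of this overlap using streams, which is a useful mental model even though DeepSeek’s version operates at much finer PTX granularity. The sketch below is illustrative only: the buffer names and sizes are made up, and a device-to-host copy stands in for an inter-GPU transfer running concurrently with compute.

```cpp
// Coarse CUDA-level sketch of the overlap idea (DeepSeek does this with far
// finer PTX-level control): a copy on one stream stands in for an inter-GPU
// transfer and runs concurrently with a compute kernel on another stream, so
// the communication hides behind the forward/backward pass instead of stalling it.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void busyCompute(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = data[i];
    for (int k = 0; k < iters; ++k) x = fmaf(x, 1.000001f, 1e-6f);  // stand-in for GEMM work
    data[i] = x;
}

int main() {
    const int n = 1 << 22;
    float *work, *d_send, *h_recv;
    cudaMalloc(&work,   n * sizeof(float));
    cudaMalloc(&d_send, n * sizeof(float));
    cudaMallocHost(&h_recv, n * sizeof(float));   // pinned buffer, proxy for the "other GPU"

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // Launch compute and the "communication" copy concurrently on separate streams.
    busyCompute<<<(n + 255) / 256, 256, 0, compute>>>(work, n, 2000);
    cudaMemcpyAsync(h_recv, d_send, n * sizeof(float), cudaMemcpyDeviceToHost, comm);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(comm);
    printf("compute and transfer overlapped\n");

    cudaStreamDestroy(compute); cudaStreamDestroy(comm);
    cudaFree(work); cudaFree(d_send); cudaFreeHost(h_recv);
    return 0;
}
```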
Implications
Training Efficiency: DeepSeek’s cluster design keeps GPUs saturated with useful work and drastically reduces idle time.
Scalability: It becomes easier to scale out to tens of thousands of H800 GPUs, because each GPU handles a portion of interconnect overhead internally.
Hardware-Led Software Rethink: Nearly every HPC shop relies on automatic GPU collectives (NCCL, MPI). DeepSeek demonstrates how a small team with sufficient low-level GPU knowledge can re-engineer HPC communication patterns in ways that standard libraries do not support.
2. Low-Level PTX Customizations
PTX as GPU Assembly
Nvidia’s tooling ecosystem has an additional layer below CUDA: the PTX intermediate representation. Think of PTX as assembly language for Nvidia GPUs. While CUDA is a high-level C++-like language with GPU semantics (kernels, threads, blocks), PTX gives near-total control over how instructions map to registers, SMs, and the warp scheduler.
Table 2: Comparison of CUDA (High-Level) and PTX (Low-Level) Programming
DeepSeek’s approach is hyper-optimized around the H800’s specific bandwidth limitations. At the PTX level they can selectively reorder instructions and stage data in ways a typical CUDA compiler would not (or would optimize away when compiling for the general case).
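CUDA’s escape hatch for this kind of control is inline PTX. As a hedged illustration (not DeepSeek’s code), the kernel below uses the PTX prefetch.global.L2 instruction to ask the cache hierarchy to stage the next element while the current one is being processed; the surrounding kernel and the prefetch distance are illustrative choices.

```cpp
// Illustrative only: inline PTX lets a CUDA kernel issue cache-control
// instructions the compiler would not normally emit. Here each thread asks the
// L2 cache to prefetch the element it will need on the *next* loop iteration.
#include <cuda_runtime.h>
#include <cstdio>

__device__ __forceinline__ void prefetch_l2(const void* p) {
    asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
}

__global__ void scaleWithPrefetch(const float* __restrict__ in,
                                  float* __restrict__ out, int n, float s) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        if (i + stride < n)
            prefetch_l2(&in[i + stride]);   // stage the next element in L2
        out[i] = in[i] * s;                 // compute on the current element
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    scaleWithPrefetch<<<256, 256>>>(in, out, n, 2.0f);
    printf("done: %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(in); cudaFree(out);
    return 0;
}
```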
Overcoming the H800’s Bandwidth Limits
When bandwidth is the constraint, whether across the GPU fabric or between memory and the compute cores, the bottleneck is no longer floating-point throughput per se; it’s how fast you can move data to where it is needed. DeepSeek addressed this on several fronts (the FP8 point is sketched in code after the list):
Customized Caching and Prefetch: Using PTX instructions to hint how data is staged in GPU caches (L1, L2) before compute kernels need it.
Streamlined FP8 Pipelines: DeepSeek’s model uses 8-bit floating-point (FP8) for calculations. FP8 drastically reduces how many bytes must move across the bus each cycle; with PTX, they can carefully handle rounding and accumulation to avoid normal precision-lowering pitfalls.
Dedicated SMs for Data Transfer (noted above): In PTX, they orchestrate these “comm SMs” to keep data movement from overshadowing core compute.
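To make the FP8 point concrete, here is a minimal sketch, assuming CUDA 11.8 or newer for the <cuda_fp8.h> header and its __nv_fp8_e4m3 type: operands are stored as 1-byte e4m3 values, so far fewer bytes cross the memory bus, while the dot product accumulates in FP32 to limit precision loss. DeepSeek’s actual pipeline adds block-wise scaling factors and periodic high-precision promotion, which are omitted here.

```cpp
// Hedged sketch of the FP8 idea (requires CUDA >= 11.8 for <cuda_fp8.h>):
// values are stored as 1-byte e4m3 numbers, cutting the bytes moved per
// element, while the dot product accumulates in FP32 to avoid the usual
// low-precision accumulation error.
#include <cuda_fp8.h>
#include <cuda_runtime.h>
#include <cstdio>

__global__ void fp8DotFp32Acc(const __nv_fp8_e4m3* a,
                              const __nv_fp8_e4m3* b,
                              float* result, int n) {
    __shared__ float partial[256];
    float acc = 0.0f;                       // high-precision accumulator
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += float(a[i]) * float(b[i]);   // 1-byte loads, FP32 math
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // block-wide reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *result = partial[0];
}

int main() {
    const int n = 4096;
    __nv_fp8_e4m3 *a, *b;
    float *result;
    cudaMallocManaged(&a, n * sizeof(__nv_fp8_e4m3));
    cudaMallocManaged(&b, n * sizeof(__nv_fp8_e4m3));
    cudaMallocManaged(&result, sizeof(float));
    for (int i = 0; i < n; ++i) {           // small, exactly representable FP8 values
        a[i] = __nv_fp8_e4m3(0.5f);
        b[i] = __nv_fp8_e4m3(0.25f);
    }
    fp8DotFp32Acc<<<1, 256>>>(a, b, result, n);
    cudaDeviceSynchronize();
    printf("dot = %f (expected %f)\n", *result, 0.5f * 0.25f * n);
    cudaFree(a); cudaFree(b); cudaFree(result);
    return 0;
}
```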
The 80/20 (or 132/20) Principle
By dedicating a fraction of the GPU’s SMs to communications, DeepSeek effectively turned each H800 into a mini-distributed node with built-in routing logic. The tradeoff is fewer SMs for raw matrix multiplication on each GPU, but the net effect is that far more of the cluster’s FLOPs are actually put to productive use. In HPC terms: better scaling efficiency + fewer idle cycles = massive real-world gains.
3. Model-Level Techniques to Amortize the H800’s Bandwidth Constraints
DeepSeekMoE: Fine-Grained Mixture of Experts
DeepSeek’s MoE (Mixture of Experts) approach is twofold:
Expert “Shards”: Instead of activating one monolithic block of all its parameters for every token, DeepSeekMoE routes each token to a small subset of “experts.” V3 has a total of 671B parameters, but only 37B are activated per token.
Load Balancing: They also refined how the training pipeline routes tokens across experts so as to keep GPUs loaded evenly. A naive MoE approach can incur heavy communication overhead, and large swaths of the model may sit idle for certain token classes. DeepSeek’s specialized load balancing keeps each GPU processing a roughly equal share of tokens at every step (a simplified routing sketch follows below).
DeepSeek’s Hardware Innovations: Overcoming H800 Constraints with Expert Sharding and Load Balancing
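Below is a heavily simplified sketch of the routing and load-bookkeeping shape, not DeepSeek’s implementation: the expert count, the top-2 selection, and the routeTokens kernel are illustrative assumptions, whereas DeepSeekMoE itself uses many fine-grained routed experts plus shared experts and a more sophisticated balancing strategy.

```cpp
// Simplified MoE routing sketch: each thread routes one token by picking the
// top-2 scoring experts from its router logits and bumps a per-expert counter
// so the host can watch the load distribution.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int NUM_EXPERTS = 64;   // illustrative expert count
constexpr int TOP_K = 2;          // illustrative routing fan-out

__global__ void routeTokens(const float* logits,   // [numTokens][NUM_EXPERTS]
                            int* chosen,            // [numTokens][TOP_K]
                            int* expertLoad,        // [NUM_EXPERTS]
                            int numTokens) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTokens) return;
    const float* row = logits + t * NUM_EXPERTS;

    int   bestIdx[TOP_K] = {-1, -1};
    float bestVal[TOP_K] = {-1e30f, -1e30f};
    for (int e = 0; e < NUM_EXPERTS; ++e) {          // small insertion-style top-2
        float v = row[e];
        if (v > bestVal[0])      { bestVal[1] = bestVal[0]; bestIdx[1] = bestIdx[0];
                                   bestVal[0] = v;          bestIdx[0] = e; }
        else if (v > bestVal[1]) { bestVal[1] = v;          bestIdx[1] = e; }
    }
    for (int k = 0; k < TOP_K; ++k) {
        chosen[t * TOP_K + k] = bestIdx[k];
        atomicAdd(&expertLoad[bestIdx[k]], 1);       // per-expert load statistics
    }
}

int main() {
    const int numTokens = 1 << 14;
    float* logits; int *chosen, *load;
    cudaMallocManaged(&logits, numTokens * NUM_EXPERTS * sizeof(float));
    cudaMallocManaged(&chosen, numTokens * TOP_K * sizeof(int));
    cudaMallocManaged(&load,   NUM_EXPERTS * sizeof(int));
    for (int i = 0; i < numTokens * NUM_EXPERTS; ++i)
        logits[i] = (float)((i * 2654435761u) % 1000) / 1000.0f;  // pseudo-random scores
    cudaMemset(load, 0, NUM_EXPERTS * sizeof(int));
    routeTokens<<<(numTokens + 255) / 256, 256>>>(logits, chosen, load, numTokens);
    cudaDeviceSynchronize();
    printf("expert 0 load: %d of %d token slots\n", load[0], numTokens * TOP_K);
    return 0;
}
```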
Why it matters for hardware:
The bandwidth cost of a given forward/backward pass is driven by the parameters that actually participate in it. By activating only a small fraction of the experts for each token, the amount of data that must be read and exchanged per token shrinks dramatically, which helps compensate for the H800’s constrained bandwidth.
The required cross-GPU communication to gather relevant “expert blocks” for each token is carefully distributed. This synergy with the hardware partitioning (20 SMs for comm) is what truly unlocks MoE’s efficiency.
DeepSeekMLA: Multi-Head Latent Attention
Large context windows are notoriously demanding on GPU memory and bandwidth. For every token, the attention mechanism normally caches a key and a value vector so that later tokens can attend to it. DeepSeek’s MLA technique compresses these key-value pairs on the fly. Instead of storing the full vectors, they:
Project them into a smaller latent space.
Only expand them back out when needed in subsequent attention steps.
This drastically cuts the read/write overhead. On H800s, it further frees up precious memory bandwidth for essential compute. By extension, the usable context size is larger without ballooning GPU memory usage.
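In spirit, MLA is a learned low-rank compression of the KV cache. The sketch below uses made-up dimensions (D_MODEL, D_LATENT) and plain matrix-vector kernels to show the two halves of the trick: cache a small latent per token, then re-expand keys (and values) only when an attention step needs them. The real method involves additional details, such as how positional information is handled, that are omitted here.

```cpp
// Rough sketch of the MLA caching idea: instead of caching full keys and
// values per token, cache a small latent c = W_down * h and re-expand
// k = W_uk * c (and v = W_uv * c) only when attention needs it.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int D_MODEL  = 1024;   // hidden size (illustrative)
constexpr int D_LATENT = 64;     // compressed KV latent size (illustrative)

// One block per token: latent[t] = W_down (D_LATENT x D_MODEL) * hidden[t].
__global__ void compressKV(const float* hidden, const float* W_down,
                           float* latentCache) {
    int t = blockIdx.x;
    for (int r = threadIdx.x; r < D_LATENT; r += blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < D_MODEL; ++c)
            acc += W_down[r * D_MODEL + c] * hidden[t * D_MODEL + c];
        latentCache[t * D_LATENT + r] = acc;   // only D_LATENT floats cached per token
    }
}

// Re-expand a key vector from the cached latent when an attention step needs it.
__global__ void expandKey(const float* latentCache, const float* W_uk,
                          float* keys) {
    int t = blockIdx.x;
    for (int r = threadIdx.x; r < D_MODEL; r += blockDim.x) {
        float acc = 0.0f;
        for (int c = 0; c < D_LATENT; ++c)
            acc += W_uk[r * D_LATENT + c] * latentCache[t * D_LATENT + c];
        keys[t * D_MODEL + r] = acc;
    }
}

int main() {
    const int numTokens = 512;
    float *hidden, *W_down, *W_uk, *latent, *keys;
    cudaMallocManaged(&hidden, numTokens * D_MODEL * sizeof(float));
    cudaMallocManaged(&W_down, D_LATENT * D_MODEL * sizeof(float));
    cudaMallocManaged(&W_uk,   D_MODEL * D_LATENT * sizeof(float));
    cudaMallocManaged(&latent, numTokens * D_LATENT * sizeof(float));
    cudaMallocManaged(&keys,   numTokens * D_MODEL * sizeof(float));
    compressKV<<<numTokens, 128>>>(hidden, W_down, latent);
    expandKey<<<numTokens, 128>>>(latent, W_uk, keys);
    printf("done: %s, cached latent is %dx smaller than a full key per token\n",
           cudaGetErrorString(cudaDeviceSynchronize()), D_MODEL / D_LATENT);
    return 0;
}
```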
Multi-Token Training / Denser Training Steps
Another widely noted optimization is DeepSeek’s multi-token prediction objective. Rather than predicting only the next token, the model predicts one or more additional future tokens at each position, which densifies the training signal of every iteration. This again helps the cluster maximize throughput on limited bandwidth: each communication round accomplishes more training work.
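A toy illustration of why this densifies training (the DEPTH constant and the placeholder log-probabilities are illustrative assumptions, not DeepSeek’s configuration): with a prediction depth of D, every sequence position contributes D loss terms per step, so each optimizer step, and therefore each communication round, carries roughly D times the signal.

```cpp
// Toy illustration (not DeepSeek's training code): given precomputed
// log-probabilities logp[pos][d] for the d-th future token at each position,
// the kernel folds DEPTH prediction targets into one per-position loss.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int DEPTH = 4;   // future tokens predicted per position (illustrative)

__global__ void mtpLoss(const float* logp, float* loss, int numPos) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPos) return;
    float acc = 0.0f;
    for (int d = 0; d < DEPTH; ++d)
        acc -= logp[p * DEPTH + d];   // negative log-likelihood of each predicted token
    loss[p] = acc / DEPTH;            // averaged over the prediction depth
}

int main() {
    const int numPos = 1 << 16;
    float *logp, *loss;
    cudaMallocManaged(&logp, numPos * DEPTH * sizeof(float));
    cudaMallocManaged(&loss, numPos * sizeof(float));
    for (int i = 0; i < numPos * DEPTH; ++i) logp[i] = -1.5f;  // placeholder log-probs
    mtpLoss<<<(numPos + 255) / 256, 256>>>(logp, loss, numPos);
    cudaDeviceSynchronize();
    printf("loss[0] = %f (each step packs %d prediction targets per position)\n",
           loss[0], DEPTH);
    return 0;
}
```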
4. Putting It All Together
DeepSeek’s hardware story is not about exotic new GPUs or specialized silicon. Instead, it’s a lesson in “inverted HPC”:
Constrained GPU → Extraordinary Optimization
DeepSeek was forced to use H800s due to export controls. That constraint inspired them to re-architect how an LLM is trained at every level—model design, training pipeline, cluster communications, and even register-level instruction scheduling.
Low-Level Control → Near-Optimal Resource Utilization
By manually tuning PTX, DeepSeek overcame dataflow bottlenecks. They allocated 20 SMs for communications and adopted advanced caching/prefetching routines that standard CUDA-based HPC shops rarely attempt.
Model & Hardware Co-Design → Scalable and Cheap
Mixture of Experts, multi-head latent attention, and multi-token training steps all reduce the demands on bandwidth. These model-level innovations dovetail with hardware optimizations to exploit every last FLOP the H800 provides.
DeepSeek’s Optimization Strategies: Co-Designing Model & Hardware, Low-Level PTX Control, and GPU Constraints Workarounds to Maximize H800 Efficiency.
With these strategies, DeepSeek turned an ostensibly “handicapped” GPU into a workable foundation for cutting-edge large language models, achieving shockingly low training costs per token. The result is a potent demonstration that “more advanced hardware” (e.g., H100s) is not the only path to top-tier AI performance. In the wake of DeepSeek’s methods, questions loom about whether HPC clusters worldwide could adopt similar low-level optimizations—even on stronger GPUs—to further push the boundaries of LLM training efficiency.
DeepSeek’s emphasis on hardware–software co-design sets a new gold standard for large-scale AI. Their success—under hardware constraints that many considered untenable—suggests that HPC progress in AI is not only about the biggest memory bandwidth or the densest GPU cluster; it is also about the readiness to dive beneath standard software abstractions and orchestrate the entire stack from PTX up to the model graph.
Whether or not this approach is generalizable remains to be seen. For many organizations, it may be more cost-effective to pay for higher-bandwidth GPUs and rely on standard CUDA. But DeepSeek’s work proves that with enough ingenuity, even “nerfed” hardware can be leveraged to train world-class AI systems.
As LLM training scales continue to skyrocket, the industry will doubtless learn from DeepSeek’s blueprint. Expect to see more partial hardware customizations, deeper compiler-level hacks, and further evolutions of mixture-of-experts architectures designed to let HPC clusters saturate GPU compute despite persistent memory bandwidth bottlenecks. DeepSeek’s approach may well herald a new era of extreme HPC optimization—reminding us that competitive advantage in AI is at least as much about specialized software engineering as it is about raw hardware horsepower.