

Description

DeepEP is a communication library designed specifically for Mixture of Experts (MoE) and expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels optimized for both training and inference. The library supports low-precision operations, including FP8, and features kernels optimized for asymmetric-domain bandwidth forwarding (such as forwarding data from the NVLink domain to the RDMA domain), making it suitable for a range of GPU architectures and network configurations.

How to use DeepEP?

To use DeepEP, install the required dependencies, including NVSHMEM, and import the library into your Python project. Configure the communication buffers and set the number of streaming multiprocessors (SMs) the kernels may use. Then call the provided dispatch and combine functions during model training or inference.
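To make the dispatch/combine pattern concrete, here is a minimal, framework-free sketch of what these two operations do logically. This is a conceptual illustration only, not DeepEP's actual API: `dispatch` routes each token to the expert chosen for it, and `combine` scatters the expert outputs back to each token's original position.

```python
# Conceptual sketch of MoE dispatch/combine (NOT DeepEP's API).
def dispatch(tokens, expert_ids, num_experts):
    """Group tokens by destination expert, remembering original slots."""
    buckets = [[] for _ in range(num_experts)]
    for slot, (tok, eid) in enumerate(zip(tokens, expert_ids)):
        buckets[eid].append((slot, tok))
    return buckets

def combine(buckets, num_tokens):
    """Scatter expert outputs back to each token's original slot."""
    out = [None] * num_tokens
    for bucket in buckets:
        for slot, tok in bucket:
            out[slot] = tok
    return out

tokens = [10, 20, 30, 40]
expert_ids = [1, 0, 1, 0]  # top-1 routing decision per token
buckets = dispatch(tokens, expert_ids, num_experts=2)
# Each "expert" doubles its tokens (a stand-in for the expert MLP).
processed = [[(slot, tok * 2) for slot, tok in b] for b in buckets]
assert combine(processed, len(tokens)) == [20, 40, 60, 80]
```

In the real library, the same two steps run as all-to-all GPU kernels across devices, with the routing decided by the model's gating network.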

Core features of DeepEP:

1️⃣ High-throughput and low-latency GPU kernels for MoE and EP

2️⃣ Support for low-precision operations, including FP8

3️⃣ Kernels optimized for asymmetric-domain bandwidth forwarding

4️⃣ Low-latency kernels for inference decoding

5️⃣ Hook-based communication-computation overlapping method
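The hook-based overlapping idea can be illustrated with a small thread-based sketch. This is a conceptual analogy, not DeepEP's implementation (which overlaps RDMA transfers with GPU computation without occupying SMs): the transfer starts in the background, and the caller only blocks on the hook when the result is actually needed.

```python
import threading

# Conceptual illustration of hook-based communication-computation
# overlap (NOT DeepEP's implementation): start the "transfer" in a
# background thread and return a hook that blocks only when called.
def make_recv_hook(fetch):
    result = {}
    done = threading.Event()

    def worker():
        result["tokens"] = fetch()  # stands in for an async receive
        done.set()

    threading.Thread(target=worker, daemon=True).start()

    def hook():
        done.wait()  # block only if the data has not yet arrived
        return result["tokens"]

    return hook

hook = make_recv_hook(lambda: [1, 2, 3])
# ... unrelated computation runs here while the transfer proceeds ...
tokens = hook()  # -> [1, 2, 3]
```

The design benefit is that computation scheduled between starting the transfer and calling the hook hides the communication latency.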

Why use DeepEP?

| # | Use case |
|---|----------|
| 1 | Model training using normal kernels |
| 2 | Inference prefilling phase |
| 3 | Latency-sensitive inference decoding |

Who developed DeepEP?

DeepEP is developed by a team of researchers and engineers, including Chenggang Zhao, Shangyan Zhou, and Liyue Zhang, among others, who focus on advancing communication libraries for efficient expert-parallel processing in deep learning.

FAQ of DeepEP