We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem! By integrating Mooncake’s high-performance KVCache transfer and storage capabilities with PyTorch-native inference engines like SGLang, vLLM, and TensorRT-LLM, we are unlocking new levels of throughput and scalability for large language model deployments.
To view the PyTorch Ecosystem, see the PyTorch Landscape. Learn more about how projects can join the PyTorch Ecosystem.
About Mooncake
Mooncake is designed to solve the “memory wall” in LLM serving. As context lengths grow and models scale, the static binding of Key-Value (KV) cache to specific GPU workers becomes a primary bottleneck.
Mooncake empowers inference engines to break this binding, unlocking five critical capabilities:
- (Encoder) Prefill-Decode Disaggregation: Mooncake’s high-performance Mooncake Transfer Engine separates heavy computation (prefill/encoder) from latency-sensitive generation (decoding) into distinct clusters.
- Global KVCache Reuse: By acting as a distributed shared memory for KV blocks, Mooncake Store enables valid cache to be reused globally across different requests and engine instances.
- Elastic Expert Parallelism: By decoupling experts from specific workers, Mooncake-EP enables elastic and resilient serving where experts of Mixture-of-Experts (MoE) models can be dynamically routed or recovered, ensuring high availability even during partial node failures.
- PyTorch Distributed Backend: Mooncake Backend serves as a fault-tolerant PyTorch distributed backend. It provides robust collective communication primitives capable of continuing operation seamlessly in the presence of rank failures.
- Weight Updating: Mooncake Store enables rapid weight updates for RL and checkpointing scenarios by storing weights internally. It offers tensor-native, zero-copy APIs.
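To make the tensor-native, zero-copy idea behind Mooncake Store concrete, here is a minimal in-memory stand-in. This is plain Python, not the real Mooncake API; the class and method names are illustrative only. The point it demonstrates is that a store can hand out views of its buffers instead of copies, which is what makes large weight updates cheap:

```python
class ToyKVStore:
    """Illustrative stand-in for a distributed KV/weight store.

    Not the Mooncake API: it only demonstrates the zero-copy idea,
    i.e. get() returns a memoryview into the stored buffer rather
    than a fresh copy of the bytes.
    """

    def __init__(self):
        self._segments: dict[str, bytearray] = {}

    def put(self, key: str, data: bytes) -> None:
        # Store the payload in a mutable buffer owned by the store.
        self._segments[key] = bytearray(data)

    def get(self, key: str) -> memoryview:
        # Return a view into the buffer: no bytes are copied.
        return memoryview(self._segments[key])


store = ToyKVStore()
store.put("layer0.kv", b"\x01\x02\x03\x04")
view = store.get("layer0.kv")
print(bytes(view))  # b'\x01\x02\x03\x04'
```

Because `view` aliases the store's buffer, an in-place update to the stored segment is immediately visible to every holder of the view, with no serialization or copy step in between.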
Wide Industry Adoption
Mooncake originated from a research collaboration between Moonshot AI and Tsinghua University. It was born from the need to solve the “memory wall” in serving massive-scale models like Kimi. Since open-sourcing, it has evolved into a thriving community-driven project.
Mooncake’s architecture has been battle-tested in some of the world’s most demanding production environments. Its ability to decouple compute from memory has led to wide adoption across leading organizations, including Moonshot AI (Kimi), Alibaba Cloud, Ant Group, JD.com, Tencent, Meituan, Approaching.AI, and LightSeek Foundation.
These organizations utilize Mooncake to maximize GPU utilization and ensure smooth serving for millions of concurrent users.
In Action: A Joint Solution
To demonstrate the full potential of this architecture, we present a joint solution that combines Mooncake with the ecosystem’s leading inference engines and orchestration tools.
In this architecture, we will use RoleBasedGroup (RBG, https://github.com/sgl-project/rbg) to orchestrate the entire topology, defining the relationships and startup order of the cluster. It deploys Shepherd Model Gateway (SMG, https://github.com/lightseekorg/smg) as the critical routing layer, which intelligently directs incoming requests to the appropriate workers based on cache locality and system load. The heavy lifting is then performed by SGLang (https://github.com/sgl-project/sglang) or vLLM (https://github.com/vllm-project/vllm) instances serving as compute workers, while Mooncake functions as the high-speed data plane: its Transfer Engine pushes prefilled KV cache via RDMA/NVLink, and its Store persists that cache for global reuse by decoding nodes.
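The cache-locality-aware routing described above can be sketched as a simple scoring policy. The heuristic below is invented for illustration and is not SMG's actual scheduler: it rewards the worker that already holds the longest reusable prefix of the request's KVCache, and penalizes workers that are heavily loaded:

```python
from dataclasses import dataclass


@dataclass
class Worker:
    url: str
    cached_prefix_tokens: int  # tokens of this request already cached on the worker
    queue_depth: int           # pending requests (load signal)


def pick_worker(workers, prompt_tokens, load_weight=0.01):
    """Choose the worker with the best cache-hit / load trade-off.

    Toy policy, not SMG's real algorithm: score = expected prefix
    hit ratio minus a load penalty proportional to queue depth.
    """
    def score(w):
        hit_ratio = min(w.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
        return hit_ratio - load_weight * w.queue_depth

    return max(workers, key=score)


workers = [
    Worker("http://prefill-0:8000", cached_prefix_tokens=900, queue_depth=12),
    Worker("http://prefill-1:8000", cached_prefix_tokens=100, queue_depth=1),
]
# prefill-0 wins: a 90% prefix hit outweighs its longer queue.
print(pick_worker(workers, prompt_tokens=1000).url)  # http://prefill-0:8000
```

The design intuition is the same one that motivates Global KVCache Reuse: recomputing a long prefill is far more expensive than waiting briefly in a queue, so cache locality dominates the score unless load is extreme.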
1. Deployment with SGLang + Mooncake + SMG
Below is an RBG configuration that deploys a complete SGLang architecture with both Prefill-Decode Disaggregation and Global KVCache Reuse enabled. The Prefill instances use the Mooncake Transfer Engine to push KVCache to the Decode instances, while Mooncake Store allows KVCache to be reused across different requests on the Prefill instances (more details in KEP-74 Mooncake Integration and pd-disaggregated-with-mooncake.yaml).
```yaml
# Joint Solution: RBG + SMG + SGLang + Mooncake (Production Ready)
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: sglang-mooncake-smg-v2
spec:
  roles:
    # 1. Mooncake Master: Centralized Metadata Server for TE and Store
    - name: mooncake-master
      replicas: 1
      template:
        spec:
          containers:
            - name: master
              image: lmsysorg/sglang:latest
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              command: ["mooncake_master"]
              args:
                - --enable_http_metadata_server=true
                - --rpc_address=$(POD_IP)
                - --rpc_port=50051
                - --http_metadata_server_host=$(POD_IP)
                - --http_metadata_server_port=8080
                - --metrics_port=9003
    # 2. Mooncake Store: Distributed KVCache Storage Nodes
    - name: mooncake-store
      replicas: 3
      dependencies: ["mooncake-master"]
      template:
        spec:
          containers:
            - name: store-node
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_GLOBAL_SEGMENT_SIZE
                  value: "45gb"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma"  # Use RDMA for zero-copy KVCache transfer
              command: ["python3", "-m", "mooncake.mooncake_store_service"]
              resources:
                limits:
                  memory: "50Gi"
                  rdma/hca: 1  # Required for high-speed TE transfer
                requests:
                  memory: "50Gi"
                  rdma/hca: 1
    # 3. Prefill Worker (SGLang): High-throughput Prefill with Mooncake Push
    - name: prefill-worker
      replicas: 1
      dependencies: ["mooncake-master", "mooncake-store"]
      template:
        spec:
          containers:
            - name: sglang-prefill
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma"
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path /models/Qwen3
                - --tp 4
                - --disaggregation-mode prefill
                - --disaggregation-transfer-backend mooncake  # Activates Mooncake TE for KVCache Push
                - --enable-hierarchical-cache  # Enables KVCache offloading
                - --hicache-storage-backend mooncake  # Uses Mooncake as the L2/L3 cache backend
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
    # 4. Decode Worker (SGLang): Low-latency Generation with Mooncake Pull
    - name: decode-worker
      replicas: 2
      dependencies: ["mooncake-master", "prefill-worker"]
      template:
        spec:
          containers:
            - name: sglang-decode
              image: lmsysorg/sglang:latest
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path /models/Qwen3
                - --tp 4
                - --disaggregation-mode decode  # Pulls shared KVCache from Mooncake Store
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
    # 5. Shepherd Model Gateway (SMG): Intelligent PD-Disaggregation Router
    - name: smg-router
      replicas: 1
      dependencies: ["prefill-worker", "decode-worker"]
      template:
        spec:
          containers:
            - name: router
              image: lightseekorg/smg:latest
              command:
                - smg
                - --pd-disaggregation
                - --prefill http://s-sglang-mooncake-smg-v2-prefill-worker:8000
                - --decode http://s-sglang-mooncake-smg-v2-decode-worker:8000
                - --host 0.0.0.0
                - --port 8000
```
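Once the stack is running, clients talk to the SMG router like any OpenAI-compatible endpoint. A minimal client sketch using only the standard library; the gateway hostname comes from the manifest above, and whether your gateway exposes `/v1/chat/completions` depends on the engine's OpenAI-compatible server:

```python
import json
from urllib import request

# Service name and port from the smg-router role in the manifest above.
GATEWAY = "http://s-sglang-mooncake-smg-v2-smg-router:8000"


def build_chat_request(prompt: str, model: str = "/models/Qwen3") -> request.Request:
    """Build an OpenAI-style chat completion request for the SMG gateway."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Requires the cluster above to be running and the Service reachable.
    with request.urlopen(build_chat_request("Hello, Mooncake!")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

From the client's perspective nothing changes versus a monolithic server: the PD disaggregation, KVCache push over RDMA, and cache-aware routing all happen behind the single gateway endpoint.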
2. Deployment with vLLM + Mooncake
vLLM has also integrated Mooncake support, allowing users to leverage Mooncake connectors for seamless KV transfer. Below is the equivalent RBG solution for deploying vLLM in a disaggregated setup using the Mooncake connector.
```yaml
# Joint Solution: RBG + vLLM + Mooncake Connector
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: vllm-pd-with-mooncake-demo
spec:
  roles:
    # 1. Gateway: Routing to vLLM instances (SMG or vLLM Proxy)
    - name: proxy
      dependencies: ["prefill", "decode"]
      replicas: 1
      template:
        spec:
          containers:
            - name: proxy
              image: lightseekorg/smg:latest
              command:
                - smg
                - --prefiller-host
                - http://vllm-pd-with-mooncake-demo-prefill-0.s-vllm-pd-with-mooncake-demo-prefill
                - --prefiller-port
                - "8000"
                - --decoder-host
                - http://vllm-pd-with-mooncake-demo-decode-0.s-vllm-pd-with-mooncake-demo-decode
                - --decoder-port
                - "8000"
    # 2. Prefill Worker (vLLM): Producer role
    - name: prefill
      replicas: 1
      template:
        spec:
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen2.5-7b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 30Gi
          containers:
            - name: prefill
              image: vllm/vllm-openai:latest
              command:
                - sh
                - -c
                - |
                  pip install mooncake-transfer-engine && \
                  vllm serve /models/Qwen2.5-7B-Instruct \
                    --port 8000 \
                    --tensor-parallel-size 4 \
                    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
              ports:
                - containerPort: 8000
                  name: http
              readinessProbe:
                initialDelaySeconds: 30
                periodSeconds: 10
                tcpSocket:
                  port: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
                requests:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
              volumeMounts:
                - mountPath: /models/Qwen2.5-7B-Instruct
                  name: model
                - mountPath: /dev/shm
                  name: dshm
    # 3. Decode Worker (vLLM): Consumer role
    - name: decode
      replicas: 1
      template:
        spec:
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen2.5-7b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 30Gi
          containers:
            - name: decode
              image: vllm/vllm-openai:latest
              command:
                - sh
                - -c
                - |
                  pip install mooncake-transfer-engine && \
                  vllm serve /models/Qwen2.5-7B-Instruct \
                    --port 8000 \
                    --tensor-parallel-size 4 \
                    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
              ports:
                - containerPort: 8000
                  name: http
              readinessProbe:
                initialDelaySeconds: 30
                periodSeconds: 10
                tcpSocket:
                  port: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
                requests:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
              volumeMounts:
                - mountPath: /models/Qwen2.5-7B-Instruct
                  name: model
                - mountPath: /dev/shm
                  name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-pd-with-mooncake-demo
  name: vllm-pd-with-mooncake-demo
  namespace: default
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    rolebasedgroup.workloads.x-k8s.io/name: vllm-pd-with-mooncake-demo
    rolebasedgroup.workloads.x-k8s.io/role: proxy
  type: ClusterIP
```
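The only vLLM-side switch between the two roles in the manifests above is the `--kv-transfer-config` JSON: the prefill worker runs as `kv_producer`, the decode worker as `kv_consumer`. A small helper that renders that value for either role (the connector and role names match the manifests; the helper itself is just for illustration):

```python
import json


def kv_transfer_config(role: str) -> str:
    """Render the --kv-transfer-config value used in the manifests above.

    role must be "kv_producer" (prefill side) or "kv_consumer" (decode side).
    """
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unknown kv_role: {role}")
    return json.dumps({"kv_connector": "MooncakeConnector", "kv_role": role})


print(kv_transfer_config("kv_producer"))
# {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"}
```

Keeping everything else identical between the two Deployments and flipping only this role flag is what makes the producer/consumer pairing easy to reason about: the connector name selects Mooncake as the transport, and the role decides whether KVCache is pushed or pulled.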
Conclusion
Mooncake adds a vital layer of memory virtualization to the open-source AI stack. By enabling PyTorch-native engines such as SGLang, vLLM, and TensorRT-LLM to adopt KVCache-centric architectures, we are paving the way for more efficient, scalable, and lower-latency LLM services.
We invite you to explore the project and start building:
- Mooncake GitHub: https://github.com/kvcache-ai/Mooncake
- Mooncake Project Docs: https://kvcache-ai.github.io/Mooncake/