We are thrilled to announce that Mooncake has officially joined the PyTorch Ecosystem! By integrating Mooncake’s high-performance KVCache transfer and storage capabilities with PyTorch-native inference engines like SGLang, vLLM, and TensorRT-LLM, we are unlocking new levels of throughput and scalability for large language model deployments.
To view the PyTorch Ecosystem, see the PyTorch Landscape. Learn more about how projects can join the PyTorch Ecosystem.
About Mooncake
Mooncake is designed to solve the “memory wall” in LLM serving. As context lengths grow and models scale, the static binding of Key-Value (KV) cache to specific GPU workers becomes a primary bottleneck.
Mooncake empowers inference engines to break this binding, unlocking five critical capabilities:
- (Encoder) Prefill-Decode Disaggregation: Mooncake’s high-performance Mooncake Transfer Engine separates heavy computation (prefill/encoder) from latency-sensitive generation (decoding) into distinct clusters.
- Global KVCache Reuse: By acting as a distributed shared memory for KV blocks, Mooncake Store enables valid cache to be reused globally across different requests and engine instances.
- Elastic Expert Parallelism: By decoupling experts from specific workers, Mooncake-EP enables elastic and resilient serving where experts of Mixture-of-Experts (MoE) models can be dynamically routed or recovered, ensuring high availability even during partial node failures.
- PyTorch Distributed Backend: Mooncake Backend serves as a fault-tolerant PyTorch distributed backend. It provides robust collective communication primitives capable of continuing operation seamlessly in the presence of rank failures.
- Weight Updating: Mooncake Store enables rapid weight updates for RL and checkpointing scenarios by storing weights internally. It offers tensor-native, zero-copy APIs.
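To make the tensor-native, zero-copy idea behind Mooncake Store concrete, here is a minimal in-memory stand-in. This is plain Python, not the real Mooncake API; the class and method names are illustrative only. The point it demonstrates is that a store can hand out views of its buffers instead of copies, which is what makes large weight updates cheap:

```python
class ToyKVStore:
    """Illustrative stand-in for a distributed KV/weight store.

    Not the Mooncake API: it only demonstrates the zero-copy idea,
    i.e. get() returns a memoryview into the stored buffer rather
    than a fresh copy of the bytes.
    """

    def __init__(self):
        self._segments: dict[str, bytearray] = {}

    def put(self, key: str, data: bytes) -> None:
        # Store the payload in a mutable buffer owned by the store.
        self._segments[key] = bytearray(data)

    def get(self, key: str) -> memoryview:
        # Return a view into the buffer: no bytes are copied.
        return memoryview(self._segments[key])


store = ToyKVStore()
store.put("layer0.kv", b"\x01\x02\x03\x04")
view = store.get("layer0.kv")
print(bytes(view))  # b'\x01\x02\x03\x04'
```

Because `view` aliases the store's buffer, an in-place update to the stored segment is immediately visible to every holder of the view, with no serialization or copy step in between.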
Wide Industry Adoption
Mooncake originated from a research collaboration between Moonshot AI and Tsinghua University. It was born from the need to solve the “memory wall” in serving massive-scale models like Kimi. Since open-sourcing, it has evolved into a thriving community-driven project.
Mooncake’s architecture has been battle-tested in some of the world’s most demanding production environments. Its ability to decouple compute from memory has led to wide adoption across leading organizations, including Moonshot AI (Kimi), Alibaba Cloud, Ant Group, JD.com, Tencent, Meituan, Approaching.AI, and LightSeek Foundation.
These organizations utilize Mooncake to maximize GPU utilization and ensure smooth serving for millions of concurrent users.
In Action: A Joint Solution
To demonstrate the full potential of this architecture, we present a joint solution that combines Mooncake with the ecosystem’s leading inference engines and orchestration tools.
In this architecture, we will use RoleBasedGroup (RBG, https://github.com/sgl-project/rbg) to orchestrate the entire topology, defining the relationships and startup order of the cluster. It deploys Shepherd Model Gateway (SMG, https://github.com/lightseekorg/smg) as the critical routing layer, which intelligently directs incoming requests to the appropriate workers based on cache locality and system load. The heavy lifting is then performed by SGLang (https://github.com/sgl-project/sglang) or vLLM (https://github.com/vllm-project/vllm) instances serving as compute workers, while Mooncake functions as the high-speed data plane: its Transfer Engine pushes prefilled KV cache via RDMA/NVLink, and its Store persists that cache for global reuse by decoding nodes.
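The cache-locality-aware routing described above can be sketched as a simple scoring policy. The heuristic below is invented for illustration and is not SMG's actual scheduler: it rewards the worker that already holds the longest reusable prefix of the request's KVCache, and penalizes workers that are heavily loaded:

```python
from dataclasses import dataclass


@dataclass
class Worker:
    url: str
    cached_prefix_tokens: int  # tokens of this request already cached on the worker
    queue_depth: int           # pending requests (load signal)


def pick_worker(workers, prompt_tokens, load_weight=0.01):
    """Choose the worker with the best cache-hit / load trade-off.

    Toy policy, not SMG's real algorithm: score = expected prefix
    hit ratio minus a load penalty proportional to queue depth.
    """
    def score(w):
        hit_ratio = min(w.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
        return hit_ratio - load_weight * w.queue_depth

    return max(workers, key=score)


workers = [
    Worker("http://prefill-0:8000", cached_prefix_tokens=900, queue_depth=12),
    Worker("http://prefill-1:8000", cached_prefix_tokens=100, queue_depth=1),
]
# prefill-0 wins: a 90% prefix hit outweighs its longer queue.
print(pick_worker(workers, prompt_tokens=1000).url)  # http://prefill-0:8000
```

The design intuition is the same one that motivates Global KVCache Reuse: recomputing a long prefill is far more expensive than waiting briefly in a queue, so cache locality dominates the score unless load is extreme.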
1. Deployment with SGLang + Mooncake + SMG
Below is an RBG configuration that deploys a complete SGLang architecture with both Prefill-Decode Disaggregation and Global KVCache Reuse enabled. The Prefill instances use the Mooncake Transfer Engine to push KVCache to the Decode instances, while Mooncake Store allows KVCache to be reused across different requests on the Prefill instances (more details in KEP-74 Mooncake Integration and pd-disaggregated-with-mooncake.yaml).
```yaml
# Joint Solution: RBG + SMG + SGLang + Mooncake (Production Ready)
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: sglang-mooncake-smg-v2
spec:
  roles:
    # 1. Mooncake Master: Centralized Metadata Server for TE and Store
    - name: mooncake-master
      replicas: 1
      template:
        spec:
          containers:
            - name: master
              image: lmsysorg/sglang:latest
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
              command: ["mooncake_master"]
              args:
                - --enable_http_metadata_server=true
                - --rpc_address=$(POD_IP)
                - --rpc_port=50051
                - --http_metadata_server_host=$(POD_IP)
                - --http_metadata_server_port=8080
                - --metrics_port=9003
    # 2. Mooncake Store: Distributed KVCache Storage Nodes
    - name: mooncake-store
      replicas: 3
      dependencies: ["mooncake-master"]
      template:
        spec:
          containers:
            - name: store-node
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_GLOBAL_SEGMENT_SIZE
                  value: "45gb"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma"  # Use RDMA for zero-copy KVCache transfer
              command: ["python3", "-m", "mooncake.mooncake_store_service"]
              resources:
                limits:
                  memory: "50Gi"
                  rdma/hca: 1  # Required for high-speed TE transfer
                requests:
                  memory: "50Gi"
                  rdma/hca: 1
    # 3. Prefill Worker (SGLang): High-throughput Prefill with Mooncake Push
    - name: prefill-worker
      replicas: 1
      dependencies: ["mooncake-master", "mooncake-store"]
      template:
        spec:
          containers:
            - name: sglang-prefill
              image: lmsysorg/sglang:latest
              env:
                - name: MOONCAKE_MASTER
                  value: "s-sglang-mooncake-smg-v2-mooncake-master:50051"
                - name: MOONCAKE_TE_META_DATA_SERVER
                  value: "http://s-sglang-mooncake-smg-v2-mooncake-master:8080/metadata"
                - name: MOONCAKE_PROTOCOL
                  value: "rdma"
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path /models/Qwen3
                - --tp 4
                - --disaggregation-mode prefill
                - --disaggregation-transfer-backend mooncake  # Activates Mooncake TE for KVCache Push
                - --enable-hierarchical-cache  # Enables KVCache offloading
                - --hicache-storage-backend mooncake  # Uses Mooncake as the L2/L3 cache backend
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
    # 4. Decode Worker (SGLang): Low-latency Generation with Mooncake Pull
    - name: decode-worker
      replicas: 2
      dependencies: ["mooncake-master", "prefill-worker"]
      template:
        spec:
          containers:
            - name: sglang-decode
              image: lmsysorg/sglang:latest
              command:
                - python3
                - -m
                - sglang.launch_server
                - --model-path /models/Qwen3
                - --tp 4
                - --disaggregation-mode decode  # Pulls shared KVCache from Mooncake Store
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
    # 5. Shepherd Model Gateway (SMG): Intelligent PD-Disaggregation Router
    - name: smg-router
      replicas: 1
      dependencies: ["prefill-worker", "decode-worker"]
      template:
        spec:
          containers:
            - name: router
              image: lightseekorg/smg:latest
              command:
                - smg
                - --pd-disaggregation
                - --prefill http://s-sglang-mooncake-smg-v2-prefill-worker:8000
                - --decode http://s-sglang-mooncake-smg-v2-decode-worker:8000
                - --host 0.0.0.0
                - --port 8000
```
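Once the stack is running, clients talk to the SMG router like any OpenAI-compatible endpoint. A minimal client sketch using only the standard library; the gateway hostname comes from the manifest above, and whether your gateway exposes `/v1/chat/completions` depends on the engine's OpenAI-compatible server:

```python
import json
from urllib import request

# Service name and port from the smg-router role in the manifest above.
GATEWAY = "http://s-sglang-mooncake-smg-v2-smg-router:8000"


def build_chat_request(prompt: str, model: str = "/models/Qwen3") -> request.Request:
    """Build an OpenAI-style chat completion request for the SMG gateway."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return request.Request(
        f"{GATEWAY}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Requires the cluster above to be running and the Service reachable.
    with request.urlopen(build_chat_request("Hello, Mooncake!")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

From the client's perspective nothing changes versus a monolithic server: the PD disaggregation, KVCache push over RDMA, and cache-aware routing all happen behind the single gateway endpoint.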
2. Deployment with vLLM + Mooncake
vLLM has also integrated Mooncake support, allowing users to leverage Mooncake connectors for seamless KV transfer. Below is the equivalent RBG solution for deploying vLLM in a disaggregated setup using the Mooncake connector.
```yaml
# Joint Solution: RBG + vLLM + Mooncake Connector
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: vllm-pd-with-mooncake-demo
spec:
  roles:
    # 1. Gateway: Routing to vLLM instances (SMG or vLLM Proxy)
    - name: proxy
      dependencies: ["prefill", "decode"]
      replicas: 1
      template:
        spec:
          containers:
            - name: proxy
              image: lightseekorg/smg:latest
              command:
                - smg
                - --prefiller-host
                - http://vllm-pd-with-mooncake-demo-prefill-0.s-vllm-pd-with-mooncake-demo-prefill
                - --prefiller-port
                - "8000"
                - --decoder-host
                - http://vllm-pd-with-mooncake-demo-decode-0.s-vllm-pd-with-mooncake-demo-decode
                - --decoder-port
                - "8000"
    # 2. Prefill Worker (vLLM): Producer role
    - name: prefill
      replicas: 1
      template:
        spec:
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen2.5-7b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 30Gi
          containers:
            - name: prefill
              image: vllm/vllm-openai:latest
              command:
                - sh
                - -c
                - |
                  pip install mooncake-transfer-engine && \
                  vllm serve /models/Qwen2.5-7B-Instruct \
                    --port 8000 \
                    --tensor-parallel-size 4 \
                    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
              ports:
                - containerPort: 8000
                  name: http
              readinessProbe:
                initialDelaySeconds: 30
                periodSeconds: 10
                tcpSocket:
                  port: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
                requests:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
              volumeMounts:
                - mountPath: /models/Qwen2.5-7B-Instruct
                  name: model
                - mountPath: /dev/shm
                  name: dshm
    # 3. Decode Worker (vLLM): Consumer role
    - name: decode
      replicas: 1
      template:
        spec:
          volumes:
            - name: model
              persistentVolumeClaim:
                claimName: qwen2.5-7b
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 30Gi
          containers:
            - name: decode
              image: vllm/vllm-openai:latest
              command:
                - sh
                - -c
                - |
                  pip install mooncake-transfer-engine && \
                  vllm serve /models/Qwen2.5-7B-Instruct \
                    --port 8000 \
                    --tensor-parallel-size 4 \
                    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_consumer"}'
              ports:
                - containerPort: 8000
                  name: http
              readinessProbe:
                initialDelaySeconds: 30
                periodSeconds: 10
                tcpSocket:
                  port: 8000
              resources:
                limits:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
                requests:
                  nvidia.com/gpu: "4"
                  rdma/hca: 1
                  memory: "100Gi"
              volumeMounts:
                - mountPath: /models/Qwen2.5-7B-Instruct
                  name: model
                - mountPath: /dev/shm
                  name: dshm
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: vllm-pd-with-mooncake-demo
  name: vllm-pd-with-mooncake-demo
  namespace: default
spec:
  ports:
    - name: http
      port: 8000
      protocol: TCP
      targetPort: 8000
  selector:
    rolebasedgroup.workloads.x-k8s.io/name: vllm-pd-with-mooncake-demo
    rolebasedgroup.workloads.x-k8s.io/role: proxy
  type: ClusterIP
```
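The only vLLM-side switch between the two roles in the manifests above is the `--kv-transfer-config` JSON: the prefill worker runs as `kv_producer`, the decode worker as `kv_consumer`. A small helper that renders that value for either role (the connector and role names match the manifests; the helper itself is just for illustration):

```python
import json


def kv_transfer_config(role: str) -> str:
    """Render the --kv-transfer-config value used in the manifests above.

    role must be "kv_producer" (prefill side) or "kv_consumer" (decode side).
    """
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unknown kv_role: {role}")
    return json.dumps({"kv_connector": "MooncakeConnector", "kv_role": role})


print(kv_transfer_config("kv_producer"))
# {"kv_connector": "MooncakeConnector", "kv_role": "kv_producer"}
```

Keeping everything else identical between the two Deployments and flipping only this role flag is what makes the producer/consumer pairing easy to reason about: the connector name selects Mooncake as the transport, and the role decides whether KVCache is pushed or pulled.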
Conclusion
Mooncake adds a vital layer of memory virtualization to the open-source AI stack. By enabling PyTorch-native engines such as SGLang, vLLM, and TensorRT-LLM to adopt KVCache-centric architectures, we are paving the way for more efficient, scalable, and lower-latency LLM services.
We invite you to explore the project and start building:
- Mooncake GitHub: https://github.com/kvcache-ai/Mooncake
- Mooncake Project Docs: https://kvcache-ai.github.io/Mooncake/