LLM-D: Serving AI Inference at scale

Introduction:

AI inference is the "doing" part of artificial intelligence. It's the moment a trained model stops learning and starts working, turning its knowledge into real-world results.

We all use cutting-edge frontier models in our day-to-day use like Gemini, Claude etc. In our personal life we don’t interact with open-source, purpose built models like Qwen, Llama etc that often who have smaller parameters and often fine-tuned for specific tasks. But for enterprises, there is a growing demand to use these type of models internally. Whether it is to serve models fine-tuned on proprietary data, or for agentic purposes like tool-calling at scale, understanding-intent etc. This creates a unique use-case of using cloud-native frameworks like KServe and llm-d to put forward best-practices to serve AI inference

LLM-D

llm-d: a Kubernetes-native high-performance distributed LLM inference framework

llm-d is an open-source inference framework which offers battle-tested configurations to start our model deployment journey. Rather than using vanilla Kubernetes to run vLLM pods in accelerated hardware, llm-d offers better performance metrics by solving problems like KV-Cache offloading, separation of prefill and decode phases(Prefill/Decode Disaggregation), Intelligent inference routing etc. These features helps us in serving LLM models at scale. One example is, by separating prefill and decode phases, we can deploy prefill pods in compute-bound hardware and decode pods in memory-bound hardware. Also by tweaking the PD ratio of pods, you can achieve smaller TTFT(time to first token) and TPOT(time per output token). The llm-d framework have many well-lit paths to solve specific problems depending on the type of QPS(queries per sec) workloads and models.

Architecture

llm-d architecture have some novel components like GIE(GatewayAPI Inference Extension) and EndPointPicker(EPP) which uses filtering and scoring techniques to enable Gateway API to route traffic dynamically to the next best pod.

Here is a detailed infographic on inference scheduling

Inference Gateway Architecture

Testing Inference Scheduling

I have deployed a GKE cluster using Nvidia L4 instance and some general compute for all other pods. Firstly you need to enable GKE Gateway class in-order for llm-d to create Gateway during helm install. Here is a how-to for that. Once it is done, you need to install monitoring stack to observe the metrics on vLLM pods. llm-d github page have a bash script to deploy kube-prometheus-stack. The I have installed the helmcharts to deploy modelserver(ms) pods in L4 GPU instance.

llm-d-inference-scheduler   gaie-inference-scheduling-epp-5dc59b4767-dv8tx                    1/1     Running   0          51m
llm-d-inference-scheduler   ms-inference-scheduling-llm-d-modelservice-decode-5b7bb867kxmff   0/2     Pending   0          27m
llm-d-inference-scheduler   ms-inference-scheduling-llm-d-modelservice-decode-5b7bb867zxgnm   2/2     Running   0          32m
llm-d-inference-scheduler   ms-inference-scheduling-llm-d-modelservice-decode-77f5d7b89bc7n   0/2     Pending   0          31m

Then I have deployed httproute to connect my InferencePool to the Gateway.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-d-inference-scheduling
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: infra-inference-scheduling-inference-gateway
  rules:
    - backendRefs:
      - group: inference.networking.k8s.io
        kind: InferencePool
        name: gaie-inference-scheduling
        weight: 1
      matches:
      - path:
          type: PathPrefix
          value: /

Then I could run some load tests on the single vLLM pod that I managed to deploy through the external gateway ip.

Conclusion

All the technologies like llm-d, KServe and GIE are still going through rapid progression. LLM inference poses unique challenges which can be tricky to solve. In the future where there will be massive consumption of tokens by applications, you cannot pay per-token price while it is just an additional feature in your application(imagine booking uber using voice). In those circumstances, frameworks like llm-d is the best alternative.

LLM-D: Serving AI Inference at scale

Introduction:

LLM-D

Architecture

Testing Inference Scheduling

Conclusion

Comments

More from this blog

GitOps based Platform Engineering feat. Crossplane and ArgoCD

Gateway API: The better way to Ingress

Guide to Using AL2023 AMI with Amazon EKS

Hit me with a BRIC(S)

Command Palette

Introduction:

LLM-D

Architecture

Testing Inference Scheduling

Conclusion

Comments

More from this blog