Episode 30 · LLM & AI · Free

LLM Serving Infrastructure

5:03 · Alex & Sam

Tags: vLLM, NVIDIA, AWS · #vllm #paged-attention #speculative-decoding #gpu-cluster

Show Notes

Serving LLMs at scale requires purpose-built infrastructure. Alex and Sam discuss vLLM, PagedAttention, speculative decoding, and how cloud providers think about GPU cluster scheduling.
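PagedAttention, as discussed in the episode, manages the KV cache in fixed-size blocks rather than one contiguous buffer per sequence, much like virtual-memory paging. A minimal sketch of the idea (the names `BLOCK_SIZE`, `BlockAllocator`, and `Sequence` are ours for illustration, not vLLM's actual API):

```python
# Toy sketch of PagedAttention-style KV-cache bookkeeping. Real vLLM stores
# actual key/value tensors in the blocks; here we only track the mapping.
BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size)

class BlockAllocator:
    """Hands out physical KV-cache block ids from a fixed pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            # In a real server this triggers preemption/swapping, not a crash.
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a growing token sequence onto non-contiguous physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so at most one partially filled block exists per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 40 tokens (ceil(40/16))
```

Because waste is capped at one partial block per sequence, many more sequences fit in the same GPU memory than with contiguous per-sequence allocation, which is what lets vLLM run large batches.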

Key Takeaways

  • vLLM's PagedAttention manages the KV cache in fixed-size blocks, analogous to OS virtual-memory paging, which cuts fragmentation and lets far more sequences share GPU memory.
  • Speculative decoding uses a small draft model to propose several tokens that the large model verifies in a single pass, reducing per-token latency.
  • Core concepts covered: vLLM, PagedAttention, speculative decoding, and GPU cluster scheduling.
  • Key trade-offs and design decisions you can apply to your own system design interviews.
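The speculative-decoding control flow above can be sketched as follows. This toy uses deterministic stand-in "models" and greedy verification; production systems (speculative sampling) accept or reject draft tokens against the target model's probabilities, and `draft_model` / `target_model_next` here are hypothetical placeholders, not a real API:

```python
# Hedged sketch of speculative decoding with greedy verification.
def draft_model(prefix, k):
    # Hypothetical cheap model: propose the next k tokens after `prefix`.
    return [(len(prefix) + i) % 5 for i in range(k)]

def target_model_next(prefix):
    # Hypothetical expensive model: the single token it emits after `prefix`.
    return len(prefix) % 5

def speculative_step(prefix, k=4):
    """One decode step: keep the draft's longest prefix that matches the
    target model, then take one token from the target itself, so every
    step makes at least one token of progress."""
    proposals = draft_model(prefix, k)
    accepted = []
    for tok in proposals:
        if target_model_next(prefix + accepted) == tok:
            accepted.append(tok)  # draft agreed with target: keep it
        else:
            break                 # first disagreement ends acceptance
    # The target model always contributes the following token itself.
    accepted.append(target_model_next(prefix + accepted))
    return prefix + accepted

out = speculative_step([0, 1, 2])
print(out)  # [0, 1, 2, 3, 4, 0, 1, 2]: all 4 drafts accepted, plus 1 more
```

When the draft model agrees often, each target-model pass yields several tokens instead of one, which is where the latency win comes from.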

Read the full article

LLM Serving Infrastructure — deep dive with diagrams, tradeoffs & interview questions

Architecture Diagram

[Diagram: LLM Serving Infrastructure architecture]