26 August 2026 11:00 - 12:00
Workshop | Deploying and auto-scaling LLM inference on Kubernetes
Getting an LLM application to work is one thing; scaling it reliably, efficiently, and cost-effectively is where most teams hit friction.
This workshop breaks down what actually matters when moving from prototype to production.
Ayushi will walk through the architectural decisions behind scalable LLM systems, how to optimize latency and throughput, and where costs quietly spiral if left unmanaged.
Expect a practical look at how leading teams structure inference pipelines, manage model workloads, and balance trade-offs between quality, speed, and cost.
Key takeaways:
→ How to design LLM architectures that scale without breaking under real-world demand (see the auto-scaling sketch after this list)
→ Techniques to optimize latency, throughput, and system performance
→ Where costs typically inflate and how to control them early
→ Trade-offs between model choice, infrastructure, and user experience
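To make the auto-scaling theme concrete, here is a minimal sketch of the kind of configuration the session is likely to touch on: attaching a HorizontalPodAutoscaler to an LLM inference Deployment using the official Kubernetes Python client. The deployment name `llm-inference`, the namespace, the replica bounds, and the CPU-utilization trigger are illustrative assumptions, not material from the workshop.

```python
# Illustrative sketch: autoscale a hypothetical "llm-inference" Deployment
# with the official Kubernetes Python client (pip install kubernetes).
# Names, replica bounds, and the CPU trigger are assumptions for demonstration.
from kubernetes import client, config


def create_llm_hpa(namespace: str = "default") -> None:
    # Load credentials from ~/.kube/config (use load_incluster_config() inside a pod).
    config.load_kube_config()

    hpa = client.V2HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name="llm-inference-hpa", namespace=namespace),
        spec=client.V2HorizontalPodAutoscalerSpec(
            # Target the Deployment that runs the inference server.
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="llm-inference"
            ),
            min_replicas=1,
            max_replicas=8,
            # CPU utilization is a placeholder trigger; GPU-bound LLM serving
            # usually scales on custom metrics such as request queue depth
            # or GPU utilization exposed through a metrics adapter.
            metrics=[
                client.V2MetricSpec(
                    type="Resource",
                    resource=client.V2ResourceMetricSource(
                        name="cpu",
                        target=client.V2MetricTarget(
                            type="Utilization", average_utilization=70
                        ),
                    ),
                )
            ],
        ),
    )

    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
        namespace=namespace, body=hpa
    )


if __name__ == "__main__":
    create_llm_hpa()
```

The same object can of course be expressed as a plain YAML manifest and applied with kubectl; the choice of scaling metric, replica bounds, and scale-up/scale-down behavior is exactly the kind of trade-off the workshop sets out to unpack.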