CASCON 2025
Mon 10 - Thu 13 November 2025
Mon 10 Nov 2025 08:30 - 10:00 at Room 3 - Tutorials (TUT-66)

Generative language models are rapidly becoming central to a growing number of businesses worldwide. However, with only a few dominant providers offering language model services, the ability to self-host them efficiently and effectively is increasingly at risk. This tutorial explores practical methods for serving large language models (LLMs) in production environments, with a focus on self-hosted deployments. It covers the full deployment pipeline, from acquiring a model to selecting an appropriate server and framework stack, examines the architecture of an end-to-end generative AI serving system, and discusses popular inference frameworks such as vLLM, TensorRT, and Triton. Through detailed examples and hands-on insights, attendees will gain the tools and understanding needed to make informed decisions about serving LLMs at scale, independently and efficiently.
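As a small taste of the kind of material the session covers, below is a minimal sketch of self-hosted inference using vLLM's offline Python API. The model name and sampling values are illustrative placeholders, not details taken from the tutorial:

    # Minimal self-hosted LLM inference with vLLM's offline API.
    # Assumes `pip install vllm` and a CUDA-capable GPU; the model name
    # and sampling values below are illustrative, not from the tutorial.
    from vllm import LLM, SamplingParams

    # Load the weights once; vLLM batches requests and manages the paged KV cache.
    llm = LLM(model="facebook/opt-125m")

    # Nucleus-sampling settings (illustrative values).
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    for output in llm.generate(["Self-hosting LLMs in production means"], params):
        print(output.outputs[0].text)

In a production setting, the same engine is more commonly exposed over HTTP through vLLM's OpenAI-compatible server (in recent versions, roughly: vllm serve facebook/opt-125m); Triton and TensorRT play comparable serving and optimization roles on the NVIDIA stack.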

Mon 10 Nov

Displayed time zone: Eastern Time (US & Canada)

08:30 - 10:00
Tutorials (TUT-66) at Room 3
08:30 (90m) Tutorial
Production-Ready LLMs: Frameworks, Workflows, and Trade-offs
Amirreza Esmaeili (University of British Columbia), Jackie Lam (Charli Capital), Elham Alipour (Charli Capital), Fatemeh Hendijani Fard (University of British Columbia, Okanagan)