CASCON 2025
Mon 10 - Thu 13 November 2025
Mon 10 Nov 2025 08:30 - 10:00 at Room 344 - Tutorials (TUT-66)

Generative language models are rapidly becoming central to a growing number of businesses worldwide. However, with only a few dominant providers offering language model services, the ability to self-host efficiently and effectively is increasingly at risk. This tutorial explores practical methods for serving large language models (LLMs) in production environments, with a focus on self-hosted deployments. It covers the full deployment pipeline, from acquiring a model to selecting an appropriate server and framework stack. The session examines the architecture of an end-to-end generative AI serving system and discusses popular inference frameworks such as vLLM, TensorRT, and Triton. Through detailed examples and hands-on insights, attendees will gain the tools and understanding needed to make informed decisions about serving LLMs at scale, independently and efficiently.
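As a taste of the framework stack the session covers, a minimal self-hosted deployment with vLLM can be sketched in two commands: one to launch its OpenAI-compatible HTTP server, one to query it. This is an illustrative sketch, not the tutorial's prescribed setup; the model name and port are placeholder choices, and it assumes a GPU host with vLLM installed (`pip install vllm`).

```shell
# Launch vLLM's OpenAI-compatible server (model name is an example;
# any Hugging Face causal LM that fits on the hardware will do).
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Query it with the standard OpenAI completions API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "Self-hosting LLMs means",
       "max_tokens": 32}'
```

Production deployments layer more on top of this starting point, such as batching and memory limits, quantization flags, and a reverse proxy in front of the server.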

Target Audience: This tutorial is designed for researchers and practitioners interested in deploying large language models outside of managed cloud services. It is especially relevant for those working in machine learning infrastructure, systems, and applied NLP, as well as for anyone looking to understand the practical aspects of serving generative models at scale.

Mon 10 Nov

Displayed time zone: Eastern Time (US & Canada)

08:30 - 10:00
Tutorials (TUT-66) at Room 344

08:30 (90m) Tutorial
Production-Ready LLMs: Frameworks, Workflows, and Trade-offs
Amirreza Esmaeili, University of British Columbia; Jackie Lam, Charli Capital; Elham Alipour, Charli Capital; Fatemeh Hendijani Fard, University of British Columbia, Okanagan