CASCON 2025
Mon 10 - Thu 13 November 2025
Mon 10 Nov 2025 08:30 - 10:00 at Room 344 - Tutorials (TUT-66)

Generative language models are rapidly becoming central to a growing number of businesses worldwide. However, with only a few dominant providers offering language model services, the ability to self-host efficiently and effectively is increasingly at risk. This tutorial explores practical methods for serving large language models (LLMs) in production environments, with a focus on self-hosted deployments. It covers the full deployment pipeline, from acquiring a model to selecting an appropriate server and framework stack. The session examines the architecture of an end-to-end generative AI serving system and discusses popular inference frameworks such as vLLM, TensorRT, and Triton. Through detailed examples and hands-on insights, attendees will gain the tools and understanding needed to make informed decisions about serving LLMs at scale, independently and efficiently.
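As a taste of the framework stack the session covers, a minimal self-hosted deployment with vLLM can be sketched in two commands: one to launch its OpenAI-compatible HTTP server, one to query it. This is an illustrative sketch, not the tutorial's prescribed setup; the model name and port are placeholder choices, and it assumes a GPU host with vLLM installed (`pip install vllm`).

```shell
# Launch vLLM's OpenAI-compatible server (model name is an example;
# any Hugging Face causal LM that fits on the hardware will do).
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Query it with the standard OpenAI completions API.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "prompt": "Self-hosting LLMs means",
       "max_tokens": 32}'
```

Production deployments layer more on top of this starting point, such as batching and memory limits, quantization flags, and a reverse proxy in front of the server.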

Target Audience: This tutorial is designed for researchers and practitioners interested in deploying large language models outside of managed cloud services. It is especially relevant for those working in machine learning infrastructure, systems, and applied NLP, as well as for anyone looking to understand the practical aspects of serving generative models at scale.

Mon 10 Nov

Displayed time zone: Eastern Time (US & Canada)

08:30 - 10:00
Tutorials (TUT-66) at Room 344

08:30 (90m) Tutorial
Production-Ready LLMs: Frameworks, Workflows, and Trade-offs
Amirreza Esmaeili, University of British Columbia; Jackie Lam, Charli Capital; Elham Alipour, Charli Capital; Fatemeh Hendijani Fard, University of British Columbia, Okanagan