Talk: Parimal Parag (November 26, 2025 at 10:00 AM, Seminar room N2407, Zoom)
Inference optimization for LLM serving systems
Parimal Parag
Indian Institute of Science at Bangalore
Abstract:
Large language models (LLMs) have led to ground-breaking improvements in the capabilities of generative AI (Gen-AI) applications. Their increased adoption has in turn driven growing volumes of user requests at LLM inference deployments. Common existing implementations of LLM inference engines perform a new prefill every time a prompt departs. We analytically model an inference system with a fixed batch size under a high rate of prompt arrivals, where prefills are scheduled only after a fixed number of prompt departures. We characterize the system throughput, measured as the number of prompts departing per unit time, for different thresholds, and observe that there exists an optimal threshold on the number of prompt departures that maximizes throughput. We verify this observation with vLLM experiments and compare the theoretically predicted optimal threshold to the experimentally observed one.
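The abstract describes a threshold rule: rather than prefilling on every prompt departure, the engine accumulates a fixed number of departures before scheduling a prefill that refills the batch. As a rough illustration only, the following is a minimal toy simulation of such a policy; the saturated-arrivals assumption, the geometric completion times, and all parameter names and values (batch_size, prefill_cost, decode_cost, p_finish) are illustrative assumptions, not the model analyzed in the talk.

```python
import random

def simulate_throughput(threshold, batch_size=32, prefill_cost=8.0,
                        decode_cost=1.0, p_finish=0.05, steps=50_000,
                        seed=0):
    """Toy discrete-time model of threshold-based prefill scheduling.

    Illustrative assumptions (not from the talk): prompt arrivals are
    saturated (a refill is always possible), each decode iteration costs
    `decode_cost` time units, a prefill that refills all freed slots costs
    `prefill_cost`, and each in-flight prompt completes at a given decode
    step independently with probability `p_finish`. The threshold should
    not exceed `batch_size`, or the refill condition is never reached.
    """
    rng = random.Random(seed)
    in_flight = batch_size      # prompts currently in the batch
    departed_since_prefill = 0  # departures accumulated since last prefill
    total_departures = 0
    clock = 0.0
    for _ in range(steps):
        # One decode iteration over the current (possibly partial) batch.
        clock += decode_cost
        finished = sum(rng.random() < p_finish for _ in range(in_flight))
        in_flight -= finished
        departed_since_prefill += finished
        total_departures += finished
        # Schedule a prefill only once `threshold` prompts have departed.
        if departed_since_prefill >= threshold:
            clock += prefill_cost     # decoding stalls during the prefill
            in_flight = batch_size    # refill freed slots from the backlog
            departed_since_prefill = 0
    return total_departures / clock   # departures per unit time

if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16, 32):
        print(f"threshold={k:2d}  throughput={simulate_throughput(k):.3f}")
```

Under these assumptions, sweeping the threshold exhibits the trade-off the abstract alludes to: small thresholds pay frequent prefill stalls, while large thresholds let the batch run partially empty for longer, so throughput peaks at an interior value.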
Biography:
Parimal Parag is currently an associate professor in the Department of Electrical Communication Engineering at the Indian Institute of Science, Bangalore. He worked as a senior systems engineer in R&D at ASSIA Inc. from October 2011 to November 2014. He received his B.Tech. and M.Tech. degrees from the Indian Institute of Technology Madras in fall 2004, and his PhD degree from Texas A&M University in fall 2011. He was at Stanford University in autumn 2010 and at Los Alamos National Laboratory in summer 2007.
His research interests are in the design, performance evaluation, and control of large distributed and networked intelligent systems, applying mathematical tools from queueing theory, information theory, coding theory, and optimization.
Zoom Link: https://tum-conf.zoom-x.de/j/66219422619?pwd=UGJdNUj67Px3mWmueTrxbDb3aiAfd0.1
Meeting ID: 662 1942 2619
Passcode: 545195