Prefetching the Translation Path: MMU-Prefetch Co-Design for I/O Devices and Accelerators
Description
Modern I/O devices and accelerators increasingly rely on virtual memory support to simplify programming, improve isolation, and enable shared virtual address spaces with CPUs. However, address translation on the device side is often expensive: when an IOMMU or device TLB misses, the accelerator may stall for a long time on a multi-level page table walk, and limited translation locality makes such misses frequent. Prior work has shown that translation overhead can become a major bottleneck for accelerators and that hiding or restructuring translation latency is an important architectural problem.
At the same time, traditional prefetching research focuses mainly on data accesses, even though the translation path itself is a natural target for latency hiding. This raises an interesting question: instead of prefetching only data, can we also prefetch translation-related information, such as page-table entries, or redesign the translation path so that address translation and data access overlap more effectively? Both recent and earlier studies suggest that this direction is promising for accelerator-centric systems.
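To make the idea concrete, the following minimal Python sketch models a device-side TLB in front of a simplified two-level page table: while a device streams through a buffer, it issues the page-table walk for the next page ahead of the access that will need it. This is purely illustrative and not taken from any particular paper; all names and parameters (PAGE_SIZE, LEVEL_BITS, walk_page_table, and so on) are assumptions made for the example.

PAGE_SIZE = 4096
LEVEL_BITS = 9                      # 512 entries per level, x86-64-style

tlb = {}                            # virtual page number -> physical frame

def walk_page_table(page_tables, vpn):
    # Simulated two-level walk; each level costs one memory access.
    l1 = vpn >> LEVEL_BITS                       # first-level index
    l2 = vpn & ((1 << LEVEL_BITS) - 1)           # leaf index
    return page_tables[l1][l2]                   # physical frame number

def translate(page_tables, vaddr):
    vpn, off = divmod(vaddr, PAGE_SIZE)
    if vpn not in tlb:                           # device-TLB miss: full walk
        tlb[vpn] = walk_page_table(page_tables, vpn)
    return tlb[vpn] * PAGE_SIZE + off

def prefetch_translation(page_tables, vaddr):
    # Warm the TLB for an address the device expects to touch soon.
    vpn = vaddr // PAGE_SIZE
    if vpn not in tlb:
        tlb[vpn] = walk_page_table(page_tables, vpn)

def stream_buffer(page_tables, base, nbytes, stride):
    for off in range(0, nbytes, stride):
        # Issue the walk for the next page before it is needed.
        prefetch_translation(page_tables, base + off + PAGE_SIZE)
        paddr = translate(page_tables, base + off)   # now likely a TLB hit
        # ... the device would access memory at paddr here ...

# Hypothetical identity-mapped tables, just to make the sketch runnable:
page_tables = [[(i << LEVEL_BITS) | j for j in range(1 << LEVEL_BITS)]
               for i in range(4)]
stream_buffer(page_tables, base=0, nbytes=64 * PAGE_SIZE, stride=256)

In real hardware the prefetched walk would proceed asynchronously alongside the current data access; this sequential sketch only shows where such a prefetch would be issued.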
The goal of this seminar is to build a clear architectural understanding of how prefetching and address translation can be combined for accelerators or I/O systems. The seminar will compare different approaches, identify their main design trade-offs, and discuss whether translation-path prefetching could become an important design direction for future heterogeneous systems.
Prerequisites
- Basic Knowledge of Computer Architecture
- Good English Skills
Contact
Yuanji Ye
yuanji.ye@tum.de
Prefetching for LLM Inference: KV Cache Movement
Description
Large language model (LLM) inference is increasingly limited by memory access rather than by pure computation, especially in long-context and decoding scenarios. A major reason is the high cost of moving model weights and KV cache data across the memory hierarchy. Recent work shows that prefetching can be used to overlap this data movement with ongoing computation or communication.
Compared with traditional hardware prefetching, LLM inference presents a different setting: the prefetched objects are no longer small cache lines but larger units such as KV blocks, and prefetch timing depends on token generation, layer execution order, and runtime scheduling. Some recent studies also suggest that temporal patterns in attention behavior can be exploited to guide KV cache management and cross-token prefetching more effectively.
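As a rough illustration of this overlap, the following Python sketch double-buffers KV blocks: while attention for one layer runs, a background worker already fetches the KV block for the next layer from slower memory. This is a hypothetical stand-in, not the API of any real serving system; fetch_kv_block and attend are stub functions invented for the example.

import time
from concurrent.futures import ThreadPoolExecutor

def fetch_kv_block(layer):
    # Stand-in for moving one KV block from host DRAM or SSD to the GPU.
    time.sleep(0.01)                  # simulated transfer latency
    return f"kv-block-{layer}"

def attend(layer, kv_block):
    # Stand-in for the attention computation of one layer.
    time.sleep(0.01)                  # simulated compute time

def decode_step(num_layers):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_kv_block, 0)    # load layer 0 up front
        for layer in range(num_layers):
            kv = future.result()                   # wait for current block
            if layer + 1 < num_layers:
                # Prefetch the next layer's block while this layer computes.
                future = pool.submit(fetch_kv_block, layer + 1)
            attend(layer, kv)

decode_step(num_layers=4)

When transfers are faster than compute, each decode step fully hides the movement cost; when they are slower, the result() call exposes the residual stall, which is one of the trade-offs the compared systems must manage.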
The goal of this seminar is to build a clear understanding of how prefetching concepts can be extended to AI inference systems. The student will compare recent approaches to KV cache and weight prefetching, summarize their architectural trade-offs, and assess whether LLM-serving workloads require new prefetching principles beyond those established for CPU memory hierarchies.
Prerequisites
- Basic Knowledge of Computer Architecture
- Good English Skills
Contact
Yuanji Ye
yuanji.ye@tum.de