Seminars
Weight Clustering and Sparsity Acceleration Techniques for Efficient Deep Learning Inference on Low-Power Processors
ML, Machine learning, Sparsity Acceleration, Weight Clustering
Description
This survey explores algorithmic and architectural techniques for improving the efficiency of deep neural network (DNN) inference on resource-constrained processors such as embedded CPUs and microcontrollers. Students will survey recent research on weight clustering (quantization-aware weight grouping to reduce compute and memory costs) and sparsity acceleration (exploiting zero-valued weights or activations to skip computations). The survey paper may cover software-level optimizations, compiler-assisted techniques, and hardware extensions — with optional emphasis on RISC-V CPUs or other low-power architectures.
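To make the two techniques concrete, the following sketch (a toy illustration using only NumPy; the function names are hypothetical, and real deployments rely on toolchains such as TensorFlow Lite or TVM) clusters a weight vector into a small shared codebook and skips zero weights in a dot product:
```python
# Illustrative sketch only: weight clustering via k-means and a sparse
# dot product that skips zero weights.
import numpy as np

def cluster_weights(weights, n_clusters=16, n_iters=25):
    """Group weights into n_clusters shared values (the codebook).
    Each weight is then stored as a small index into the codebook."""
    centroids = np.linspace(weights.min(), weights.max(), n_clusters)
    for _ in range(n_iters):
        # Assign every weight to its nearest centroid.
        idx = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = weights[idx == k].mean()
    return centroids, idx  # 4-bit indices instead of 32-bit floats

def sparse_dot(weights, activations, threshold=0.0):
    """Skip multiply-accumulates for (near-)zero weights."""
    nz = np.abs(weights) > threshold
    return float(weights[nz] @ activations[nz])

w = np.random.randn(256) * (np.random.rand(256) > 0.7)  # ~70% zeros
codebook, indices = cluster_weights(w)
w_clustered = codebook[indices]              # decompressed weights
y = sparse_dot(w_clustered, np.random.randn(256))
```
Storing 4-bit codebook indices instead of 32-bit floats cuts weight memory by roughly 8x, while the zero-skipping loop saves one multiply-accumulate per pruned weight.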
Sample References:
- https://ieeexplore.ieee.org/abstract/document/10546844
- https://ieeexplore.ieee.org/document/11113397
- https://ieeexplore.ieee.org/ielx7/43/9716240/09380321.pdf?tp=&arnumber=9380321&isnumber=9716240&ref=aHR0cHM6Ly9zY2hvbGFyLmdvb2dsZS5jb20v
- https://upcommons.upc.edu/server/api/core/bitstreams/bcfd5ee6-208e-42ba-bf98-e040809f4443/content
Contact
Philipp van Kempen
Staying Relevant: Investigation of Embedded Software Project Requirements to Ensure Easy, Safe, and Secure OTA-Update Ability
Description
To keep modern embedded systems always up to date and relevant, one needs to ensure that the software executed on the ECU is current and secure. A common method is updating the software after deployment via networks, a process called Over-the-Air (OTA) updates. However, achieving this in a safe and secure manner requires fulfilling certain requirements. This seminar aims to identify the requirements a software project and operating system must fulfill to successfully support OTA updates.
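As a rough illustration of the "safe and secure" aspect, the following sketch shows one common pattern (A/B slots with image verification before an atomic switch). All names and the flow are hypothetical simplifications of what frameworks such as Mender or SWUpdate provide:
```python
# Toy sketch of a verify-then-apply OTA update with A/B slots.
# Hypothetical names; real systems use signed manifests, secure boot,
# and a bootloader-managed slot switch.
import hashlib

slots = {"A": b"firmware-v1", "B": None}   # two firmware slots
active = "A"
trusted_digests = {hashlib.sha256(b"firmware-v2").hexdigest()}  # vendor-approved

def apply_update(image: bytes) -> str:
    """Write the image to the inactive slot and switch only if it verifies."""
    global active
    if hashlib.sha256(image).hexdigest() not in trusted_digests:
        raise ValueError("rejected: image failed verification")
    inactive = "B" if active == "A" else "A"
    slots[inactive] = image      # old image stays intact for rollback
    active = inactive            # on real hardware: an atomic bootloader flag
    return active

apply_update(b"firmware-v2")     # boots from slot B; slot A remains as fallback
```
A boot counter or watchdog would revert to the old slot if the new image fails its self-test, which is the essence of a safe OTA update.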
Research suggests that certain patterns during project setup can ensure that the compiled binary remains easily maintainable and updatable later on. The work in [1] provides a good introduction to the topic, and its sections "Analysing Update Requirements in IoT Operating Systems" and "Software Module Management" offer a good starting point for analyzing the requirements of the embedded software system. The works presented in [2] and [3] provide further background and can help to dive deeper into the topic, allowing for a more detailed view of the requirements of the embedded software running on the distributed ECUs.
This seminar aims to investigate the current state of the art in embedded software realizing OTA update strategies. Through a critical analysis of existing literature and case studies, participants will explore the requirements and high-level concepts behind over-the-air updates and how they are realized and ensured in current embedded software projects. By examining the intricacies of OTA updates, students will gain a deeper, holistic understanding of the aftermarket maintenance of modern embedded systems.
Bibliography
[1] J. Bauwens, P. Ruckebusch, S. Giannoulis, I. Moerman and E. D. Poorter, "Over-the-Air Software Updates in the Internet of Things: An Overview of Key Principles," in IEEE Communications Magazine, vol. 58, no. 2, pp. 35-41, February 2020, doi: 10.1109/MCOM.001.1900125, https://ieeexplore.ieee.org/abstract/document/8999425
[2] Lethaby, Nick. "A more secure and reliable OTA update architecture for IoT devices." Texas Instruments (2018). https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/968/SecureOTA_5F00_TI.pdf
[3] G. Jurkovic and V. Sruk, "Remote firmware update for constrained embedded systems," 2014 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 2014, pp. 1019-1023, doi: 10.1109/MIPRO.2014.6859718. https://ieeexplore.ieee.org/abstract/document/6859718
Placement of Systolic Arrays for Neural Network Accelerators
Description
Systolic arrays are a proven architecture for parallel processing across various applications, offering design flexibility, scalability, and high efficiency. With the growing importance of neural networks in many areas, there is a need for efficient processing of the underlying computations, such as matrix multiplications and convolutions. These computations can be executed with a high degree of parallelism on neural network accelerators utilizing systolic arrays.
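As a mental model of how such an array computes a matrix product, consider this toy cycle-level sketch (illustrative only; the names are hypothetical), in which processing element (i, j) performs one multiply-accumulate per cycle on skewed input streams:
```python
# Toy cycle-level sketch of an output-stationary N x N systolic array
# computing C = A @ B.
import numpy as np

def systolic_matmul(A, B):
    N = A.shape[0]                      # assume square N x N operands
    C = np.zeros((N, N))
    # Input skewing: PE (i, j) sees A[i, k] and B[k, j] at cycle i + j + k.
    for cycle in range(3 * N - 2):
        for i in range(N):
            for j in range(N):
                k = cycle - i - j       # operand pair arriving this cycle
                if 0 <= k < N:
                    C[i, j] += A[i, k] * B[k, j]   # one MAC per PE per cycle
    return C

A, B = np.random.randn(4, 4), np.random.randn(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```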
Like any application-specific integrated circuit (ASIC) or field-programmable gate array (FPGA) design, neural network accelerators go through the standard phases of chip design. However, treating systolic array hardware designs the same way as any other design may lead to suboptimal results, since exploiting the regular structure of systolic arrays can yield better solution quality [1].
Relevant works for this seminar topic include that of Fang et al. [2], where a regular placement is used as an initial solution and then iteratively improved using the RePlAce [3] placement algorithm. The placement of systolic arrays on FPGAs is discussed by Hu et al. [4], where the processing elements of the systolic array are placed on the DSP columns more efficiently than by the default placement of commercial tools.
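The "regular placement as an initial solution" idea can be pictured with a small sketch (hypothetical names; a real flow would hand this result to an iterative global placer such as RePlAce):
```python
# Sketch: lay out the PEs of a rows x cols systolic array on a uniform
# grid before iterative refinement, preserving the array's 2D regularity.
def regular_initial_placement(rows, cols, pe_width, pe_height, spacing=0.0):
    """Return {(row, col): (x, y)} coordinates for each processing element."""
    placement = {}
    for r in range(rows):
        for c in range(cols):
            x = c * (pe_width + spacing)
            y = r * (pe_height + spacing)
            placement[(r, c)] = (x, y)
    return placement

# Neighbouring PEs start out adjacent, so the short nearest-neighbour
# wires of the systolic dataflow tend to stay short after refinement.
initial = regular_initial_placement(rows=8, cols=8, pe_width=10, pe_height=10)
```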
In this seminar, you will investigate different macro and cell placement approaches, focusing on methods that specifically consider systolic array placement. If you have questions regarding this topic, please feel free to contact me.
[1] S. I. Ward et al., "Structure-Aware Placement Techniques for Designs With Datapaths," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 2, pp. 228-241, Feb. 2013, doi: https://doi.org/10.1109/TCAD.2012.2233862
[2] D. Fang, B. Zhang, H. Hu, W. Li, B. Yuan and J. Hu, "Global Placement Exploiting Soft 2D Regularity," in ACM Transactions on Design Automation of Electronic Systems, vol. 30, no. 2, pp. 1-21, Jan. 2025, doi: https://doi.org/10.1145/3705729
[3] C. -K. Cheng, A. B. Kahng, I. Kang and L. Wang, "RePlAce: Advancing Solution Quality and Routability Validation in Global Placement," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 38, no. 9, pp. 1717-1730, Sept. 2019, doi: https://doi.org/10.1109/TCAD.2018.2859220
[4] H. Hu, D. Fang, W. Li, B. Yuan and J. Hu, "Systolic Array Placement on FPGAs," 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, USA, 2023, pp. 1-9, doi: https://doi.org/10.1109/ICCAD57390.2023.10323742
Contact
benedikt.schaible@tum.de
The ZSim Performance Simulator
Description
Performance simulation is a crucial step in modern design space exploration, enabling the identification of optimal systems. ZSim enables fast and accurate simulation at the microarchitectural level and targets huge, thousand-core systems.
"Architectural simulation is time-consuming, and the trend
towards hundreds of cores is making sequential simulation
even slower. Existing parallel simulation techniques either
scale poorly due to excessive synchronization, or sacrifice ac-
curacy by allowing event reordering and using simplistic con-
tention models. As a result, most researchers use sequential
simulators and model small-scale systems with 16-32 cores.
With 100-core chips already available, developing simulators
that scale to thousands of cores is crucial.
We present three novel techniques that, together, make
thousand-core simulation practical. First, we speed up de-
tailed core models (including OOO cores) with instruction-
driven timing models that leverage dynamic binary trans-
lation. Second, we introduce bound-weave, a two-phase
parallelization technique that scales parallel simulation on
multicore hosts efficiently with minimal loss of accuracy.
Third, we implement lightweight user-level virtualization
to support complex workloads, including multiprogrammed,
client-server, and managed-runtime applications, without
the need for full-system simulation, sidestepping the lack
of scalable OSs and ISAs that support thousands of cores.
We use these techniques to build zsim, a fast, scalable,
and accurate simulator. On a 16-core host, zsim models a
1024-core chip at speeds of up to 1,500 MIPS using simple
cores and up to 300 MIPS using detailed OOO cores, 2-3 or-
ders of magnitude faster than existing parallel simulators.
Simulator performance scales well with both the number
of modeled cores and the number of host cores. We vali-
date zsim against a real Westmere system on a wide variety
of workloads, and find performance and microarchitectural
events to be within a narrow range of the real system." - Daniel Sanchez and Christos Kozyrakis: "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Sytsems" 2013
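To give an intuition for the bound-weave technique named in the abstract, here is a heavily simplified trace-level sketch (all names are hypothetical; the real simulator works at microarchitectural detail): cores are first simulated in parallel with optimistic latencies, then their recorded accesses are replayed in order to account for contention.
```python
# Grossly simplified illustration of a two-phase (bound-weave-like)
# parallel simulation on memory-access traces.
from concurrent.futures import ThreadPoolExecutor

def bound_phase(core_id, trace):
    """Phase 1: run one core alone, assuming zero contention."""
    events, cycle = [], 0
    for addr in trace:
        cycle += 1                       # optimistic 1-cycle access
        events.append((cycle, core_id, addr))
    return events

def weave_phase(all_events):
    """Phase 2: merge per-core events and serialize conflicting accesses."""
    busy_until, corrected = 0, []
    for cycle, core_id, addr in sorted(all_events):
        start = max(cycle, busy_until)   # wait if the shared resource is busy
        busy_until = start + 1
        corrected.append((start, core_id, addr))
    return corrected

traces = {0: [0x100, 0x104], 1: [0x100, 0x108]}
with ThreadPoolExecutor() as pool:
    per_core = list(pool.map(lambda c: bound_phase(c, traces[c]), traces))
events = [e for core_events in per_core for e in core_events]
print(weave_phase(events))
```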
Contact
conrad.foik@tum.de
Innovative Memory Architectures in DNN Accelerators
Description
With the growing complexity of neural networks, more efficient and faster processing solutions are vital to enable the widespread use of artificial intelligence. Systolic arrays are among the most popular architectures for energy-efficient and high-throughput DNN hardware accelerators.
While many works implement DNN accelerators using systolic arrays on FPGAs, several application-specific integrated circuit (ASIC) designs from industry and academia have also been presented [1-3]. Such accelerators place high demands on the memory system, both in terms of data availability and latency hiding; innovative memory architectures can enable more efficient data access, reducing latency and bridging the gap towards even more powerful DNN accelerators.
One example is the Eyeriss v2 ASIC [1], which uses a distributed Global Buffer (GB) layout tailored to the demands of their row-stationary systolic array dataflow.
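As a back-of-the-envelope illustration of why a distributed buffer helps (the latencies and hit rate below are invented for illustration, not Eyeriss v2's actual parameters):
```python
# Toy model: expected buffer access latency when a fraction of accesses
# hits a bank local to the PE cluster instead of a far, shared buffer.
def avg_access_latency(local_hit_rate, local_latency=1, global_latency=8):
    return (local_hit_rate * local_latency
            + (1 - local_hit_rate) * global_latency)

monolithic = avg_access_latency(local_hit_rate=0.0)   # every access is far: 8.0
distributed = avg_access_latency(local_hit_rate=0.9)  # mostly in-cluster: 1.7
```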
In this seminar, a survey of state-of-the-art DNN accelerator designs and design frameworks shall be created, focusing on their memory hierarchies.
References and Further Resources:
[1] Y.-H. Chen, T.-J. Yang, J. Emer and V. Sze, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, June 2019, doi: https://doi.org/10.1109/JETCAS.2019.2910232
[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. "DianNao family: energy-efficient hardware accelerators for machine learning." In Commun. ACM 59, 11 (November 2016), 105–112. https://doi.org/10.1145/2996864
[3] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. "In-Datacenter Performance Analysis of a Tensor Processing Unit." In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3079856.3080246
[4] Rui Xu, Sheng Ma, Yang Guo, and Dongsheng Li. 2023. A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Comput. Surv. 56, 1, Article 20 (January 2024), 37 pages. https://doi.org/10.1145/3604802
[5] Bo Wang, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, and Tiejun Li. 2024. "SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow." ACM Trans. Des. Autom. Electron. Syst. 29, 2, Article 26 (March 2024), 32 pages. https://doi.org/10.1145/3634703
Contact
benedikt.schaible@tum.de