Seminar on Topics in Electronic Design Automation

Lecturer (assistant)
Number: 0000003270
Type: Seminar
Duration: 3 SWS
Term: Winter semester 2024/25
Language of instruction: English
Position within curricula: See TUMonline

Dates

Admission information

See TUMonline
Note: Students have to choose a seminar topic BEFORE the introductory lesson. To do so, students directly contact the supervisor of the topic they are interested in. Topics are assigned on a first-come, first-served basis. The supervisor needs to confirm that a topic has been assigned to the student; only after this confirmation is the student considered registered for the seminar. For a list of topics refer to the following link: https://www.ce.cit.tum.de/eda/lehrveranstaltungen/seminare/seminar-on-topics-in-electronic-design-automation/

Objectives

At the end of the seminar, the students are able to present a new idea or an existing approach in the area of computer-aided circuit and system design in an understandable and convincing manner. For this purpose, the following competencies will be acquired:

  • The students are able to independently familiarize themselves with a scientific topic in the field of electronic design automation.
  • The students are able to present their topic in a structured way, covering problem formulation, state of the art, goals, methods, and results.
  • The students can present their topic, according to the above-mentioned structure, orally in the form of a presentation, visually as a poster and a set of slides, and in the form of a written report.

Description

Specific seminar topics in the area of electronic design automation will be offered. Examples are analog design methodology, digital design methodology, layout synthesis, and system-level design methodology. The students work independently on a scientific topic and write a paper of 4 pages. At the end of the seminar, the students present their topic during a scientific talk. In a subsequent discussion the topic will be treated in-depth.

Prerequisites

No specific previous knowledge required.

Teaching and learning methods

Learning method: Students elaborate a given scientific topic by themselves and are advised by a research assistant.

Teaching method: Introductory lessons will be given, covering advice on the work procedure during the seminar, scientific writing techniques, and the preparation of an oral presentation. The students discuss further (specific) details with the advising research assistants on an individual basis.

Media: All current techniques for preparing and presenting papers and talks will be applied, e.g., blackboard, whiteboard, electronic slides and projector, electronic word processing, and electronic slide processing.

Examination

The examination is based on a scientific elaboration. It consists of a written part (50%) in the form of a paper (4 pages) and an oral part (50%) in the form of a presentation of approximately 30 minutes (including a subsequent discussion). Through the scientific elaboration, students show that they can prepare, structure and present, e.g., the state of the art, a new idea or an existing approach in the area of electronic design automation.

Recommended literature

A set of topics and related literature is given at the start of the course. Each participant selects his/her topic.

Links

Topics - online

Please find the topics for the WT 24/25 below.

The topics are handed out on a first-come, first-served basis. Contact the supervisor of the topic to get more information and to reserve a topic. Please make sure that you get a confirmation from your supervisor once you have selected a topic. We are looking forward to seeing you in the seminar.

Seminars

Methodologies for Accelerating Deep Learning Inference on Different Tensor Machines

Description

 

The proliferation of deep learning (DL) applications has created an unprecedented demand for high-performance hardware accelerators capable of handling the computationally intensive tasks involved in DL processing. To address this need, various instruction set architectures (ISAs) have introduced specialized matrix extensions, such as Arm's Scalable Matrix Extension (SME) [2] and Intel's Advanced Matrix Extensions (AMX) [1], to accelerate the matrix operations that are at the heart of DL computations.

 

However, the proprietary nature of these implementations has limited their adoption and customization, highlighting the need for open-source and flexible solutions. The RISC-V ISA, with its open-source architecture, has emerged as a promising platform for developing custom extensions for tensor operations [3] [4]. Researchers have proposed various methodologies for both dependent and independent matrix extensions, including the use of matrix registers and accumulator registers, to improve performance, efficiency, and scalability.
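The accumulator-register style of matrix extension mentioned above centers on a tile-level outer-product-accumulate primitive. The following is a minimal plain-Python sketch of that idea (the function names `mopa` and the fixed tile size are illustrative assumptions, not any ISA's actual definition):

```python
# Sketch of the outer-product-accumulate primitive behind matrix
# extensions: an accumulator tile za is updated with the outer product
# of two length-N vector registers in a single "instruction".
# Illustrative model only; names and tile size are invented here.

N = 4  # tile dimension (real extensions derive this from vector length)

def mopa(za, a, b):
    """za[i][j] += a[i] * b[j] over a whole tile at once."""
    for i in range(N):
        for j in range(N):
            za[i][j] += a[i] * b[j]
    return za

def matmul_via_outer_products(A, B):
    """C = A @ B built from K rank-1 tile updates, as a tiled kernel would."""
    K = len(A[0])
    za = [[0] * N for _ in range(N)]
    for k in range(K):
        col_a = [A[i][k] for i in range(N)]   # k-th column of A
        row_b = B[k]                          # k-th row of B
        mopa(za, col_a, row_b)
    return za
```

A matrix multiplication thus reduces to K accumulator updates, which is exactly the reuse pattern such extensions accelerate.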

 

The seminar will provide a comprehensive overview of the current state of custom extensions for tensor operations, highlighting the advantages and limitations of existing programming models, design methodologies, and performance evaluation techniques.

 

The seminar should cover the following topics:

  1. Research existing programming models for custom extensions for tensor operations, particularly for RISC-V, including their advantages and limitations.

  2. Design methodologies for different tensor extensions, spanning DL compilation, design, simulation, and deployment, highlighting their strengths and weaknesses.

  3. Analysis and evaluation of the performance of different programming models for custom extensions of tensor operations, considering factors such as parallelism, latency, and data transfer.

 

References:

 

[1] Intel® Architecture Instruction Set Extensions and Future Features Programming Reference

 

[2] Arm® Architecture Reference Manual Supplement, The Scalable Matrix Extension (SME), for Armv9-A

 

[3] V. Verma, T. Tracy II, and M. R. Stan, “EXTREM-EDGE - EXtensions To RISC-V for Energy-efficient ML inference at the EDGE of IoT,” Sustainable Computing: Informatics and Systems, vol. 35, p. 100742, 2022.

 

[4] M. Perotti, Y. Zhang, M. Cavalcante, E. Mustafa, and L. Benini, “MX: Enhancing RISC-V's Vector ISA for Ultra-Low Overhead, Energy-Efficient Matrix Multiplication,” 2024.

 

 

Contact

Supervisor:

Conrad Foik

Compression and Decompression Techniques for Activation data

Description

 

Deep Neural Networks (DNNs) offer possibilities for tackling practical challenges and broadening the scope of Artificial Intelligence (AI) applications. The considerable computational and memory needs of current neural networks stem from the increasing complexity of network structures, which involve numerous layers containing millions of parameters. The energy consumption during DNN inference is predominantly attributed to the access and processing of these parameters. One of the main areas to tackle is the storage and access of the activation data computed during inference.

 

The objective of this seminar is to conduct a comprehensive literature survey of compression and decompression techniques available for activation data and to gather the advantages and disadvantages of the available solutions. Depending on the time and the reviewed content, the survey can be extended to find a hardware-efficient technique.
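Activation maps after ReLU layers are typically sparse, which is why zero-run compression is a recurring theme in this literature. The sketch below is loosely inspired by the run-length coding used in Eyeriss [1], but the exact encoding (zero-count, value pairs) is a simplification chosen here for illustration:

```python
# Zero run-length encoding of an activation stream: each nonzero value
# is stored together with the count of zeros preceding it. Loosely
# inspired by the RLC scheme in Eyeriss [1]; not the on-chip format.

def rle_encode(activations):
    encoded, zeros = [], 0
    for a in activations:
        if a == 0:
            zeros += 1
        else:
            encoded.append((zeros, a))
            zeros = 0
    encoded.append((zeros, None))  # trailing zeros carry no value
    return encoded

def rle_decode(encoded):
    decoded = []
    for zeros, value in encoded:
        decoded.extend([0] * zeros)
        if value is not None:
            decoded.append(value)
    return decoded
```

The scheme is lossless, and its compression ratio grows with activation sparsity, which is the trade-off axis most of the surveyed papers explore.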

 

[1] Chen, Yu-Hsin, Yang, Tien-Ju, Emer, Joel and Sze, Vivienne (2019): Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices, IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9(2): 292–308: Conference Name: IEEE Journal on Emerging and Selected Topics in Circuits and Systems.

 

[2] Wang, Cong, Xiao, Yanru, Gao, Xing, Li, Li and Wang, Jun (2023): A Framework for Behavioral Biometric Authentication Using Deep Metric Learning on Mobile Devices, IEEE Transactions on Mobile Computing 22(1): 19–36: Conference Name: IEEE Transactions on Mobile Computing.

 

[3] Hawlader, Faisal, Robinet, François and Frank, Raphaël (2023): Poster: Lightweight Features Sharing for Real-Time Object Detection in Cooperative Driving, 2023 IEEE Vehicular Networking Conference (VNC), pp. 159–160: ISSN: 2157-9865.

 

[4] Lee, Minjae, Park, Seongmin, Kim, Hyungmin, Yoon, Minyong, Lee, Janghwan, Choi, Jun Won, Kim, Nam Sung, Kang, Mingu and Choi, Jungwook (2024): SPADE: Sparse Pillar-based 3D Object Detection Accelerator for Autonomous Driving, 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 454–467: ISSN: 2378-203X.

 

[5] Price, Ilan and Tanner, Jared (2023): Improved Projection Learning for Lower Dimensional Feature Maps, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5: ISSN: 2379-190X.

 

 

Contact

 

Andrew.Stevens@infineon.com ; Mounika.Vaddeboina@infineon.com

 

Supervisor:

Conrad Foik

Tensor program/graph rewriting-based optimization techniques

Description

 

In machine learning (ML), tensor kernels often translate into pure mathematical expressions. This presents an interesting prospect for optimization through term rewriting [1]. A fundamental optimization technique used by deep learning frameworks is graph rewriting [2] [3]. In production frameworks, the decision of which rewrite rules to apply, and in what sequence, rests heavily on heuristics. Research indicates that searching for a more optimal sequence of substitutions, rather than relying solely on heuristics, can lead to the discovery of better tensor computation graphs.
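To make the rule-application idea concrete, here is a toy term rewriter over tensor expressions encoded as nested tuples, in the spirit of the graph substitutions explored in TASO [2]. The rules, the engine, and the expression encoding are illustrative inventions, not any framework's API:

```python
# A toy bottom-up term rewriter: expressions are nested tuples like
# ("mul", x, 1); rules return a replacement or None. Illustrative only.

def rewrite(expr, rules):
    """Rewrite children first, then apply rules until a fixed point."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(e, rules) for e in expr[1:])
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(expr)
            if new is not None:
                expr, changed = new, True
    return expr

# Example algebraic rules: x * 1 -> x, x + 0 -> x, t(t(x)) -> x
def mul_one(e):
    if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1:
        return e[1]

def add_zero(e):
    if isinstance(e, tuple) and e[0] == "add" and e[2] == 0:
        return e[1]

def double_transpose(e):
    if (isinstance(e, tuple) and e[0] == "t"
            and isinstance(e[1], tuple) and e[1][0] == "t"):
        return e[1][1]

RULES = [mul_one, add_zero, double_transpose]
```

Production systems differ precisely in how they pick which rule to fire next; this sketch just applies them greedily to a fixed point.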

 

Moreover, term rewriting techniques prove beneficial in optimizing low-level tensor programs [4] alongside tensor graphs. Traditionally, application programmers manually add hardware function calls, or compilers incorporate them through handcrafted accelerator-specific extensions. Integrating domain-specific instruction or operation support into an existing compiler typically involves custom pattern matching to map resource-intensive tensor operations from applications to hardware-specific invocations. Despite these modifications related to pattern matching, users may still need to manually adjust their applications to aid the compiler in identifying opportunities for dispatching operations to target hardware, such as by altering data types or optimizing loops.

 

Leveraging term rewriting techniques offers a promising approach for streamlining various transformation and mapping tasks both for tensor graphs as well as programs. This approach not only enhances efficiency but also holds the potential for simplifying the deployment of DSLs, thus advancing the field of machine learning and computational optimization.

 

This seminar topic should cover literature research on existing rewriting techniques on tensor programs and graphs which includes:

  1. Research on existing rewriting techniques
  2. Their application to tensor programs and graphs
  3. Challenges and the relations between different rewriting techniques
  4. Applications in and with existing machine learning compiler frameworks

 

[1] Franz Baader et al. 1998. Term Rewriting and All That. Cambridge University Press. https://doi.org/10.1017/CBO9781139172752.

[2] Zhihao Jia et al. 2019. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery, New York, NY, USA, 47–62. https://doi.org/10.1145/3341301.3359630.

[3] Yang, Y., et al. (2021). Equality Saturation for Tensor Graph Superoptimization. ArXiv, abs/2101.01332.

[4] Gus Henry Smith et al. 2021. Pure Tensor Program Rewriting via Access Patterns (Representation Pearl). In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming (Virtual, Canada) (MAPS 2021). Association for Computing Machinery, New York, NY, USA, 21–31. https://doi.org/10.1145/3460945.3464953.

 

 

Contact

Supervisor:

Conrad Foik

Physically Aware RTL Generation: Navigating Design and Performance Challenges

Description

Currently, digital IPs are implemented using hardware description languages at the Register Transfer Level (RTL) of abstraction. In principle, RTL models are independent of physical features and constraints, as they are mapped to hardware implementations through synthesis and RTL-to-GDSII tools. However, the chosen RTL implementation still significantly impacts the quality of the result, particularly in technology-dependent features such as timing, power consumption, and overall performance.

This seminar delves into how to bridge the gap between technology-independent RTL code and technology-dependent features by making RTL generators aware of physical characteristics. By doing so, it becomes possible to optimize designs for real-world applications, ensuring that they meet both performance and power efficiency targets. Additionally, the generation flow should integrate further constraints to refine and enhance the implementation outcomes. This dual approach of awareness and constraint-driven design aims to elevate current design and generation methodologies in several key areas:

  • More efficient IPs through physical-aware RTL: Incorporating physical features directly into RTL can lead to the development of digital IPs that are not only high-performing but also power-efficient. This ensures that designs can meet stringent operational requirements while maintaining energy efficiency.
  • Increased automation in IP generation: By enhancing automation in the IP generation process, it becomes possible to reduce manual effort, minimize errors, and accelerate time-to-market. This increased level of automation can streamline workflows and enable designers to focus on innovation rather than routine tasks.
  • Enhanced quality of results: By making RTL generators cognizant of physical features, the resulting hardware implementations can achieve superior quality in terms of both timing and power consumption. This ensures that the end products are reliable and meet the required specifications.
  • Reduced design iterations: With physical-aware RTL and automated constraint integration, the number of design iterations needed to achieve optimal results can be significantly reduced. This not only saves time but also reduces costs associated with prolonged design cycles.

By addressing these areas, the seminar aims to provide insights and practical solutions for overcoming the inherent challenges in the current design and generation approaches, paving the way for more efficient, automated, and high-quality digital IPs.

 

[1] Alex Carsello, James Thomas, Ankita Nayak, Po-Han Chen, Mark Horowitz, Priyanka Raina, and Christopher Torng. 2021. Enabling Reusable Physical Design Flows with Modular Flow Generators. arXiv preprint arXiv:2111.14535 (2021).

[2] Alon Amid, David Biancolin, Abraham Gonzalez, Daniel Grubb, Sagar Karandikar, Harrison Liew, Albert Magyar, Howard Mao, Albert Ou, Nathan Pemberton, et al. 2020. Chipyard: Integrated design, simulation, and implementation framework for custom socs. IEEE Micro 40, 4 (2020), 10–21

[3] Edward Wang, Colin Schmidt, Adam Izraelevitz, John Wright, Borivoje Nikolić, Elad Alon, and Jonathan Bachrach. 2020. A methodology for reusable physical design. In 2020 21st International Symposium on Quality Electronic Design (ISQED). IEEE, 243–249.

Contact

Mohamed.Badawy@infineon.com, Wolfgang.Ecker@tum.de

Supervisor:

Conrad Foik

Comparison of ARM and RISC-V ISA Bit-Manipulation Instructions

Description

Different ARM processors are de-facto standards in embedded SoCs and are on their way to being established in data centers and high-performance computing applications. For about 15 years, a competitor has been emerging that follows a completely different approach: an open and free-to-use instruction set definition that leaves room for various implementations. Support for ISA subsets even allows special instruction extensions as well as both open-source and proprietary solutions. Only one thing must be guaranteed: support of the RV32I (or RV64I) base instruction set.
The scope of this seminar is the comparison of the RV Zb* instruction set with instructions supported by ARM.

  • Starting point are the ISA descriptions in [1-4]. In addition, a literature search shall be performed for publications comparing the two ISAs.
  • As a first step, the system state relevant to the instruction definitions (e.g., program counter, register files) shall be enumerated and compared.
  • As a next step, a summary of the RISC-V B (Zb*) and the related ARM bit-manipulation instructions should be made. An important aspect is the size of the instructions.
  • Next, a comparison strategy shall be set up, e.g., by grouping instructions with the same, similar, or different behavior, or by the registers involved.
  • Further, it should be elaborated how instructions supported by ARM can be mimicked by a sequence of RISC-V instructions.
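As a flavor of the last step, ARM's unsigned bit-field extract (UBFX) has no single equivalent in the RISC-V base ISA but can be mimicked by two logical shifts. The Python model below is an illustration of the methodology, not generated from either ISA specification:

```python
# ARM UBFX (unsigned bit-field extract) modeled in Python, and the
# same result obtained via the two-instruction RISC-V sequence
# SLLI + SRLI. Illustrative model for a 32-bit register width.

XLEN = 32
MASK = (1 << XLEN) - 1

def ubfx(rn, lsb, width):
    """ARM UBFX: extract <width> bits of rn starting at bit <lsb>."""
    return (rn >> lsb) & ((1 << width) - 1)

def slli(rs, shamt):        # RISC-V shift left logical immediate
    return (rs << shamt) & MASK

def srli(rs, shamt):        # RISC-V shift right logical immediate
    return (rs & MASK) >> shamt

def ubfx_riscv(rn, lsb, width):
    """Mimic: slli rd, rn, XLEN-lsb-width ; srli rd, rd, XLEN-width."""
    return srli(slli(rn, XLEN - lsb - width), XLEN - width)
```

Such executable models make the "same / similar / different behavior" grouping of the comparison strategy directly checkable.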

The findings should be summarized in a 4-page paper which might be submitted to a workshop or conference with RISC-V scope and should be presented in an EDA seminar.

 

[1] Cortex-M0 Devices Generic User Guide Version 1.0, Chapter 3. The Cortex-M0 Instruction Set https://developer.arm.com/documentation/dui0497/a/the-cortex-m0-instruction-set
[2] RISC-V Specifications https://riscv.org/technical/specifications/
[3] RISC-V Extension Specifications https://wiki.riscv.org/display/HOME/Ratified+Extensions
[4] A Comparative Analysis of ARM and RISC-V ISAs for Deeply Embedded Systems https://portal.fis.tum.de/en/publications/a-comparative-analysis-of-arm-and-risc-v-isas-for-deeply-embedded

Contact

Supervisor:

Conrad Foik

Survey of the Annual Reactive Synthesis Competition (SYNTCOMP)

Description

Today, digital hardware design is most commonly done on the Register-Transfer Level (RTL). This low abstraction level uses sequential logic to describe the behavior and structure of digital circuits using Hardware Description Languages (HDLs) like VHDL or SystemVerilog. To lower design and verification complexity, researchers have proposed moving beyond RTL and using so-called temporal logic instead, which allows for a higher-level expression of temporal correlations. For example, implementing a multi-cycle delay between two actions in VHDL or SystemVerilog requires manually describing this behavior in sequential logic, such as FSMs, buffers, or counters. Conversely, temporal logic can directly specify that action A implies action B after N cycles or even that an event will eventually happen.
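The multi-cycle example above can be stated directly in temporal logic. A sketch in standard LTL notation (G globally, X next, F eventually; the atomic propositions a, b, e are placeholders), here with N = 3 cycles:

```latex
% Action a implies action b exactly three cycles later, globally:
\mathbf{G}\,\bigl(a \rightarrow \mathbf{X}\,\mathbf{X}\,\mathbf{X}\, b\bigr)
% Some event e eventually occurs:
\mathbf{F}\, e
```

In RTL, the same behavior would require an explicit counter or shift register; the formulas state the intent directly, which is exactly the abstraction gain described above.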


Despite the advantages of easily expressing complex temporal correlations, synthesizing actual circuits from temporal logic remains an open challenge. Acknowledging this gap, the annual Reactive Synthesis Competition (SYNTCOMP) added a track to compare and benchmark logic synthesis tools for Linear Temporal Logic (LTL) in 2016 [1].

This seminar aims to conduct a comprehensive literature survey of the SYNTCOMP problem statements, benchmarks, and submitted synthesis tools [2, 3]. Students will critically analyze the advantages and potential drawbacks of these tools and algorithms to provide a perspective on their applicability in digital design. Depending on the findings, the seminar can also be extended with other (more theoretical) approaches to temporal logic synthesis beyond SYNTCOMP.

 

[1] S. Jacobs and R. Bloem, “The Reactive Synthesis Competition: SYNTCOMP 2016 and Beyond,” 2016, https://arxiv.org/abs/1611.07626
[2] S. Jacobs, G. A. Pérez, and P. Schlehuber-Caissier, “The Reactive Synthesis Competition,” 2023, https://www.syntcomp.org/
[3] P. J. Meyer, S. Sickert, and M. Luttenberger, “Strix: Explicit Reactive Synthesis Strikes Back!,” 2018, https://strix.model.in.tum.de/publications/MeyerSL18.pdf

Contact

RobertNiklas.Kunzelmann@infineon.com, Wolfgang.Ecker@tum.de

Supervisor:

Conrad Foik

Efficient Transformer Models Using Low-Rank Representations

Description

Transformer models have become the cornerstone of many state-of-the-art solutions in natural language processing (NLP) [1,2] and computer vision tasks [3,4]. These models, known for their self-attention mechanisms, excel at capturing complex dependencies within data. However, the computational and memory demands of transformer models present significant challenges, particularly for deployment in resource-constrained environments.

Low-rank approximation methods offer a promising solution to mitigate these challenges by reducing the dimensionality of the model parameters, effectively pruning the model while retaining most of its performance. These methods decompose the weight matrices of the model into products of lower-rank matrices, thereby reducing the number of parameters and the computational complexity [5].
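The parameter saving of such a factorization is easy to quantify: replacing a weight matrix W of shape m x n by a product U (m x r) times V (r x n) stores r(m + n) instead of mn parameters. A small counting sketch (no training involved; the example sizes are arbitrary):

```python
# Parameter-count effect of factorizing W (m x n) into U (m x r) @ V
# (r x n): storage drops from m*n to r*(m+n), which pays off whenever
# r < m*n / (m + n). Pure counting; no model is built here.

def params_dense(m, n):
    return m * n

def params_lowrank(m, n, r):
    return r * (m + n)

def compression_ratio(m, n, r):
    return params_dense(m, n) / params_lowrank(m, n, r)
```

For instance, a square 4096 x 4096 projection factorized at rank 256 keeps one eighth of its parameters; the surveyed methods [6,7,8] differ mainly in how the rank is chosen and how accuracy is recovered.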


This seminar will delve into approaches that use low-rank methods for pruning transformer models, highlighting the latest research [6,7,8] and future directions.


[1] A. Vaswani et al., “Attention is All you Need”, NeurIPS, 2017, vol. 30.
[2] OpenAI et al., “GPT-4 Technical Report”, arXiv [cs.CL]. 2023.
[3] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR, 2021.
[4] B. Cheng et al., “Masked-attention mask transformer for universal image segmentation.” CVPR, 2022.
[5] X. Liu and K. K. Parhi, “Tensor Decomposition for Model Reduction in Neural Networks: A Review,” IEEE Circuits Syst. Mag., vol. 23, no. 2, pp. 8–28, 2023, doi: 10.1109/MCAS.2023.3267921.
[6] S. Ren and K. Q. Zhu, “Low-Rank Prune-And-Factorize for Language Model Compression,” ICCL, 2024, pp. 10822–10832.
[7] Y. Guo et al., “PELA: Learning Parameter-Efficient Models with Low-Rank Approximation,” CVPR, 2024, pp. 15699–15709.
[8] C.-C. Chang et al., “FLORA: Fine-Grained Low-Rank Architecture Search for Vision Transformer”, WACV, 2024, pp. 2482–2491.

Contact

Please contact Moritz Thoma (Moritz.Thoma@bmw.de)

Supervisor:

Conrad Foik

Optimizing Vision Transformer Models: Techniques for Memory and Time-Efficient Pruning

Description

In recent years, vision transformers (ViTs) have revolutionized the field of computer vision, achieving remarkable success across a wide range of tasks such as image classification [1], object detection [2], and semantic segmentation [3]. Despite their impressive performance, vision transformers come with significant computational costs and memory requirements, making them challenging to deploy in resource-constrained environments. This is where model pruning - a technique aimed at reducing the size and complexity of neural networks - comes into play. By selectively removing less important weights and neurons, pruning can substantially reduce the computational burden and memory footprint of ViTs without significantly affecting their accuracy. However, traditional pruning methods can be time-consuming and computationally intensive, which limits their practicality - especially for very large models. Hence, memory- and time-efficient pruning techniques are essential to make large models deployable with reasonable compute effort.


This seminar aims to explore these advanced pruning strategies specifically tailored for vision transformers [4,5], focusing on achieving a balance between model size reduction and computational efficiency.


[1] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ICLR, 2021.
[2] N. Carion et al., “End-to-end object detection with transformers.” ECCV, 2020.
[3] B. Cheng et al., “Masked-attention mask transformer for universal image segmentation.” CVPR, 2022.
[4] W. Kwon et al., “A Fast Post-Training Pruning Framework for Transformers.” NeurIPS, 2022.
[5] M. Sun et al., “A Simple and Effective Pruning Approach for Large Language Models,” ICLR, 2024.

Contact

Please contact Moritz Thoma (Moritz.Thoma@bmw.de)

Supervisor:

Conrad Foik

Compression Techniques for Floating-Point Weights in Machine Learning Models

Description

 

Deep Neural Networks (DNNs) offer possibilities for tackling practical challenges and broadening the scope of Artificial Intelligence (AI) applications. The considerable computational and memory needs of current neural networks stem from the increasing complexity of network structures, which involve numerous layers containing millions of parameters. The energy consumption during DNN inference is predominantly attributed to the access and processing of these parameters. To tackle the significant size of models integrated into Internet of Things (IoT) devices, a promising strategy is to diminish the bit-width of the weights.

 

The objective of this seminar is to conduct a comprehensive literature survey of compression techniques available for floating-point weights and to gather the advantages and disadvantages of the available solutions. Depending on the time and the reviewed content, the survey can be extended to find a hardware-efficient technique for the compression of floating-point weights.
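One representative family of such techniques is weight sharing: cluster the float weights to a small codebook and store short indices instead of 32-bit values, as popularized by Deep Compression [2]. The sketch below is a heavily simplified stand-in (a tiny 1-D k-means and a 2-bit codebook), not the paper's actual pipeline:

```python
# Weight sharing in the spirit of Deep Compression [2], simplified:
# cluster float weights to 2**bits shared centroids (1-D k-means),
# then store small indices plus a codebook instead of full floats.

def kmeans_1d(values, k, iters=20):
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[j].append(v)
        centroids = [sum(b) / len(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

def compress(weights, bits=2):
    k = 2 ** bits
    codebook = kmeans_1d(weights, k)
    indices = [min(range(k), key=lambda c: abs(w - codebook[c]))
               for w in weights]
    return codebook, indices

def decompress(codebook, indices):
    return [codebook[i] for i in indices]
```

Storage shrinks from 32 bits per weight to `bits` per weight plus the shared codebook, at the cost of the reconstruction error the survey would need to characterize.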

 

 

Bibliography:

 

[1] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “AMC: AutoML for model compression and acceleration on mobile devices,” 2019.
[2] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” 2016.
[3] G. C. Marinò, G. Ghidoli, M. Frasca, and D. Malchiodi, “Compression strategies and space-conscious representations for deep neural networks,” in 2020 25th International Conference on Pattern Recognition (ICPR), 2021, pp. 9835–9842.
[4] G. C. Marinò, A. Petrini, D. Malchiodi, and M. Frasca, “Compact representations of convolutional neural networks via weight pruning and quantization,” CoRR, vol. abs/2108.12704, 2021. [Online]. Available: https://arxiv.org/abs/2108.12704

 

 

 

Contact

Supervisor:

Conrad Foik

On Memory Optimization of Tensor Programs

Short Description:
In this seminar the student will review state-of-the-art memory-aware optimization techniques applied to tensor-level AI programs.

Description

Compact electronic edge devices have limited memory resources. As AI models can require large amounts of memory, running them on edge devices becomes challenging. AI programs deployed on edge devices therefore have to be optimized in a way that avoids costly memory transfers.

This need has motivated current works exploring different memory-aware optimization techniques that reduce memory utilization but do not modify the DNN parameters (as during compression or network architecture search (NAS)), such as fused tiling, memory-aware scheduling, and memory layout planning [1].   For instance, DORY (Deployment Oriented to memoRY) is an automated tool designed for deploying deep neural networks (DNNs) on low-cost microcontroller units with less than 1MB of on-chip SRAM memory. It tackles the challenge of tiling by framing it as a Constraint Programming (CP) problem, aiming to maximize the utilization of L1 memory while adhering to the topological constraints of each DNN layer. DORY then generates ANSI C code to manage the transfers between off-chip and on-chip memory and the computation phases [2].  DORY has been integrated with TVM to ease the support for heterogeneous compilation and offloading operations not supported by the accelerator to a regular host CPU [3].
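The tiling problem DORY formulates as a constraint program can be illustrated with a deliberately tiny brute-force search: choose the largest square tile of a C = A @ B workload whose working set fits in L1, maximizing L1 utilization. This sketch ignores DORY's real constraints (layer topology, DMA double buffering, per-dimension tiles) and is only meant to show the optimization objective:

```python
# Toy tiling-as-search: for square t x t tiles of A, B and C, keep the
# tile whose working set fits in L1 with the highest L1 utilization.
# A drastic simplification of the CP formulation used by DORY [2].

def best_tile(M, N, K, l1_bytes, elem_bytes=1):
    best = None
    for t in range(1, min(M, N, K) + 1):
        working_set = 3 * t * t * elem_bytes   # one tile each of A, B, C
        if working_set <= l1_bytes:
            util = working_set / l1_bytes
            if best is None or util > best[1]:
                best = (t, util)
    return best  # (tile size, L1 utilization) or None if nothing fits
```

Real deployments additionally trade utilization against DMA transfer counts and alignment, which is why a constraint solver rather than enumeration is used in practice.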

This seminar topic reviews state-of-the-art approaches for memory-aware optimization techniques of ML tensor programs targeting constrained edge devices. The different methods and results shall be reviewed and compared.

References:

[1] Rafael Christopher Stahl.Code Optimization and Generation of Machine Learning and Driver Software for Memory-Constrained Edge Devices. 2024. Technical University of Munich, PhD Thesis. URL: https://mediatum.ub.tum.de/doc/1730282/1730282.pdf

[2] A. Burrello, A. Garofalo, N. Bruschi, G. Tagliavini, D. Rossi and F. Conti, "DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs," in IEEE Transactions on Computers, vol. 70, no. 8, pp. 1253-1268, 2021, https://doi.org/10.1109/TC.2021.3066883

 

[3] Van Delm, Josse, et al. "HTVM: Efficient neural network deployment on heterogeneous TinyML platforms." 2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 2023. https://doi.org/10.1109/DAC56929.2023.10247664

Contact

Andrew.stevens@infineon.com 

Daniela.sanchezlopera@infineon.com  

 

Supervisor:

Daniela Sanchez Lopera - Andrew Stevens (Infineon Technologies)

On-device learning

Short Description:
In this seminar the student will review state-of-the-art contributions to the on-device learning research area.

Description

TinyML is a research area aiming to bring machine learning models to resource-constrained IoT devices and microcontrollers. Current research mainly focuses on enabling inference on such devices, tackling challenges such as the limited memory and computation resources available. But for specific sensing and IoT applications, on-device learning would allow retraining and refining ML models directly on small and low-power devices. However, on-device learning on edge devices is much more challenging than inference due to the larger memory footprint and the additional compute operations required to store intermediate activations and gradients [1].

To tackle those challenges, different strategies involving, among others, quantization, sparse backpropagation, or new layer types have been proposed and summarized [1, 2]. This seminar will review state-of-the-art approaches for on-device learning techniques targeting constrained edge devices. The different methods and results shall be reviewed and compared.
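A back-of-the-envelope model shows why training dominates memory: inference can ping-pong between two activation buffers, while vanilla backpropagation must keep every layer's activations alive until its gradient is computed. The sketch below counts activation storage only and ignores weights, gradients, and optimizer state; the layer sizes in the usage note are arbitrary:

```python
# Rough activation-memory comparison for a feed-forward network.
# Inference: only the two largest adjacent activation buffers coexist.
# Training: all intermediate activations are retained for backprop.
# Simplified sketch; weights, gradients and optimizer state ignored.

def inference_activation_bytes(layer_sizes, elem_bytes=4):
    return max(a + b for a, b in zip(layer_sizes, layer_sizes[1:])) * elem_bytes

def training_activation_bytes(layer_sizes, elem_bytes=4):
    return sum(layer_sizes) * elem_bytes
```

For a hypothetical 784-256-256-10 network this already gives a noticeable gap, and the gap widens with depth, which is what techniques like sparse backpropagation attack.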

 

References:

[1] J. Lin, L. Zhu, W. -M. Chen, W. -C. Wang and S. Han, "Tiny Machine Learning: Progress and Futures [Feature]," in IEEE Circuits and Systems Magazine, vol. 23, no. 3, pp. 8-34, 2023, https://doi.org/10.1109/MCAS.2023.3302182

[2] Shuai Zhu, Thiemo Voigt, Fatemeh Rahimian, and JeongGil Ko. 2024. On-device Training: A First Overview on Existing Systems. ACM Trans. Sen. Netw. Just Accepted (September 2024). https://doi.org/10.1145/3696003  

Contact

Supervisor:

Daniela Sanchez Lopera - Andrew Stevens (Infineon Technologies)

Innovative Memory Architectures in DNN Accelerators

Description

With the growing complexity of neural networks, more efficient and faster processing solutions are vital to enable the widespread use of artificial intelligence. Systolic arrays are among the most popular architectures for energy-efficient and high-throughput DNN hardware accelerators.

While many works implement DNN accelerators using systolic arrays on FPGAs, several ASIC designs from industry and academia have also been presented [1-3]. To fulfill the requirements that such accelerators place on memory accesses, both in terms of data availability and latency hiding, innovative memory architectures can enable more efficient data access, reducing latency and bridging the gap towards even more powerful DNN accelerators.

One example is the Eyeriss v2 ASIC [1], which uses a distributed Global Buffer (GB) layout tailored to the demands of their row-stationary systolic array dataflow.
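The leverage such buffer hierarchies provide can be seen with a simple operand-fetch count for C = A @ B: without on-chip reuse every multiply-accumulate reads both operands from backing memory, whereas with full reuse each input element is fetched once. This is a counting model only; no dataflow is simulated:

```python
# Operand-fetch counting for C = A @ B (A is M x K, B is K x N).
# No reuse: two operand reads per MAC. Full reuse: each element of A
# and B fetched from backing memory exactly once. Counting model only.

def fetches_no_reuse(M, N, K):
    return 2 * M * N * K

def fetches_full_reuse(M, N, K):
    return M * K + K * N

def reuse_factor(M, N, K):
    return fetches_no_reuse(M, N, K) / fetches_full_reuse(M, N, K)
```

For square 64 x 64 operands the gap is already a factor of 64; realized dataflows (row-stationary, weight-stationary, etc.) land somewhere between the two extremes depending on buffer sizes.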

In this seminar, a survey of state-of-the-art DNN accelerator designs and design frameworks shall be created, focusing on their memory hierarchy.

References and Further Resources:

[1] Y. -H. Chen, T. -J. Yang, J. Emer and V. Sze. 2019 "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292-308, June 2019, doi: https://doi.org/10.1109/JETCAS.2019.2910232

[2] Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. "DianNao family: energy-efficient hardware accelerators for machine learning." In Commun. ACM 59, 11 (November 2016), 105–112. https://doi.org/10.1145/2996864

[3] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, et al. 2017. "In-Datacenter Performance Analysis of a Tensor Processing Unit." In Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3079856.3080246

[4] Rui Xu, Sheng Ma, Yang Guo, and Dongsheng Li. 2023. A Survey of Design and Optimization for Systolic Array-based DNN Accelerators. ACM Comput. Surv. 56, 1, Article 20 (January 2024), 37 pages. https://doi.org/10.1145/3604802

[5] Bo Wang, Sheng Ma, Shengbai Luo, Lizhou Wu, Jianmin Zhang, Chunyuan Zhang, and Tiejun Li. 2024. "SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow." ACM Trans. Des. Autom. Electron. Syst. 29, 2, Article 26 (March 2024), 32 pages. https://doi.org/10.1145/3634703
 

Contact

benedikt.schaible@tum.de

Supervisor:

Benedikt Schaible

From Tree to Bus: Modifying Obstacle-Avoiding Steiner Tree Algorithms for the Synthesis of Bus Topology

Description

The ultimate goal of this study is to generate a bus topology that minimizes wire length while considering obstacles. When examining general obstacle-aware routing problems considering wire length minimization, the most widely acknowledged automatic routing method is the Obstacle-Avoiding Steiner Minimum Tree (OASMT). The OASMT algorithm is typically used to generate tree topologies, connecting nodes through branching structures. To achieve bus topology, we aim to modify the existing OASMT algorithm by adjusting the node connection order so that it produces a bus topology structure. The task will focus solely on this modification process, changing the node connections to achieve a bus structure without involving further wire length minimization.
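As a toy illustration of the difference between the two topologies (the pin coordinates and helper names below are hypothetical, and obstacles are ignored), the sketch compares the rectilinear wirelength of a greedy tree construction (Prim-style, standing in for OASMT-like branching) against a nearest-neighbour chain that enforces the bus property of at most two neighbours per node:

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def mst_length(pins):
    # Prim's algorithm: a branching tree topology, as tree-based routers produce
    in_tree = {pins[0]}
    total = 0
    while len(in_tree) < len(pins):
        d, p = min((manhattan(u, v), v)
                   for u in in_tree for v in pins if v not in in_tree)
        in_tree.add(p)
        total += d
    return total

def bus_length(pins):
    # Nearest-neighbour chain: every pin has at most two neighbours (bus topology)
    rest = list(pins[1:])
    cur, total = pins[0], 0
    while rest:
        d, nxt = min((manhattan(cur, p), p) for p in rest)
        rest.remove(nxt)
        total += d
        cur = nxt
    return total

pins = [(0, 0), (4, 1), (2, 5), (6, 6), (1, 3)]
print(mst_length(pins), bus_length(pins))  # the bus is never shorter than the tree
```

The chain is generally longer than the tree on the same pins, which is exactly the trade-off the modified connection order has to manage.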

Contact

m.lian@tum.de

Supervisor:

Meng Lian

Automatic Mixed Precision Quantization Search

Description

Deep Neural Networks (DNNs) have shown high predictive performance on various tasks in computer vision and natural language processing. However, the high memory footprint and large compute requirements of DNNs restrict their potential deployment on embedded devices with limited memory. Quantization is a promising solution to enable such deployment by representing the floating-point parameters of DNNs in a smaller number of bits. 

Early works on quantization assign the same bits across all layers to compress a DNN, but this typically leads to significant accuracy degradation, especially for ultra-low bit precision. A more recent solution is to perform Mixed-Precision Quantization, where more sensitive layers are maintained at higher precision while less sensitive layers are reduced to a lower precision. As such, Mixed-Precision Quantization provides a better trade-off between model efficiency and accuracy.

Finding the optimal bit-width assignment for each layer in Mixed-Precision Quantization is challenging due to its exponentially large search space. Recent literature proposes several promising frameworks to address this challenge [1, 2, 3, 4]. This seminar topic covers literature research and a survey of state-of-the-art methods for Mixed-Precision Quantization, including the advantages and drawbacks of each method.
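To make the accuracy-versus-bits trade-off concrete, here is a minimal sketch (not taken from any of the cited frameworks; the layer data is random for illustration) of symmetric uniform quantization at different bit-widths. In a mixed-precision search, a per-layer sensitivity measure like this reconstruction error would guide which layers receive more bits:

```python
import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization to 2^(bits-1)-1 positive integer levels
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale  # dequantized values

rng = np.random.default_rng(0)
layer = rng.normal(size=1000)  # stand-in for one layer's weights

# Reconstruction error grows as the bit-width shrinks; a mixed-precision
# search assigns low bit-widths only to layers that tolerate this error well.
errors = {bits: float(np.mean((layer - quantize(layer, bits)) ** 2))
          for bits in (8, 4, 2)}
print(errors)
```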

 

[1] Rakka, Mariam, et al. "Mixed-precision neural networks: A survey." arXiv preprint arXiv:2208.06064 (2022).

[2] Yao, Zhewei, et al. "HAWQ-v3: Dyadic neural network quantization." International Conference on Machine Learning. PMLR, 2021.

[3] Tang, Chen, et al. "Mixed-precision neural network quantization via learned layer-wise importance." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022.

[4] Pandey, Nilesh Prasad, et al. "A practical mixed precision algorithm for post-training quantization." arXiv preprint arXiv:2302.05397 (2023).

Contact

E-mail: mikhael.djajapermana@tum.de

Supervisor:

Mikhael Djajapermana

Comparative Study of Hardware Architectures for Neural Network Accelerators

Description

This literature review will focus on exploring and comparing different hardware architectures designed specifically for neural network accelerators, examining how each architecture is optimized for specific neural network tasks (e.g., convolutional neural networks (CNNs)).

The study could highlight the trade-offs between various design choices, such as parallelism, memory hierarchy, dataflow, flexibility, and integration with CPUs.

Contact

samira.ahmadifarsani@tum.de

Supervisor:

Samira Ahmadifarsani

Generally Applicable Power Estimation Methodologies for Accelerators

Description

Power dissipation modeling is becoming increasingly important in integrated circuit design. In particular, the increasing use of dedicated accelerator components in electronic systems requires precise power estimation of architecture-level design choices in early design stages.

Different approaches, such as data-driven models (e.g., [1]), aim to support diverse kinds of circuit designs. On the other hand, many specialized approaches have been developed. However, a generally applicable framework would offer more opportunities for integration into design flows. Here, frameworks such as [2] have been presented.

The goal of this seminar topic is to identify power estimation methodologies for accelerator designs that are transferable across various design classes. A focus should also be placed on the origin of the underlying power models and their generality.
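A minimal sketch of the bottom-up, Accelergy-style idea of architecture-level estimation: total energy is the sum over components of action counts times per-action energies. The per-action energies and counts below are illustrative placeholders, not characterized values from any real design:

```python
# Per-action energies in picojoules (illustrative placeholder numbers only;
# real flows obtain these from characterization or technology models).
energy_per_action_pj = {"mac": 0.5, "sram_read": 5.0, "dram_read": 200.0}

def estimate_energy_pj(action_counts):
    # Architecture-level estimate: sum of (count x per-action energy)
    return sum(action_counts[a] * energy_per_action_pj[a] for a in action_counts)

# A design with on-chip reuse replaces most DRAM reads with SRAM reads.
no_reuse   = {"mac": 1_000_000, "sram_read": 0,         "dram_read": 2_000_000}
with_reuse = {"mac": 1_000_000, "sram_read": 1_900_000, "dram_read": 100_000}
print(estimate_energy_pj(no_reuse), estimate_energy_pj(with_reuse))
```

The point of such a model is that it generalizes across design classes as long as action counts and per-action energies can be supplied.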

 

References:

[1] ZHANG, Yanqing; REN, Haoxing; KHAILANY, Brucek. GRANNITE: Graph neural network inference for transferable power estimation. In: 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE, 2020. S. 1-6.

[2] WU, Yannan Nellie; EMER, Joel S.; SZE, Vivienne. Accelergy: An architecture-level energy estimation methodology for accelerator designs. In: 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019. S. 1-8.

Contact

If you are interested in this topic, contact me at:

philipp.fengler@tum.de

Supervisor:

Philipp Fengler

Thermal Modeling of Integrated Circuits on Architecture Level

Description

Although the end of Moore's Law has already been predicted for the near future, integrated circuit design still continues to miniaturize and to increase available computational power. However, non-functional measures, like power dissipation, are receiving more and more attention.

Another increasingly important factor is heat generation. Thermal management [1] is needed to ensure the reliability of electronic devices. To enable efficient, reliable and performant circuits, hence, thermal characteristics need to be precisely estimated already in early design stages. This requires not only precise thermal characterization, like in [2], but also suitable thermal modeling methods (e.g. [3]).

The target of this seminar topic is to provide a comprehensive study of which thermal effects need to be modeled in early design stages and which modeling approaches have already been developed.
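As a minimal sketch of architecture-level thermal modeling, the snippet below integrates a single-node lumped-RC compact model, the basic building block of HotSpot-style approaches [3]. All parameter values are illustrative assumptions, not characterized data:

```python
def simulate_temperature(power_trace_w, r_th=2.0, c_th=0.5, t_amb=25.0, dt=0.01):
    # Single-node compact thermal model (lumped RC):
    #   C_th * dT/dt = P(t) - (T - T_amb) / R_th
    # integrated with forward Euler; r_th in K/W, c_th in J/K, dt in s.
    t = t_amb
    trace = []
    for p in power_trace_w:
        t += dt * (p - (t - t_amb) / r_th) / c_th
        trace.append(t)
    return trace

# A 10 W power step: temperature approaches the steady state
# T_amb + P * R_th = 25 + 10 * 2 = 45 C with time constant R_th * C_th = 1 s.
trace = simulate_temperature([10.0] * 2000)
print(trace[-1])
```

Real frameworks extend this to a network of many such RC nodes, one per architectural block, which is exactly the kind of modeling choice the survey should discuss.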

References:

[1] GARIMELLA, Suresh V., et al. Electronics thermal management in information and communications technologies: Challenges and future directions. IEEE Transactions on Components, Packaging and Manufacturing Technology, 2016, 7. Jg., Nr. 8, S. 1191-1205.   [2] MESA-MARTINEZ, Francisco Javier; ARDESTANI, Ehsan K.; RENAU, Jose. Characterizing processor thermal behavior. ACM SIGARCH Computer Architecture News, 2010, 38. Jg., Nr. 1, S. 193-204.   [3] STAN, Mircea R., et al. Hotspot: A dynamic compact thermal model at the processor-architecture level. Microelectronics Journal, 2003, 34. Jg., Nr. 12, S. 1153-1165.

 

Contact

If you are interested in this topic, contact me at:

philipp.fengler@tum.de

Supervisor:

Philipp Fengler

Design Space Exploration Methods for Neural Network Accelerators

Description

The efficiency of an accelerator depends on three factors: the mapping, the deep neural network (DNN) layers, and the hardware. Hardware design space exploration requires both the hardware parameters and the mappings of the algorithm onto the target hardware to be discovered and optimized.

This project aims to identify the most prominent approaches and compare them.

Contact

samira.ahmadifarsani@tum.de

Supervisor:

Samira Ahmadifarsani

Pre-training Network Pruning

Short Description:
In this seminar the student will review state-of-the-art pruning techniques applied before training, such as SNIP.

Description

Pruning large neural networks while maintaining their performance is often desirable due to the reduced space and time complexity. Conventionally, pruning is done within an iterative optimization procedure, with either heuristically designed pruning schedules or additional hyperparameters during training, or using statistical heuristics after training. However, using suitable heuristic criteria inspired by the "Lottery Ticket" hypothesis, networks can also be pruned before training. This eliminates the need for both pretraining and complex pruning schedules, is robust to architecture variations, and is well suited for use in combination with neural architecture search. The canonical method SNIP [1] introduces a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task. These methods can obtain extremely sparse networks and are claimed to retain the same accuracy as the reference network on benchmark classification tasks. As such, pre-training pruning methods are potentially a highly attractive alternative to post-training and training-time co-optimization methods for use in automated industrial machine learning deployment toolchains.

References:

[1] Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip H. S. Torr. "SNIP: Single-shot network pruning based on connection sensitivity." arXiv, 2018. https://arxiv.org/abs/1810.02340

[2] Artem Vysogorets and Julia Kempe. "Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity." https://www.jmlr.org/papers/volume24/22-0415/22-0415.pdf

[3] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. "Pruning Neural Networks at Initialization: Why Are We Missing the Mark?" ICLR 2021. https://arxiv.org/abs/2009.08576

[4] Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, and Puneet K. Dokania. "Progressive Skeletonization: Trimming more fat from a network at initialization." https://arxiv.org/abs/2006.09081
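A minimal numpy sketch of the SNIP criterion from [1] (the toy linear model, random data, and keep ratio are arbitrary choices for illustration): the saliency of each connection is |dL/dw * w| evaluated at initialization, and only the top-k most sensitive connections are kept before any training:

```python
import numpy as np

def snip_mask(weights, grads, keep_ratio):
    # SNIP connection sensitivity: saliency |dL/dw * w|, keep the top-k
    # most sensitive connections, prune the rest (mask of 0s and 1s).
    saliency = np.abs(grads * weights)
    k = int(keep_ratio * saliency.size)
    threshold = np.sort(saliency.ravel())[-k]
    return (saliency >= threshold).astype(float)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))            # untrained weights of a toy linear layer
x, y = rng.normal(size=8), rng.normal(size=4)
g = np.outer(w @ x - y, x)             # gradient of 0.5*||w @ x - y||^2 w.r.t. w
mask = snip_mask(w, g, keep_ratio=0.25)
print(int(mask.sum()))                 # 8 of 32 connections survive
```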

Contact

Andrew.stevens@infineon.com

Daniela.sanchezlopera@infineon.com 

Supervisor:

Daniela Sanchez Lopera - Andrew Stevens (Infineon Technologies AG)

Tensor program optimization for machine learning models

Short Description:
In this seminar the student will review state-of-the-art tools for performing tensor optimizations for machine learning (ML) models, focusing especially on TVM-related approaches.

Description

The widespread use of ML in many real-life applications is enabled by deep learning models optimized for and deployed to specific hardware platforms and devices. Typically, deep learning frameworks depend on libraries that have been manually optimized. Engineers need to choose from many tensor programs that are logically equivalent but differ significantly in performance due to memory access, threading, and the use of specialized hardware primitives. This selection and optimization process is quite labor-intensive. With the growth of model architectures, their sizes, and the number of hardware targets or backends, tensor program optimization has become a key factor for the efficient deployment of ML models.

In this seminar paper, an extensive survey of state-of-the-art tools for tensor program optimization shall be done, focusing especially on tools leveraging the ML compiler TVM, such as AutoTVM [1], Ansor [2], and MetaSchedule [3], or others comparing against TVM, such as TensorIR [4]. Key areas of focus are:

- Understanding the core idea of the reviewed methods
- Examining and comparing reported results
- Assessing the advantages of using those methods and potential drawbacks

References:

[1] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to optimize tensor programs (NIPS'18). https://proceedings.neurips.cc/paper_files/paper/2018/hash/8b5700012be65c9da25f49408d959ca0-Abstract.html

[2] Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, Koushik Sen, Joseph E. Gonzalez, and Ion Stoica. 2020. Ansor: generating high-performance tensor programs for deep learning (OSDI'20). https://www.usenix.org/conference/osdi20/presentation/zheng

[3] Shao, Junru, et al. "Tensor program optimization with probabilistic programs." Advances in Neural Information Processing Systems 35 (2022): 35783-35796. https://proceedings.neurips.cc/paper_files/paper/2022/hash/e894eafae43e68b4c8dfdacf742bcbf3-Abstract-Conference.html

[4] Siyuan Feng, Bohan Hou, Hongyi Jin, Wuwei Lin, Junru Shao, Ruihang Lai, Zihao Ye, Lianmin Zheng, Cody Hao Yu, Yong Yu, and Tianqi Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization (ASPLOS 2023). https://dl.acm.org/doi/abs/10.1145/3575693.3576933
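A toy illustration of why this selection problem exists (a minimal sketch in plain Python, not TVM code): the two loop nests below are logically equivalent matrix multiplications, yet the reordered variant walks the rows of b contiguously, which is the kind of memory-access difference that makes one schedule much faster than another on real hardware:

```python
def matmul_ijk(a, b, n):
    # Textbook loop order: the inner loop strides down a column of b.
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_ikj(a, b, n):
    # Same computation, reordered loops: the inner loop walks a row of b,
    # which is cache-friendly and typically faster on real hardware.
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return c

n = 16
a = [[float(i * n + j) for j in range(n)] for i in range(n)]
b = [[float((i + j) % 7) for j in range(n)] for i in range(n)]
assert matmul_ijk(a, b, n) == matmul_ikj(a, b, n)  # logically equivalent
```

Tools like AutoTVM and Ansor automate the search over such schedules instead of relying on hand-written libraries.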

Contact

  • Samira.ahmadifarsani@tum.de
  • Daniela.sanchezlopera@infineon.com

Supervisor:

Samira Ahmadifarsani

Micromixing within Microfluidic Devices: Fundamentals, Design, and Fabrication

Description

Micromixing refers to the process of mixing fluids at the microscale, a critical function in microfluidic devices. Efficient micromixing is essential for various applications, including chemical reactions, biological assays, and diagnostic tests, where precise control over fluid interactions is required. The unique challenges of micromixing arise from the laminar flow conditions typically present at the microscale, which necessitate innovative design and fabrication strategies to achieve effective mixing.
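A back-of-the-envelope check of this laminar-flow regime (the input values are typical textbook assumptions for water in a ~100 um channel at ~1 mm/s, not measurements of any specific device):

```python
# Order-of-magnitude estimate of the flow regime in a microchannel.
rho = 1000.0        # density of water, kg/m^3
mu = 1e-3           # dynamic viscosity of water, Pa*s
v = 1e-3            # mean flow velocity, m/s
d = 100e-6          # channel hydraulic diameter, m
diffusivity = 1e-9  # diffusion coefficient of a small molecule, m^2/s

reynolds = rho * v * d / mu   # far below ~2000: strictly laminar, no turbulence
peclet = v * d / diffusivity  # >> 1: advection outpaces diffusion, so passive
                              # mixing by diffusion alone is slow
print(f"Re = {reynolds:.3g}, Pe = {peclet:.3g}")
```

The combination of a tiny Reynolds number and a large Peclet number is precisely why dedicated micromixer designs are needed.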

 

In this seminar, you will investigate the fundamental principles of micromixing within microfluidic devices, emphasizing the physical and chemical phenomena that govern fluid behavior at the microscale. You will explore various design strategies aimed at enhancing micromixing efficiency, such as passive and active mixing techniques and the use of microstructures and external forces. You will also cover the fabrication methods employed to create microfluidic devices with optimized mixing capabilities, including lithography, soft lithography, and 3D printing. By examining these aspects, this seminar provides a comprehensive understanding of the current state of micromixing technology and its potential future developments.

Contact

Yushen.Zhang+Seminar@cit.tum.de

Supervisor:

Yushen Zhang

Microfluidic chips: recent advances, critical strategies in design, applications and future perspectives

Description

Microfluidics is a multidisciplinary field that involves the manipulation of fluids at the microscale. It integrates principles from physics, chemistry, engineering, and biotechnology to develop devices, known as microfluidic chips, that can precisely control and analyze small volumes of fluids. These chips have revolutionized various scientific and industrial processes due to their high efficiency, reduced reagent consumption, and rapid processing times.

In this seminar, you will explore the recent advances in microfluidic chip technology, focusing on innovative design strategies that enhance their functionality and performance. You will examine the critical aspects of microfluidic chip design, including material selection, fabrication techniques, and integration with other technologies. You will also survey the diverse applications of microfluidic chips in fields such as medical diagnostics, drug development, environmental monitoring, and chemical synthesis. Finally, you will provide a comprehensive outlook on future perspectives, highlighting emerging trends and potential challenges in the continued evolution of microfluidic technology.

Contact

Yushen.Zhang+Seminar@cit.tum.de

Supervisor:

Yushen Zhang

Survey: Methods for Worst-Case-Execution-Time Analysis of Embedded Software

Description

To meet the strict requirements of modern embedded real-time systems, software designers rely on efficient methods to determine the worst-case-execution-time (WCET) of their programs. The literature has proposed both static and dynamic methods.

The goal of this project is to identify the most prominent approaches and to compare them.

Contact

conrad.foik@tum.de

Supervisor:

Conrad Foik

Post-processing Flow-Layer Routing with Length-Matching Constraint for Flow-Based Microfluidic Biochips

Description

 


This project addresses the challenges in the current process of synthesizing microfluidic chips, particularly focusing on the gap in the complete synthesis flow which can lead to reduced performance, resource wastage, or infeasible designs. The general synthesis process typically involves three stages: high-level synthesis, followed by the design of the flow layer, and finally, the design of the control layer.

Current state-of-the-art synthesis methods, primarily operating at the operation- and device-level, make assumptions regarding the availability of fluid transportation paths. They often overlook the physical layout of control and flow channels and neglect the flow rate. This oversight can lead to biased scheduling of fluid transportation time during synthesis.

Our project proposes an innovative approach to bridge this gap. By considering the known physical design of microfluidic chips and the desired experiments, represented as sequence graphs, we aim to improve the physical design. The approach involves adjusting the lengths of the channels according to the required fluid volume. This adjustment is expected to reduce the number of valves and control ports in the original physical design, thereby enhancing the efficiency and feasibility of microfluidic chip designs.

Contact

m.lian@tum.de

Supervisor:

Meng Lian

Checksum-based Error Detection for Reliable Computing

Description

In safety-critical systems, random hardware faults, such as transient soft errors (e.g., due to radiation) or permanent circuit faults, can lead to disastrous failures. Detecting these errors is, therefore, one major design goal. A state-of-the-art solution is redundancy, where a computation is performed multiple times and the respective results are compared. This can be achieved either sequentially (temporal redundancy) or at the same time, e.g., through lock-stepped computational units (spatial redundancy). The underlying assumption is that a second fault does not occur in close vicinity to the first.

However, this redundancy method introduces a significant overhead to the system: the required multiplicity of computational resources, i.e., execution time or processing nodes. Checksum-based computation aims to mitigate this overhead by introducing redundancy into the algorithms themselves, e.g., filter and input checksums for convolution algorithms.
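A minimal sketch of this idea for matrix multiplication, in the style of Huang-Abraham algorithm-based fault tolerance (the matrices and the injected fault are arbitrary illustrations): A is extended with a column-checksum row and B with a row-checksum column, so that checksums of the extended product verify the result without recomputing it:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 10, (3, 3)).astype(float)
b = rng.integers(0, 10, (3, 3)).astype(float)

# Append a column-checksum row to A and a row-checksum column to B; the
# checksum relations are preserved by the multiplication itself.
a_ext = np.vstack([a, a.sum(axis=0, keepdims=True)])
b_ext = np.hstack([b, b.sum(axis=1, keepdims=True)])
c_ext = a_ext @ b_ext

def checksums_ok(c_ext):
    # Row/column sums of the result must match the checksum row/column.
    return (np.allclose(c_ext[-1, :-1], c_ext[:-1, :-1].sum(axis=0)) and
            np.allclose(c_ext[:-1, -1], c_ext[:-1, :-1].sum(axis=1)))

assert checksums_ok(c_ext)       # fault-free product passes the check
c_ext[1, 2] += 5.0               # inject a transient fault into one element
assert not checksums_ok(c_ext)   # the checksum test detects it
```

Compared with duplicating the whole multiplication, only one extra row and column are computed.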

Supervisor:

Johannes Geier

Arithmetic Code-based Error Detection for Reliable Computing

Description

In safety-critical systems, random hardware faults, such as transient soft errors (e.g., due to radiation) or permanent circuit faults, can lead to disastrous failures. Detecting these errors is, therefore, one major design goal. A state-of-the-art solution is redundancy, where a computation is performed multiple times and the respective results are compared. This can be achieved either sequentially (temporal redundancy) or at the same time, e.g., through lock-stepped computational units (spatial redundancy). The underlying assumption is that a second fault does not occur in close vicinity to the first. However, this redundancy method introduces a significant overhead to the system: the required multiplicity of computational resources, i.e., execution time or processing nodes.

Code-based computation aims to mitigate the amount of computational overhead without loss in sensitivity. Notable examples are AN-codes and residue-based error detection.
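A minimal sketch of the AN-code idea (the code constant A = 59 is a common textbook choice and the bit-flip position is arbitrary): every value x is carried as A*x, addition on encoded values preserves the code, and a residue check exposes any fault whose error term is not a multiple of A:

```python
A = 59  # code constant; a residue check detects any error not divisible by A

def encode(x):
    return A * x

def check(c):
    # A valid code word is always an exact multiple of A.
    return c % A == 0

x, y = encode(17), encode(25)
s = x + y                  # addition is preserved: A*17 + A*25 = A*42
assert check(s) and s // A == 42

s_faulty = s ^ (1 << 3)    # flip one bit of the encoded sum
assert not check(s_faulty) # the residue check exposes the fault
```

Since no power of two is divisible by 59, every single bit flip on an encoded value is detected, at the cost of arithmetic on larger numbers rather than duplicated execution.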

 

Supervisor:

Johannes Geier

Alternative Optimization Methods for Training of Neural Networks

Description

To this day, the go-to algorithms for training neural networks are stochastic gradient descent or some flavour of the Adam family. Other algorithms, such as quasi-Newton or derivative-free methods, are less frequently used, although they can, under certain circumstances, yield better results. The goal of this work is to write a survey on the usage of alternative methods for training, such as quasi-Newton or derivative-free algorithms.
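As a minimal illustration of why curvature information can help (a toy example, not a training recipe): on a one-dimensional quadratic, fixed-step gradient descent only approaches the minimum geometrically, while a single Newton step using the second derivative lands on it exactly:

```python
# f(x) = (x - 3)^2, minimized at x = 3.
def grad(x):
    return 2.0 * (x - 3.0)

def hess(x):
    return 2.0

x_gd = 0.0
for _ in range(50):
    x_gd -= 0.1 * grad(x_gd)      # gradient descent with a fixed learning rate

x_newton = 0.0
x_newton -= grad(x_newton) / hess(x_newton)  # one Newton step, exact on quadratics

print(x_gd, x_newton)
```

Quasi-Newton methods such as (L-)BFGS approximate this curvature from gradient history instead of computing the Hessian, which is what makes them candidates for training.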

Prerequisites

Knowledge in optimization and basic linear algebra is recommended.

Contact

markus.leibl@tum.de

Supervisor:

Markus Leibl

Performance Comparison of Derivative Free Optimization Algorithms

Description

 

One of the most common approaches in optimization is to make use of first and second derivatives in order to find a minimum of a target function. In many cases, though, this is not a viable option, as derivatives might not be available, or the function might not be differentiable at all.
For such problems, we need algorithms that work without derivatives. This can be achieved, for example, using finite differences or interpolation.
One important algorithm family comprises the algorithms NEWUOA and BOBYQA by M. J. D. Powell.
The goal of this seminar is to gather works that evaluate the performance of these algorithms. Furthermore, we want to create a performance matrix that puts Powell's algorithms in relation to traditional algorithms over a range of different optimization problems.
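As a minimal sketch of the derivative-free setting (compass search is a much simpler relative of Powell's model-based NEWUOA/BOBYQA, shown here only to illustrate optimization without gradients; all names are ad hoc):

```python
def compass_search(f, x, step=1.0, tol=1e-8, max_iter=10_000):
    # Derivative-free compass search: probe each axis in both directions,
    # accept improvements, and halve the step when no probe improves.
    fx = f(x)
    for _ in range(max_iter):
        improved = False
        for i in range(len(x)):
            for s in (+step, -step):
                cand = list(x)
                cand[i] += s
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
        if not improved:
            step *= 0.5
            if step < tol:
                break
    return x, fx

sphere = lambda v: sum((vi - 1.0) ** 2 for vi in v)  # minimum at (1, 1)
x_opt, f_opt = compass_search(sphere, [5.0, -3.0])
print(x_opt, f_opt)
```

Methods like NEWUOA/BOBYQA replace these blind probes with a quadratic interpolation model of f, which is why they usually need far fewer function evaluations.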

 

Paper for BOBYQA (does not have to be read entirely!):
www.damtp.cam.ac.uk/user/na/NA_papers/NA2009_06.pdf

 

 

 

Prerequisites

Knowledge in optimization and basic linear algebra is recommended.

Contact

markus.leibl@tum.de

Supervisor:

Markus Leibl

A polynomial time optimal diode insertion/routing algorithm for fixing antenna problem

Description

Abstract — The antenna problem is a phenomenon of plasma-induced gate oxide degradation. It directly affects the manufacturability of VLSI circuits, especially in deep-submicron technology using high-density plasma. Diode insertion is a very effective way to solve this problem. Ideally, diodes are inserted directly under the wires that violate antenna rules. But in today's high-density VLSI layouts, there is simply not enough room for "under-the-wire" diode insertion for all wires. Thus it is necessary to insert many diodes at legal "off-wire" locations and extend the antenna-rule-violating wires to connect to their respective diodes. Previously, only simple heuristic algorithms were available for this diode insertion and routing problem. In this paper we show that the diode insertion and routing problem for an arbitrary given number of routing layers can be optimally solved in polynomial time. Our algorithm is guaranteed to find a feasible diode insertion and routing solution whenever one exists. Moreover, we can guarantee to find a feasible solution minimizing a cost function of the form α·L + β·N, where L is the total length of the extension wires and N is the total number of vias on the extension wires. Experimental results show that our algorithm is very efficient.

Contact

alex.truppel@tum.de

Supervisor:

Alexandre Truppel