Bachelor's Theses
Neural Architecture Search for Efficient Vision Transformer
Description
Vision Transformers (ViTs) have shown superior predictive performance to traditional Convolutional Neural Networks (CNNs) in vision applications [1,2]. However, typical ViTs demand large amounts of memory and compute due to their huge number of parameters, which restricts their applicability on edge devices with limited memory. Neural Architecture Search (NAS) is a method to automate the design of neural networks, and numerous NAS frameworks have successfully found high-performing, efficient networks that can be deployed on edge devices [3].
Once-For-All (OFA) [4] proposes a NAS framework that employs progressive shrinking to train a SuperNet efficiently as a one-time cost. Specialized sub-networks can then be derived from the SuperNet without additional training. However, OFA has so far only been applied to CNN-based search spaces.
This project aims to develop a NAS framework that finds specialized ViT-based networks for TinyML applications [5]. We are interested in the automated design of efficient ViTs with high accuracy and low computational and memory overhead.
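To make the core idea concrete, the following is a minimal, illustrative PyTorch sketch (not the OFA codebase) of a weight-sharing SuperNet whose hidden layer can be evaluated at several widths, so that smaller sub-networks reuse slices of the same trained weights; all class and parameter names are placeholders.

```python
# Illustrative sketch only (placeholder names, not the OFA code): a tiny
# weight-sharing SuperNet whose hidden layer can run at several widths,
# so that specialized sub-networks reuse slices of the shared weights.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class ElasticLinear(nn.Module):
    """Linear layer that can be evaluated with only its first `width` output units."""

    def __init__(self, in_features, max_out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out_features))

    def forward(self, x, width):
        # Smaller sub-networks use a prefix slice of the shared weight matrix.
        return F.linear(x, self.weight[:width], self.bias[:width])


class TinySuperNet(nn.Module):
    def __init__(self, in_dim=64, hidden_choices=(32, 64, 128), num_classes=10):
        super().__init__()
        self.hidden_choices = hidden_choices
        self.hidden = ElasticLinear(in_dim, max(hidden_choices))
        self.head_weight = nn.Parameter(torch.randn(num_classes, max(hidden_choices)) * 0.02)
        self.head_bias = nn.Parameter(torch.zeros(num_classes))

    def forward(self, x, width=None):
        # During SuperNet training a random width is sampled per step;
        # at deployment time the search fixes one width per target device.
        width = width or random.choice(self.hidden_choices)
        h = torch.relu(self.hidden(x, width))
        return F.linear(h, self.head_weight[:, :width], self.head_bias)


net = TinySuperNet()
logits_random = net(torch.randn(8, 64))            # random sub-network (training-style sampling)
logits_small = net(torch.randn(8, 64), width=32)   # fixed, specialized sub-network
```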
[1] Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[2] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." International conference on machine learning. PMLR, 2021.
[3] Benmeziane, Hadjer, et al. "A comprehensive survey on hardware-aware neural architecture search." arXiv preprint arXiv:2101.09336 (2021).
[4] Cai, Han, et al. "Once-for-all: Train one network and specialize it for efficient deployment." arXiv preprint arXiv:1908.09791 (2019).
[5] Banbury, Colby, et al. "MLPerf tiny benchmark." arXiv preprint arXiv:2106.07597 (2021).
Prerequisites
- Proficiency in Python; familiarity with deep learning libraries such as PyTorch is a plus.
- Good knowledge of deep neural network architectures.
- Experience with Vision Transformers is an advantage.
Contact
If interested, please apply by sending your CV and current transcript of records to:
mikhael.djajapermana@tum.de
Supervisor:
Master's Theses
Performance Evaluation of RISC-V Processors: Various Topics
Description
RISC-V is an emerging processor instruction set architecture (ISA) gaining increasing momentum in both industry and academia. Its open-source character allows vendors to define their own processor microarchitectures and thus design processors that meet their specific requirements.
In order to take full advantage of this possibility, designers frequently rely on early high-level simulations to identify optimal solutions. Performance (in terms of timing) is a key metric to consider during these simulations, as it affects both the processor's capability to meet timing requirements and energy efficiency.
The ETISS performance simulator, developed at TUM, is capable of providing highly accurate performance estimates early on during the design phase. It can be quickly adapted to new microarchitecture variants through a custom domain-specific language (DSL).
We are now looking for curious students to support our effort to further strengthen the simulator and its analysis environment. If you are generally interested in this research area, do not hesitate to reach out to us to discuss possible projects.
Prerequisites
Requirements for this project are:
- Fundamental knowledge of microarchitecture concepts
- Programming experience (Python or C++ are beneficial)
- Basic knowledge or interest in learning about RISC-V
Contact
Interested? Please reach out to: conrad.foik@tum.de
Please remember to include your transcript of records.
Supervisor:
Accelerating Fault Simulation at RTL on GPU Compute Clusters
RTL, fault injection, GPU, Safety, Security
Description
One of the crucial tasks in designing, testing, and verifying a digital system is the early estimation of its fault tolerance. Fault injection simulations can be used to evaluate this tolerance at different phases of development. At the Register Transfer Level (RTL), a hardware design that exists but has not yet been physically implemented can be simulated with higher accuracy than at the instruction or algorithm level. However, this accuracy comes at a cost that grows with the number of simulations performed within a fault injection analysis. For example, fault injection simulation of a CPU at RTL instead of at the Instruction Set Architecture (ISA) level increases the simulation effort because micro-architectural registers (pipeline, functional units, etc.) must be included.
To allow faster fault space exploration at RTL, one can either (a) accelerate the simulation of an individual Device Under Test (DUT) and fault, or (b) launch multiple fault simulations concurrently.
In the case of (a), state-of-the-art research on fast RTL simulations has aimed to reduce simulation cost by multi-threaded simulation on CPUs [1][2][3] or GPUs [4][5][8].
In case (b), multiple independent simulations are launched simultaneously on a distributed compute cluster. Accelerating each individual simulation matters less here, since the compute platform is fully utilized anyway through task-level parallelism across fault experiments: one CPU core per fault experiment.
Existing solutions for (b) mainly target CPU-based clusters [6][7], whereas (a) requires maximum speed for a single DUT simulation. In this work, we want to explore efficient RTL fault simulation on GPU-based compute clusters that maximizes utilization across the individual experiments launched.
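As a rough illustration of the task-level parallelism in (b), the following Python sketch launches independent fault experiments on separate CPU cores; the simulator command, signal names, and result handling are placeholders, not the actual tool interface.

```python
# Rough sketch of option (b): task-level parallelism across independent fault
# experiments, one worker (CPU core) per experiment. The simulator command,
# signal names, and result classification are placeholders, not the real tool.
import subprocess
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    signal: str   # flip-flop / register to corrupt (hypothetical hierarchical name)
    cycle: int    # clock cycle at which the bit flip is injected
    bit: int      # bit position within the signal


def run_fault_experiment(fault: Fault) -> dict:
    """Launch one independent RTL simulation with a single injected fault."""
    cmd = ["./rtl_sim", "--inject", f"{fault.signal}[{fault.bit}]@{fault.cycle}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # In a real flow the log would be classified (masked, silent data
    # corruption, crash, timeout, ...); here we only keep the exit code.
    return {"fault": fault, "returncode": result.returncode}


def explore(faults, max_workers=8):
    # Independent experiments saturate the node without parallelizing any
    # single simulation.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_fault_experiment, faults))


if __name__ == "__main__":
    fault_list = [Fault("cpu.ex_stage.alu_result_q", cycle=c, bit=0)
                  for c in range(0, 1000, 100)]
    results = explore(fault_list)
```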
Tasks:
- Set up the GPU-accelerated RTL simulation from [8][9]
- Implement a new fault injection arbitration framework for the RTL simulator
- Explore multi-experiment partitioning for fault simulation on the new platform
- Compare and benchmark against a CPU-based fault exploration scheme [6][7]
References:
- [1] W. Snyder, P. Wasson, D. Galbi, et al. Verilator. https://github.com/verilator/verilator, 2019. [Online].
- [2] S. Beamer and D. Donofrio, "Efficiently Exploiting Low Activity Factors to Accelerate RTL Simulation," 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2020, pp. 1-6, doi: 10.1109/DAC18072.2020.9218632.
- [3] Kexing Zhou, Yun Liang, Yibo Lin, Runsheng Wang, and Ru Huang. 2023. Khronos: Fusing Memory Access for Improved Hardware RTL Simulation. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '23). Association for Computing Machinery, New York, NY, USA, 180–193. https://doi.org/10.1145/3613424.3614301
- [4] H. Qian and Y. Deng, "Accelerating RTL simulation with GPUs," 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, 2011, pp. 687-693, doi: 10.1109/ICCAD.2011.6105404.
- [5] Dian-Lun Lin, Haoxing Ren, Yanqing Zhang, Brucek Khailany, and Tsung-Wei Huang. 2023. From RTL to CUDA: A GPU Acceleration Flow for RTL Simulation with Batch Stimulus. In Proceedings of the 51st International Conference on Parallel Processing (ICPP '22). Association for Computing Machinery, New York, NY, USA, Article 88, 1–12. https://doi.org/10.1145/3545008.3545091
- [6] Johannes Geier and Daniel Mueller-Gritschneder. 2023. VRTLmod: An LLVM based Open-source Tool to Enable Fault Injection in Verilator RTL Simulations. In Proceedings of the 20th ACM International Conference on Computing Frontiers (CF '23). Association for Computing Machinery, New York, NY, USA, 387–388. https://doi.org/10.1145/3587135.3591435
- [7] J. Geier, L. Kontopoulos, D. Mueller-Gritschneder and U. Schlichtmann, "Rapid Fault Injection Simulation by Hash-Based Differential Fault Effect Equivalence Checks," 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 2025, pp. 1-7, doi: 10.23919/DATE64628.2025.10993266.
- [8] Guo, Zizheng, et al. "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation." https://guozz.cn/publication/gemdac-25/gemdac-25.pdf
- [9] NVlabs. GEM. GitHub. https://github.com/NVlabs/GEM
Prerequisites
- Excellent C++ and Python skills
- Good understanding of GPU programming (CUDA) or interest in learning it
- Decent knowledge of hardware description languages (Verilog, VHDL) and EDA tools (Vivado, Yosys)
- Decent knowledge of statistics and probability
Contact
Apply with CV and Transcript of Records directly to:
johannes.geier@tum.de
Supervisor:
Neural Architecture Search for Efficient Vision Transformer
Description
Vision Transformers (ViTs) have shown superior predictive performance to traditional Convolutional Neural Networks (CNNs) in vision applications [1,2]. However, typical ViTs demand large amounts of memory and compute due to their huge number of parameters, which restricts their applicability on edge devices with limited memory. Neural Architecture Search (NAS) is a method to automate the design of neural networks, and numerous NAS frameworks have successfully found high-performing, efficient networks that can be deployed on edge devices [3].
Once-For-All (OFA) [4] proposes a NAS framework that employs progressive shrinking to train a SuperNet efficiently as a one-time cost. Specialized sub-networks can then be derived from the SuperNet without additional training. However, OFA has so far only been applied to CNN-based search spaces.
This project aims to develop a NAS framework that finds specialized ViT-based networks for TinyML applications [5]. We are interested in the automated design of efficient ViTs with high accuracy and low computational and memory overhead.
[1] Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[2] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." International conference on machine learning. PMLR, 2021.
[3] Benmeziane, Hadjer, et al. "A comprehensive survey on hardware-aware neural architecture search." arXiv preprint arXiv:2101.09336 (2021).
[4] Cai, Han, et al. "Once-for-all: Train one network and specialize it for efficient deployment." arXiv preprint arXiv:1908.09791 (2019).
[5] Banbury, Colby, et al. "MLPerf tiny benchmark." arXiv preprint arXiv:2106.07597 (2021).
Prerequisites
- Proficiency in Python; familiarity with deep learning libraries such as PyTorch is a plus.
- Good knowledge of deep neural network architectures.
- Experience with Vision Transformers is an advantage.
Contact
If interested, please apply by sending your CV and current transcript of records to:
mikhael.djajapermana@tum.de
Supervisor:
Interdisciplinary Projects
Accelerating Fault Simulation at RTL on GPU Compute Clusters
RTL, fault injection, GPU, Safety, Security
Description
One of the crucial tasks in designing, testing, and verifying a digital system is the early estimation of its fault tolerance. Fault injection simulations can be used to evaluate this tolerance at different phases of development. At the Register Transfer Level (RTL), a hardware design that exists but has not yet been physically implemented can be simulated with higher accuracy than at the instruction or algorithm level. However, this accuracy comes at a cost that grows with the number of simulations performed within a fault injection analysis. For example, fault injection simulation of a CPU at RTL instead of at the Instruction Set Architecture (ISA) level increases the simulation effort because micro-architectural registers (pipeline, functional units, etc.) must be included.
To allow faster fault space exploration at RTL, one can either (a) accelerate the simulation of an individual Device Under Test (DUT) and fault, or (b) launch multiple fault simulations concurrently.
In the case of (a), state-of-the-art research on fast RTL simulations has aimed to reduce simulation cost by multi-threaded simulation on CPUs [1][2][3] or GPUs [4][5][8].
In case (b), multiple independent simulations are launched simultaneously on a distributed compute cluster. Accelerating each individual simulation matters less here, since the compute platform is fully utilized anyway through task-level parallelism across fault experiments: one CPU core per fault experiment.
Existing solutions for (b) mainly target CPU-based clusters [6][7], whereas (a) requires maximum speed for a single DUT simulation. In this work, we want to explore efficient RTL fault simulation on GPU-based compute clusters that maximizes utilization across the individual experiments launched.
Tasks:
- Set up the GPU-accelerated RTL simulation from [8][9]
- Implement a new fault injection arbitration framework for the RTL simulator
- Explore multi-experiment partitioning for fault simulation on the new platform
- Compare and benchmark against a CPU-based fault exploration scheme [6][7]
References:
- [1] W. Snyder, P. Wasson, D. Galbi, et al. Verilator. https://github.com/verilator/verilator, 2019. [Online].
- [2] S. Beamer and D. Donofrio, "Efficiently Exploiting Low Activity Factors to Accelerate RTL Simulation," 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2020, pp. 1-6, doi: 10.1109/DAC18072.2020.9218632.
- [3] Kexing Zhou, Yun Liang, Yibo Lin, Runsheng Wang, and Ru Huang. 2023. Khronos: Fusing Memory Access for Improved Hardware RTL Simulation. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '23). Association for Computing Machinery, New York, NY, USA, 180–193. https://doi.org/10.1145/3613424.3614301
- [4] H. Qian and Y. Deng, "Accelerating RTL simulation with GPUs," 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, 2011, pp. 687-693, doi: 10.1109/ICCAD.2011.6105404.
- [5] Dian-Lun Lin, Haoxing Ren, Yanqing Zhang, Brucek Khailany, and Tsung-Wei Huang. 2023. From RTL to CUDA: A GPU Acceleration Flow for RTL Simulation with Batch Stimulus. In Proceedings of the 51st International Conference on Parallel Processing (ICPP '22). Association for Computing Machinery, New York, NY, USA, Article 88, 1–12. https://doi.org/10.1145/3545008.3545091
- [6] Johannes Geier and Daniel Mueller-Gritschneder. 2023. VRTLmod: An LLVM based Open-source Tool to Enable Fault Injection in Verilator RTL Simulations. In Proceedings of the 20th ACM International Conference on Computing Frontiers (CF '23). Association for Computing Machinery, New York, NY, USA, 387–388. https://doi.org/10.1145/3587135.3591435
- [7] J. Geier, L. Kontopoulos, D. Mueller-Gritschneder and U. Schlichtmann, "Rapid Fault Injection Simulation by Hash-Based Differential Fault Effect Equivalence Checks," 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 2025, pp. 1-7, doi: 10.23919/DATE64628.2025.10993266.
- [8] Guo, Zizheng, et al. "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation." https://guozz.cn/publication/gemdac-25/gemdac-25.pdf
- [9] NVlabs. GEM. GitHub. https://github.com/NVlabs/GEM
Prerequisites
- Excellent C++ and Python skills
- Good understanding of GPU programming (CUDA) or interest in learning it
- Decent knowledge of hardware description languages (Verilog, VHDL) and EDA tools (Vivado, Yosys)
- Decent knowledge of statistics and probability
Contact
Apply with CV and Transcript of Records directly to:
johannes.geier@tum.de
Supervisor:
Research Internships (Forschungspraxis)
Performance Evaluation of RISC-V Processors: Various Topics
Description
RISC-V is an emerging processor instruction set architecture (ISA) gaining increasing momentum in both industry and academia. Its open-source character allows vendors to define their own processor microarchitectures and thus design processors that meet their specific requirements.
In order to take full advantage of this possibility, designers frequently rely on early high-level simulations to identify optimal solutions. Performance (in terms of timing) is a key metric to consider during these simulations, as it affects both the processor's capability to meet timing requirements and energy efficiency.
The ETISS performance simulator, developed at TUM, is capable of providing highly accurate performance estimates early on during the design phase. It can be quickly adapted to new microarchitecture variants through a custom domain-specific language (DSL).
We are now looking for curious students to support our effort to further strengthen the simulator and its analysis environment. If you are generally interested in this research area, do not hesitate to reach out to us to discuss possible projects.
Prerequisites
Requirements for this project are:
- Fundamental knowledge of microarchitecture concepts
- Programming experience (Python or C++ are beneficial)
- Basic knowledge or interest in learning about RISC-V
Contact
Interested? Please reach out to: conrad.foik@tum.de
Please remember to include your transcript of records.
Supervisor:
Accelerating Fault Simulation at RTL on GPU Compute Clusters
RTL, fault injection, GPU, Safety, Security
Description
One of the crucial tasks in designing, testing, and verifying a digital system is the early estimation of its fault tolerance. Fault injection simulations can be used to evaluate this tolerance at different phases of development. At the Register Transfer Level (RTL), a hardware design that exists but has not yet been physically implemented can be simulated with higher accuracy than at the instruction or algorithm level. However, this accuracy comes at a cost that grows with the number of simulations performed within a fault injection analysis. For example, fault injection simulation of a CPU at RTL instead of at the Instruction Set Architecture (ISA) level increases the simulation effort because micro-architectural registers (pipeline, functional units, etc.) must be included.
To allow faster fault space exploration at RTL, one can either (a) accelerate the simulation of an individual Device Under Test (DUT) and fault, or (b) launch multiple fault simulations concurrently.
In the case of (a), state-of-the-art research on fast RTL simulations has aimed to reduce simulation cost by multi-threaded simulation on CPUs [1][2][3] or GPUs [4][5][8].
In case (b), multiple independent simulations are launched simultaneously on a distributed compute cluster. Accelerating each individual simulation matters less here, since the compute platform is fully utilized anyway through task-level parallelism across fault experiments: one CPU core per fault experiment.
Existing solutions for (b) mainly target CPU-based clusters [6][7], whereas (a) requires maximum speed for a single DUT simulation. In this work, we want to explore efficient RTL fault simulation on GPU-based compute clusters that maximizes utilization across the individual experiments launched.
Tasks:
- Set up the GPU-accelerated RTL simulation from [8][9]
- Implement a new fault injection arbitration framework for the RTL simulator
- Explore multi-experiment partitioning for fault simulation on the new platform
- Compare and benchmark against a CPU-based fault exploration scheme [6][7]
References:
- [1] W. Snyder, P. Wasson, D. Galbi, et al. Verilator. https://github.com/verilator/verilator, 2019. [Online].
- [2] S. Beamer and D. Donofrio, "Efficiently Exploiting Low Activity Factors to Accelerate RTL Simulation," 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2020, pp. 1-6, doi: 10.1109/DAC18072.2020.9218632.
- [3] Kexing Zhou, Yun Liang, Yibo Lin, Runsheng Wang, and Ru Huang. 2023. Khronos: Fusing Memory Access for Improved Hardware RTL Simulation. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '23). Association for Computing Machinery, New York, NY, USA, 180–193. https://doi.org/10.1145/3613424.3614301
- [4] H. Qian and Y. Deng, "Accelerating RTL simulation with GPUs," 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Jose, CA, USA, 2011, pp. 687-693, doi: 10.1109/ICCAD.2011.6105404.
- [5] Dian-Lun Lin, Haoxing Ren, Yanqing Zhang, Brucek Khailany, and Tsung-Wei Huang. 2023. From RTL to CUDA: A GPU Acceleration Flow for RTL Simulation with Batch Stimulus. In Proceedings of the 51st International Conference on Parallel Processing (ICPP '22). Association for Computing Machinery, New York, NY, USA, Article 88, 1–12. https://doi.org/10.1145/3545008.3545091
- [6] Johannes Geier and Daniel Mueller-Gritschneder. 2023. VRTLmod: An LLVM based Open-source Tool to Enable Fault Injection in Verilator RTL Simulations. In Proceedings of the 20th ACM International Conference on Computing Frontiers (CF '23). Association for Computing Machinery, New York, NY, USA, 387–388. https://doi.org/10.1145/3587135.3591435
- [7] J. Geier, L. Kontopoulos, D. Mueller-Gritschneder and U. Schlichtmann, "Rapid Fault Injection Simulation by Hash-Based Differential Fault Effect Equivalence Checks," 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 2025, pp. 1-7, doi: 10.23919/DATE64628.2025.10993266.
- [8] Guo, Zizheng, et al. "GEM: GPU-Accelerated Emulator-Inspired RTL Simulation." https://guozz.cn/publication/gemdac-25/gemdac-25.pdf
- [9] NVlabs. GEM. GitHub. https://github.com/NVlabs/GEM
Prerequisites
- Excellent C++ and Python skills
- Good understanding of GPU programming (CUDA) or interest in learning it
- Decent knowledge of hardware description languages (Verilog, VHDL) and EDA tools (Vivado, Yosys)
- Decent knowledge of statistics and probability
Contact
Apply with CV and Transcript of Records directly to:
johannes.geier@tum.de
Supervisor:
FPGA Prototyping of the Gemmini Accelerator
ML Acceleration, Deployment, FPGA Prototyping
Description
Gemmini is an open-source, flexible, and efficient systolic-array-based accelerator designed to speed up machine learning workloads. While functional and cycle-accurate simulators for Gemmini are already available, we are now moving towards hardware prototyping by deploying its existing hardware descriptions onto a Xilinx UltraScale FPGA board.
The project will focus on implementing Gemmini on the FPGA and targeting it using the TVM framework, an end-to-end compiler stack that optimizes and deploys machine learning models on edge devices and accelerators. A TVM backend for Gemmini that generates C code has already been developed, enabling exploration of running a variety of machine learning models directly on the FPGA prototype.
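For orientation, the snippet below sketches the generic TVM flow of importing a model and building it with TVM's plain C code generator; the existing Gemmini backend would be used in place of the generic target shown here, and the model file and input shape are placeholders.

```python
# Illustrative only: generic TVM compile flow with the plain C code generator.
# The project's Gemmini backend would replace the generic target; "model.onnx"
# and the input shape are placeholders.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

# Build with TVM's C source code generator; the emitted C can then be
# cross-compiled for the RISC-V host that drives the Gemmini accelerator.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="c", params=params)

# One way to inspect the generated C source (exact packaging varies by setup):
print(lib.get_lib().get_source())
```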
References:
https://github.com/ucb-bar/gemmini
https://people.eecs.berkeley.edu/~ysshao/assets/papers/genc2021-dac.pdf
Prerequisites
- Strong experience with FPGAs and hardware prototyping
- Solid background in embedded systems development
- Fundamental understanding of neural networks
- Basic understanding of the TVM compiler stack and its role in optimizing and deploying models to edge devices
- Proficiency in C/C++ and Python programming
- Self-motivation and ability to work independently
Contact
If you are interested, please contact me at the email address below and attach your CV and transcript.
samira.ahmadifarsani@tum.de
Supervisor:
Neural Architecture Search for Efficient Vision Transformer
Description
Vision Transformers (ViTs) have shown superior predictive performance to traditional Convolutional Neural Networks (CNNs) in vision applications [1,2]. However, typical ViTs demand large amounts of memory and compute due to their huge number of parameters, which restricts their applicability on edge devices with limited memory. Neural Architecture Search (NAS) is a method to automate the design of neural networks, and numerous NAS frameworks have successfully found high-performing, efficient networks that can be deployed on edge devices [3].
Once-For-All (OFA) [4] proposes a NAS framework that employs progressive shrinking to train a SuperNet efficiently as a one-time cost. Specialized sub-networks can then be derived from the SuperNet without additional training. However, OFA has so far only been applied to CNN-based search spaces.
This project aims to develop a NAS framework that finds specialized ViT-based networks for TinyML applications [5]. We are interested in the automated design of efficient ViTs with high accuracy and low computational and memory overhead.
[1] Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
[2] Touvron, Hugo, et al. "Training data-efficient image transformers & distillation through attention." International conference on machine learning. PMLR, 2021.
[3] Benmeziane, Hadjer, et al. "A comprehensive survey on hardware-aware neural architecture search." arXiv preprint arXiv:2101.09336 (2021).
[4] Cai, Han, et al. "Once-for-all: Train one network and specialize it for efficient deployment." arXiv preprint arXiv:1908.09791 (2019).
[5] Banbury, Colby, et al. "MLPerf tiny benchmark." arXiv preprint arXiv:2106.07597 (2021).
Prerequisites
- Proficiency in Python; familiarity with deep learning libraries such as PyTorch is a plus.
- Good knowledge of deep neural network architectures.
- Experience with Vision Transformers is an advantage.
Contact
If interested, please apply by sending your CV and current transcript of records to:
mikhael.djajapermana@tum.de
Supervisor:
Student Assistant Jobs
Working Student (SHK): Transferring Laboratory Course Material to new FPGA Systems
FPGA, HLS, Linux, RISC-V
Description
Task: Assist in transferring the course material of the Synthesis of Digital Systems laboratory to new RISC-V-based FPGA development kits:
- Set up template projects in FPGA development software
- Update the manuals to reflect the new tooling
Duration, either:
- 8h/week [01.10.2025-31.01.2026]
- 20h/week [01.10.2025-21.11.2025]
Prerequisites
- Experience working with FPGAs, preferably Microsemi devices
- Successfully completed Synthesis of Digital Systems (MSEI, MSCE, BSEI) course
- C/C++ and Python
- Linux command line (bash) on Debian-based distributions
- Embedded OS experience: Yocto Linux, Raspberry Pi, etc.
Contact
johannes.geier@tum.de