Open Theses

Important remark on this page

The following list is by no means exhaustive. There is always student work to be done in various research projects, and many of these projects are not listed here.

Don't hesitate to drop an email to any member of the chair to ask for currently available topics in their field of research. Alternatively, you can write to the chair's mailing list, which is automatically forwarded to all members of the chair.

You can also subscribe to the chair's open-thesis-topics list to be notified whenever a new topic is posted.

Abbreviations:

  • PhD = PhD Dissertation
  • BA = Bachelorarbeit, Bachelor's Thesis
  • MA = Masterarbeit, Master's Thesis
  • GR = Guided Research
  • CSE = Computational Science and Engineering

Cloud Computing / Edge Computing / IoT / Distributed Systems

Serverless computing (FaaS, Function as a Service) is emerging as a new paradigm and execution mode for the next generation of cloud-native computing due to its many advantages, such as cost-effective pay-per-use billing, high scalability, and ease of deployment. Serverless functions are designed to be fine-grained and event-driven so that they can scale elastically to accommodate workload changes. However, current public serverless computing platforms only support CPUs, not accelerators such as GPUs and TPUs. With the increasing number of deep learning applications in the cloud, it is imperative to offer GPU support.

Moreover, GPU utilization within many Kubernetes-based serverless computing platforms is suboptimal. This is primarily because the prevalent functions, centered around deep learning inference, often fail to harness the full capacity of a GPU. As a result, there is a pressing need for more fine-grained GPU-sharing mechanisms for serverless functions. Several GPU-sharing approaches exist today, such as rCUDA [1], cGPU [2], qGPU [3], vCUDA [4], MIG [5], and FaST-GShare [6]. Each of these mechanisms has distinct advantages in terms of GPU sharing, and without a proper understanding of them, it is challenging for users to select an appropriate sharing solution.

We aim to investigate these advantages comprehensively and to design a unified platform that integrates these mechanisms and automatically selects a specific mechanism for each function according to the function's attributes, targeting high GPU utilization and function SLO guarantees.
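
For a feel of what the integration layer has to do, the sketch below uses the Kubernetes Python client to create a function pod that requests a fractional GPU through a vendor-specific extended resource. This is only a sketch: the resource name follows cGPU's convention, other sharing backends (qGPU, MIG, ...) expose different resource names, and the image name is a placeholder.

```python
# Minimal sketch: requesting a GPU share for a function pod via the Kubernetes
# Python client. A unified platform would pick the extended-resource name
# (here cGPU-style "aliyun.com/gpu-mem") per function and sharing mechanism.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="inference-fn", labels={"gpu-sharing": "cgpu"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="fn",
            image="registry.example.com/resnet-inference:latest",  # placeholder image
            resources=client.V1ResourceRequirements(
                limits={"aliyun.com/gpu-mem": "4"},  # request 4 GiB of GPU memory
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```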

Similar platform: https://github.com/elastic-ai/elastic-gpu

[1] rCUDA: http://www.rcuda.net/

[2] cGPU: https://github.com/lvmxh/cgpu and https://www.alibabacloud.com/zh/solutions/cgpu

[3] qGPU: https://www.tencentcloud.com/document/product/457/42973?lang=en&pg= and https://github.com/elastic-ai/elastic-gpu

[4] vCUDA: https://github.com/tkestack/gpu-manager ; Lin Shi, Hao Chen, and Jianhua Sun, "vCUDA: GPU Accelerated High Performance Computing in Virtual Machines," 2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Rome, 2009, pp. 1-11, doi: 10.1109/IPDPS.2009.5161020.

[5] MIG: https://www.nvidia.com/en-us/technologies/multi-instance-gpu/ , https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html , and https://docs.nvidia.com/datacenter/cloud-native/kubernetes/latest/index.html

[6] FaST-GShare: https://www.ce.cit.tum.de/en/caps/news/news-single-view-en/article/paper-fast-gshare-enabling-efficient-spatio-temporal-gpu-sharing-in-serverless-computing-for-dl-inference-got-accepted-at-acm-icpp2023/

Goals:

1. Choose two of these mechanisms and integrate them into a serverless computing platform based on the elastic GPU framework.

2. Prototype the solution on GCP.

Requirements

  • Basic knowledge of FaaS platforms. Knowledge of Knative/OpenFaaS is beneficial.
  • Knowledge of Docker and Kubernetes (K8s).
  • Experience with TensorFlow or PyTorch.
  • Basic knowledge of CUDA and NVIDIA GPUs.

We offer:

  • A thesis in an area that is in high demand in industry
  • Our expertise in data science and systems areas
  • Supervision and support during the thesis
  • Access to different systems required for the work
  • Opportunity to publish a research paper with your name on it

What we expect from you:

  • Devotion and persistence (= full-time thesis)
  • Critical thinking and initiative
  • Attendance of feedback discussions on the progress of your thesis

Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu@tum.de).

The topic focuses on vertical and horizontal scaling for deep learning inference applications in serverless computing platforms. Currently, scaling in Kubernetes (K8s) and serverless frameworks mainly relies on horizontal scaling. However, deep learning (DL) applications usually have large parameter models that consume significant amounts of GPU memory. Horizontal scaling of DL applications means that each replica must load its own copy of the model parameters, exacerbating memory consumption. Meanwhile, most dedicated inference engines and systems in the cloud, like NVIDIA Triton, KServe, and KubeRay, use vertical scaling, i.e., allocating more GPU resources to a replica to meet increasing request load. Both types of scaling have their own advantages. The topic revolves around designing an auto-scaling system for deep learning applications that supports hybrid auto-scaling in serverless computing platforms, enabling SLO-aware and seamless scaling. The techniques involved include batching systems, tensor migration and tensor storage, and the related (vertical and horizontal) auto-scaling algorithms and mechanisms in K8s.
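
To make the trade-off concrete, here is a purely illustrative decision rule for such a hybrid autoscaler: grow existing replicas vertically while they still have GPU headroom, and add replicas (paying the model-loading cost) only once vertical headroom is exhausted. All names, thresholds, and the linear-throughput assumption are ours for illustration, not those of an existing system.

```python
# Illustrative hybrid scaling policy: prefer vertical scaling (growing the GPU
# share of existing replicas) over horizontal scaling, since each new replica
# must load its own copy of the model parameters.
import math
from dataclasses import dataclass

@dataclass
class Replica:
    gpu_share: float      # fraction of a GPU currently allocated, 0 < share <= 1
    rps_capacity: float   # requests/s this replica sustains at that share

def plan_scaling(replicas, incoming_rps, slo_headroom=0.8, max_share=1.0):
    """Return (vertical_grants, extra_replicas) to absorb incoming_rps."""
    demand = incoming_rps / slo_headroom          # stay below the SLO knee
    deficit = demand - sum(r.rps_capacity for r in replicas)
    grants = []
    for r in replicas:
        if deficit <= 0:
            return grants, 0
        rps_per_share = r.rps_capacity / r.gpu_share  # assume linear scaling
        headroom = max_share - r.gpu_share
        grant = min(headroom, deficit / rps_per_share)
        if grant > 0:
            grants.append((r, grant))
            deficit -= grant * rps_per_share
    # vertical headroom exhausted: fall back to horizontal scaling
    rps_full = replicas[0].rps_capacity / replicas[0].gpu_share * max_share
    return grants, math.ceil(deficit / rps_full)

pool = [Replica(gpu_share=0.5, rps_capacity=100), Replica(gpu_share=1.0, rps_capacity=200)]
print(plan_scaling(pool, incoming_rps=400))  # grow first replica, add one more
```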

Goals:

1. Propose a vertical scaling/batching mechanism, built on top of horizontal scaling in serverless computing/K8s, for more efficient and SLO-aware deep learning inference and improved GPU utilization.

2. Experiments and analysis.

Requirements

  • Familiarity with C++, Python, the Linux shell, Kubernetes, and containers.
  • Familiarity with TensorFlow or PyTorch.
  • Basic knowledge of CUDA and NVIDIA GPUs.

We offer:

  • A thesis in an area that is in high demand in industry
  • Our expertise in data science and systems areas
  • Supervision and support during the thesis
  • Access to different systems required for the work
  • Opportunity to publish a research paper with your name on it

What we expect from you:

  • Devotion and persistence (= full-time thesis)
  • Critical thinking and initiative
  • Attendance of feedback discussions on the progress of your thesis

Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu@tum.de).

With the rapid progress of electric vehicles and associated technologies, intelligent driving has already become the primary direction for the future development of electric vehicles. Intelligent driving systems have intricate internal architectures, primarily consisting of modules such as Localization, Perception, Prediction, Planning, and Control. With the rapid advancement of deep learning technology, these modules predominantly employ deep learning algorithms, which entail high computational demands and rely on the computational power and resources provided by underlying GPU hardware. However, in autonomous driving computing platforms, GPU resources are often severely constrained, and they need to simultaneously support multiple deep learning tasks. Combined with the high safety requirements of autonomous driving systems, how to efficiently allocate and isolate GPU resources for deep learning tasks within these modules has become a critical challenge.

For example, when subtasks such as Object Detection and Road Segmentation within the Perception module compete for GPU compute units, inadequate GPU resource scheduling can result in task-related latency. This, in turn, leads to unpredictable delays in the overall execution chain of autonomous driving tasks based on Directed Acyclic Graphs (DAGs).

The project will use the Baidu Apollo Autonomous Driving System as its foundational platform and NVIDIA GPUs with CUDA for the execution of deep learning tasks. On top of this, we will employ NVIDIA CUDA techniques to develop a GPU resource isolation and scheduling mechanism for the autonomous driving architecture.
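
One building block worth knowing here is NVIDIA MPS, which can cap the fraction of streaming multiprocessors each client process may occupy via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. A minimal sketch of launching two Perception subtasks with different SM caps (the task binaries and percentages are placeholders, not Apollo's actual modules):

```python
# Minimal sketch: launch two GPU tasks under NVIDIA MPS with different SM caps.
# Requires a running MPS control daemon; binaries and percentages are placeholders.
import os
import subprocess

tasks = [
    ("./run_object_detection", 60),   # may use up to 60% of the SMs
    ("./run_road_segmentation", 40),  # may use up to 40% of the SMs
]

procs = []
for cmd, percent in tasks:
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(percent))
    procs.append(subprocess.Popen([cmd], env=env))

for p in procs:
    p.wait()
```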

Goals:

1. Design a mechanism enabling GPU resource isolation and scheduling on the Apollo Autonomous Driving System.

2. Experiments and analysis.

Requirements

  • Familiarity with C++, Python, and the Linux shell.
  • Experience with TensorFlow or PyTorch.
  • Basic knowledge of CUDA and NVIDIA GPUs.

We offer:

  • A thesis in an area that is in high demand in industry
  • Our expertise in data science and systems areas
  • Supervision and support during the thesis
  • Access to different systems required for the work
  • Opportunity to publish a research paper with your name on it

What we expect from you:

  • Devotion and persistence (= full-time thesis)
  • Critical thinking and initiative
  • Attendance of feedback discussions on the progress of your thesis

Apply now by submitting your CV and grade report to Jianfeng Gu (jianfeng.gu@tum.de).

Background

With the rapid development of cloud computing in recent years, attempts have been made to bridge the widening gap between the escalating demand for complex simulations under tight deadlines and the constraints of a static HPC infrastructure, by moving to a hybrid infrastructure that combines existing HPC clusters with the seemingly infinite resources of the cloud. The cloud is flexible and elastic and can be scaled as required.

The BMW Group, which runs thousands of compute-intensive CAE (Computer-Aided Engineering) simulations every day, needs to leverage the various offerings of the cloud, along with optimal utilization of its current HPC clusters, to meet dynamic market demands. Research is therefore being carried out on scheduling moldable CAE workflows in a hybrid setup to find optimal solutions for different objectives under various constraints.

Goals

  1. The aim of this work is to develop and implement scheduling algorithms for CAE workflows on a hybrid cloud, on top of an existing simulator, using meta-heuristic approaches such as Ant Colony or Particle Swarm Optimization (see the sketch after this list). These algorithms need to be compared against baseline algorithms, some of which have already been implemented in the non-meta-heuristic space.
  2. The scheduling algorithms should be based on multi-objective optimization methods and be able to handle multiple objectives under strict constraints.
  3. The effects of workflow moldability, with regard to the type and number of required resources and the extent of a workflow's moldability, are to be studied and analyzed to find optimal solutions in the solution space.
  4. Various cloud offerings should be studied, and the scheduling algorithms should take these different billing and infrastructure models into account when making decisions about resource provisioning and scheduling.
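
A minimal, self-contained sketch of how Particle Swarm Optimization could drive the placement decision (HPC vs. cloud) for a set of workflow tasks; the runtimes, prices, and weights are invented and would come from the existing simulator and the studied billing models:

```python
# Toy PSO for placing workflow tasks on HPC (position <= 0.5) or cloud (> 0.5).
# Fitness is a weighted sum of makespan and cloud cost; all data is invented.
import numpy as np

rng = np.random.default_rng(42)
n_tasks = 12
runtime_hpc = rng.uniform(1.0, 5.0, n_tasks)    # hours on the HPC cluster
runtime_cloud = runtime_hpc * 0.8               # assumed cloud speedup
cost_cloud = runtime_cloud * 2.5                # pay-per-use price (EUR)

def fitness(position):
    on_cloud = position > 0.5                   # decode continuous position
    makespan = max(runtime_hpc[~on_cloud].sum(), runtime_cloud[on_cloud].sum())
    return makespan + 0.1 * cost_cloud[on_cloud].sum()

n_particles, iters, w, c1, c2 = 30, 200, 0.7, 1.5, 1.5
x = rng.uniform(0, 1, (n_particles, n_tasks))   # particle positions
v = np.zeros_like(x)                            # particle velocities
pbest = x.copy()
pbest_f = np.array([fitness(p) for p in x])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.uniform(size=x.shape), rng.uniform(size=x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = np.clip(x + v, 0.0, 1.0)
    f = np.array([fitness(p) for p in x])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

print("tasks placed on cloud:", np.flatnonzero(gbest > 0.5))
print("best fitness:", round(float(pbest_f.min()), 2))
```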

Requirements

  • Experience or knowledge in Scheduling algorithms
  • Experience or knowledge in the principles of Cloud Computing
  • Knowledge or Interest in heuristic and meta-heuristic approaches
  • Knowledge on algorithmic analysis
  • Good knowledge of Python

We offer:

  • Collaboration with BMW and its researchers
  • Work with industry partners and giants of cloud computing such as AWS
  • Solve tangible industry-specific problems
  • Opportunity to publish a paper with your name on it

What we expect from you:

  • Devotion and persistence (= full-time thesis)
  • Critical thinking and initiative
  • Attendance of feedback discussions on the progress of your thesis

The work is a collaboration between TUM and BMW.

Apply now by submitting your CV and grade report to Srishti Dasgupta (srishti.dasgupta@bmw.de).

Background: Social-good applications, such as environmental monitoring, require several technologies, including federated learning. Implementing federated learning requires a robust balance between the communication and computation costs involved in the hidden layers. Diligently identifying the optimal values for such learning architectures remains a challenge.

Keywords: Edge, Federated Learning, Optimization, Social Good

Research Questions: 

1. How can a decentralized federated learning framework that applies to social-good applications be designed?

2. Which optimization parameters need to be considered to target the issue efficiently?

3. Are there optimization algorithms that could deliver a tradeoff between the communication and computation parameters?

Goals: The major goals of the proposed research are given below:

1. To develop a framework that delivers a decentralized federated learning platform for social-good applications (a minimal sketch of the core aggregation step follows this list).

2. To develop at least one optimization strategy that addresses the existing tradeoffs in hidden neural network layers.

3. To compare the efficiency of the algorithms with respect to the identified optimization parameters.
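
For orientation on goal 1, the aggregation step at the heart of most federated learning frameworks, federated averaging (FedAvg), fits in a few lines; the layer shapes and client sizes below are arbitrary:

```python
# Minimal FedAvg aggregation sketch: each client trains locally, and only the
# model parameters travel over the network, weighted by local dataset size.
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Average per-layer parameters, weighted by each client's sample count."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

# Three clients, two layers each (shapes arbitrary for illustration).
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 4)), rng.normal(size=4)] for _ in range(3)]
global_model = fed_avg(clients, client_sizes=[100, 250, 50])
print([layer.shape for layer in global_model])
```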

Expectations: Students are expected to have an interest in developing frameworks, with an emphasis on federated learning; to work committedly and participate in the upcoming (mostly online) discussions and feedback sessions; and to stick to the deadlines specified in the meetings.

For more information, contact: Prof. Michael Gerndt (gerndt@tum.de) and Shajulin Benedict (shajulin@iiitkottayam.ac.in)

Background:

This thesis is in collaboration with IfTA GmbH. Details on the thesis can be found on the respective IfTA page (in German): Masterarbeit Echtzeitfähige Nutzung von mehreren Rechenkernen auf Zynq Ultrascale+ Architektur (Master's thesis: real-time use of multiple compute cores on the Zynq UltraScale+ architecture).

Contact: roman.karlstetter@tum.de

Modeling and Analysis of HPC Systems/Applications

Background:
The end of Dennard scaling, the slowing down of Moore's law, and emerging applications such as LLMs have caused considerable changes in HPC hardware architectures. One example can be seen in FPUs: a variety of arithmetic operations with different precisions are now available on modern HPC processors. This is because (1) arithmetic precision and performance (or energy efficiency) are in a trade-off relationship; and (2) it is well known that emerging ML applications generally do not need higher precisions such as the traditional FP32/FP64.

Goal and Approach:
You will explore the use of lower-precision arithmetic in scientific computing, with a particular focus on FFTs (Fast Fourier Transforms). To this end, you will look into FFT kernels used in HPC applications developed in the Plasma-PEBS project (https://cordis.europa.eu/project/id/101093261) or the DaREXA-F project (https://gauss-allianz.de/de/project/title/DaREXA-F) and explore the use of lower-precision arithmetic there. More specifically, you will modify these kernels to test various precision combinations for the variables used in them (per loop or at their definition) and observe the performance, the energy consumption, and the error of the simulation output (compared with FP64) for various inputs. You will select several of the hardware platforms available in the CAPS Cloud (https://www.ce.cit.tum.de/caps/hw/caps-cloud/) or the LRZ BEAST machines (https://www.lrz.de/presse/ereignisse/2020-11-06_BEAST/) to test your codes.
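
As a taste of the kind of experiment involved, the sketch below compares a single-precision FFT against an FP64 reference using SciPy, whose FFT (unlike numpy.fft's) preserves single-precision inputs:

```python
# Minimal sketch: error introduced by computing an FFT in FP32 instead of
# FP64. scipy.fft preserves the input precision (float32 -> complex64).
import numpy as np
from scipy import fft

x64 = np.random.default_rng(1).standard_normal(2**20)
x32 = x64.astype(np.float32)

ref = fft.fft(x64)   # FP64 reference
low = fft.fft(x32)   # FP32 result (complex64)

rel_err = np.linalg.norm(low - ref) / np.linalg.norm(ref)
print(f"dtype: {low.dtype}, relative L2 error vs FP64: {rel_err:.2e}")
```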

Requirements:

  • C/Fortran programming experience
  • In general, we would be very happy to guide anyone self-motivated, capable of critical thinking, and curious about computer science.
  • We don't want you to be too passive – you are supposed to think and try things yourself to some extent, instead of fully following our instructions step by step.
  • If your main goal is passing with any grade (e.g., 2.3), we'd suggest you look into a different topic.
  • If you are interested in getting a PhD degree in HPC, this will be a good topic.

Contact:
Dr. Eishi Arima, eishi.arima@tum.de, https://www.ce.cit.tum.de/caps/mitarbeiter/eishi-arima/
Prof. Dr. Martin Schulz

Background:

HPC systems are becoming increasingly heterogeneous as a consequence of the end of Dennard scaling, the slowing down of Moore's law, and various emerging applications, including LLMs, HPDAs, and others. At the same time, HPC systems consume a tremendous amount of power (over 20 MW in some cases), which requires sophisticated power management schemes at different levels, from individual node components to the entire system. Driven by these trends, we are studying sophisticated resource and power management techniques specifically tailored to modern HPC systems, as part of the Regale project (https://regale-project.eu/).

Research Summary:

In this work, we will focus on co-scheduling (co-locating multiple jobs on a node to minimize resource waste) and/or power management on HPC systems, with a particular focus on heterogeneous computing systems consisting of multiple different processors (CPU, GPU, etc.) or memory technologies (DRAM, NVRAM, etc.). Recent hardware components generally support a variety of resource partitioning and power control features, such as cache/bandwidth partitioning, compute resource partitioning, clock scaling, and power/temperature capping, controllable via privileged software. You will first select some of them and investigate their impact on HPC applications with respect to performance, power, energy, etc. You will then build an analytical or empirical model to predict this impact and develop a control scheme that optimizes the knob setups using your model. You will use hardware available in the CAPS Cloud (https://www.ce.cit.tum.de/caps/hw/caps-cloud/) or the LRZ BEAST machines (https://www.lrz.de/presse/ereignisse/2020-11-06_BEAST/) to conduct your study.
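
A toy example of the modeling step: fit a simple empirical model of runtime versus a power-cap knob from measurements, then query it for a good setting. All numbers are invented; real data would come from measurements on the CAPS Cloud or BEAST machines:

```python
# Toy sketch of the modeling step: fit runtime as a function of a power cap,
# then pick the cap minimizing a predicted energy bound. Data is invented.
import numpy as np

power_cap_w = np.array([80, 100, 120, 140, 160])     # knob settings (watts)
runtime_s   = np.array([410, 340, 305, 290, 285])    # measured runtimes

# Empirical model: runtime ~ a + b / power_cap (diminishing returns).
A = np.column_stack([np.ones_like(power_cap_w, dtype=float), 1.0 / power_cap_w])
coeff, *_ = np.linalg.lstsq(A, runtime_s, rcond=None)

def predict_runtime(cap):
    return coeff[0] + coeff[1] / cap

def predict_energy(cap):
    # upper bound: assumes the job draws the full cap for the whole runtime
    return cap * predict_runtime(cap)

caps = np.arange(80, 161)
best = caps[np.argmin([predict_energy(c) for c in caps])]
print(f"cap minimizing predicted energy bound: {best} W")
```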

Requirements:

  • Basic knowledge/skills in computer architecture, high-performance computing, and statistics
  • Basic knowledge/skills in surrounding areas would also help (e.g., machine learning, control theory, etc.)
  • In general, we would be very happy to guide anyone self-motivated, capable of critical thinking, and curious about computer science.
  • We don't want you to be too passive – you are supposed to think and try things yourself to some extent, instead of fully following our instructions step by step.
  • If your main goal is passing with any grade (e.g., 2.3), we'd suggest you look into a different topic.

Contact:

Dr. Eishi Arima, eishi.arima@tum.de, https://www.ce.cit.tum.de/caps/mitarbeiter/eishi-arima/

Prof. Dr. Martin Schulz

Description:
Benchmarks are an essential tool for the performance assessment of HPC systems. During the procurement process of HPC systems, both benchmarks and proxy applications are used to assess the system to be procured. New generations of HPC systems often serve the current and evolving needs of the applications for which the system is procured. Therefore, with new generations of HPC systems, the proxy applications and benchmarks used to assess the systems' performance are also selected for the specific needs of the system. Only a few of these have stayed persistent over longer periods of time. At the same time, the quality of benchmarks is typically not questioned, as they are seen only as representatives of specific performance indicators.

This work aims to provide a more systematic approach with the goal of evaluating benchmarks targeting the memory subsystem, looking at capacity, latency, and bandwidth.

Problem statement:
How can benchmarks used to assess memory performance, including cache usage, be systematically compared with one another?
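
For illustration, even a few lines of numpy expose the kind of question such a comparison must answer systematically, namely what a reported "bandwidth" actually counts:

```python
# Minimal STREAM-triad-like measurement in numpy: a = b + s * c.
import time
import numpy as np

n = 50_000_000                       # ~400 MB per float64 array
b, c, a = np.ones(n), np.ones(n), np.empty(n)
s = 3.0

t0 = time.perf_counter()
np.multiply(c, s, out=a)             # a = s * c (avoids a temporary array)
np.add(a, b, out=a)                  # a += b
dt = time.perf_counter() - t0

# STREAM's nominal count: one read each of b and c, one write of a. The
# two-step numpy evaluation actually moves more data (a is written twice and
# read once), exactly the kind of discrepancy a systematic comparison of
# memory benchmarks has to make explicit.
bytes_moved = 3 * n * 8
print(f"nominal triad bandwidth: {bytes_moved / dt / 1e9:.1f} GB/s")
```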

Project Description

Description:
Benchmarks are an essential tool for the performance assessment of HPC systems. During the procurement process of HPC systems, both benchmarks and proxy applications are used to assess the system to be procured. With new generations of HPC systems, the selected proxy applications and benchmarks are often exchanged, and benchmarks for the specific needs of the system are selected. Only a few of these have stayed persistent over longer periods of time. At the same time, the quality of benchmarks is typically not questioned, as they are seen only as representatives of specific performance indicators.

This work aims to provide a more systematic approach with the goal of evaluating benchmarks targeting network performance, namely regarding MPI (Message Passing Interface), in both functional tests as well as in benchmark applications.

Problem statement:
How can benchmarks used to assess network performance, using MPI routines, be systematically compared with one another?
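
A minimal mpi4py ping-pong shows the measurement core that such network benchmarks build on; real suites differ precisely in warm-up, message-size schedules, and statistics, which is what the comparison has to capture:

```python
# Minimal MPI ping-pong probe (run: mpirun -n 2 python pingpong.py). Real
# suites (OSU, IMB, ...) differ in warm-up, message-size schedules, and
# statistics, which is exactly what a systematic comparison must account for.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 1000

for size in (8, 1024, 1 << 20):
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    dt = MPI.Wtime() - t0
    if rank == 0:
        half_rtt = dt / (2 * reps)   # one-way time estimate
        print(f"{size:>8} B: {half_rtt * 1e6:8.2f} us, {size / half_rtt / 1e9:6.2f} GB/s")
```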

Project Description

Description:
Benchmarks are an essential tool for the performance assessment of HPC systems. During the procurement process of HPC systems, both benchmarks and proxy applications are used to assess the system to be procured. New generations of HPC systems often serve the current and evolving needs of the applications for which the system is procured. Therefore, with new generations of HPC systems, the proxy applications and benchmarks used to assess the systems' performance are also selected for the specific needs of the system. Only a few of these have stayed persistent over longer periods of time. At the same time, the quality of benchmarks is typically not questioned, as they are seen only as representatives of specific performance indicators.

This work aims to evaluate benchmarks for input and output (I/O) performance, providing a systematic approach to evaluating benchmarks that target the read and write performance of the different characteristics seen in application behavior and mimicked by benchmarks.

Problem statement:
How can benchmarks used to assess I/O performance be systematically compared with one another?
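
The sketch below is a deliberately naive write-throughput measurement; the ways in which it is naive (page-cache effects, fsync semantics, access pattern) are exactly the characteristics a systematic I/O benchmark comparison has to make explicit:

```python
# Deliberately naive write benchmark: sequential 1 MiB writes to one file.
# Without the fsync, much of the data may only reach the page cache; striping,
# O_DIRECT, and access patterns are further knobs real I/O benchmarks control.
import os
import time

block = b"\0" * (1 << 20)        # 1 MiB
n_blocks = 1024                  # 1 GiB total

t0 = time.perf_counter()
fd = os.open("testfile.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for _ in range(n_blocks):
    os.write(fd, block)
os.fsync(fd)                     # force data to storage before timing stops
os.close(fd)
dt = time.perf_counter() - t0

print(f"write throughput: {n_blocks / dt:.1f} MiB/s")
os.remove("testfile.bin")
```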

Thesis Description

Resource Management for Supercomputing

Background:

Dynamic resources can potentially improve several system metrics.
However, to unlock this potential, HPC applications need to be implemented with support for dynamic resource reconfigurations.
An important step in this direction is the inclusion of dynamic resource interfaces in widely used HPC libraries, to support application developers in the implementation of adaptive applications.

Thesis Goal:   

The goal of this thesis is to design, implement, and evaluate such dynamic resource extensions for a widely used HPC library such as PETSc, p4est, or hypre. The performance of the extensions should be evaluated on a real HPC system.

Contact:

Dominik Huber (domi.huber@tum.de)

Background: Dynamic resource management promises better utilization of resources in HPC systems. However, this requires significant changes to current scheduling strategies, as the dynamically varying resource requirements and utilization of running applications need to be taken into account. So far, there exists only limited knowledge about efficient scheduling strategies in such scenarios. Therefore, new strategies need to be explored based on scheduling simulations.

Thesis Goal: The goal of this thesis is to explore new scheduling strategies in a scenario of dynamic resource management. To this end, the ElastiSim scheduling simulator should be used to evaluate different scheduling policies and application models.

Contact: Dominik Huber (domi.huber@tum.de), Prof. Martin Schulz, Prof. Martin Schreiber

Background:

One important optimization objective in HPC systems is to improve the system throughput per unit of energy.
Dynamic resource management could potentially improve this metric. However, to apply beneficial reconfiguration actions, precise performance and energy profiles for different resource configurations need to be available to the resource manager.

Thesis Goal:   

The goal of this thesis is to write and evaluate benchmarks for resource-dynamic applications and to create application models based on the results.

Contact:

Dominik Huber (domi.huber@tum.de) 

Background: Scheduling resources on systems with dynamic resource management is a challenging task. One important aspect in this context is the description of the dynamic resource requirements and performance behavior of applications as input to dynamic scheduling strategies.

Thesis Goal: The goal of this thesis is to develop a Domain Specific Language to express dynamic resource requirements of HPC applications. A next step would then be the development of scheduling strategies based on the provided data.

Contact: Dominik Huber (domi.huber@tum.de), Prof. Martin Schulz, Prof. Martin Schreiber

Background:

Resource management software needs to keep track of resources and resource assignments in the system.
The data structures currently used, such as static bit arrays or explicit representations, often do not consider the challenges of dynamically changing resources or the memory overhead in large-scale systems. New data structures need to be explored to address these challenges in future resource management software.

Thesis Goal:

The goal of this thesis is to explore new data structures and implementations for efficient resource tracking in dynamic environments.
The performance of the new approaches should be evaluated and compared to currently used implementations, such as the "process sets" in the PMIx Reference RunTime Environment (PRRTE).
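
As one flavor of the design space, a run-length/interval representation stores contiguous resource ranges instead of one bit per resource, so its memory footprint scales with fragmentation rather than with system size; a minimal sketch:

```python
# Minimal sketch of an interval-based resource set: sorted, non-overlapping
# [start, end) ranges instead of a bit per resource.
import bisect

class IntervalSet:
    def __init__(self):
        self.ivs = []  # sorted list of (start, end) tuples, end exclusive

    def add(self, start, end):
        """Insert [start, end), merging with overlapping/adjacent ranges."""
        i = bisect.bisect_left(self.ivs, (start, start))
        # merge backwards with a predecessor that reaches into [start, end)
        if i > 0 and self.ivs[i - 1][1] >= start:
            i -= 1
            start = min(start, self.ivs[i][0])
        # merge forwards over every range that overlaps or touches
        while i < len(self.ivs) and self.ivs[i][0] <= end:
            end = max(end, self.ivs[i][1])
            del self.ivs[i]
        self.ivs.insert(i, (start, end))

    def __contains__(self, x):
        i = bisect.bisect_right(self.ivs, (x, float("inf"))) - 1
        return i >= 0 and self.ivs[i][0] <= x < self.ivs[i][1]

    def count(self):
        return sum(e - s for s, e in self.ivs)

s = IntervalSet()
s.add(0, 64); s.add(128, 192); s.add(64, 128)   # coalesces into [0, 192)
print(s.ivs, 100 in s, s.count())               # [(0, 192)] True 192
```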

Contact:

Dominik Huber (domi.huber@tum.de), Prof. Martin Schulz, Prof. Martin Schreiber

Memory Management and Optimizations on Heterogeneous HPC Architectures

Background
sys-sage (https://github.com/caps-tum/sys-sage) is a library for capturing and manipulating the hardware topology of compute systems and their attributes. So far, we have mainly focused on classical computing elements, such as HPC nodes, CPUs, or GPUs. However, we would like to extend sys-sage's functionality to quantum computing systems.

Quantum Computers are very complex units, and in order to use them and schedule quantum algorithms on them, we need to solve a lot of problems, especially when we consider more complex algorithms consisting of different quantum gates and many qubits. We would like to use sys-sage to represent all the hardware and quantum physics-based properties of a Quantum Computing system. This may include static properties, such as mapping of qubits and available gates, or qubits that can interact with each other. It will also include dynamic properties, such as stability of particular qubits, or other current physical properties of the system or its parts.

As this task is more complex, we can split this general focus into multiple BA/MA/GR sub-tasks. They can be adjusted to the student's education status (BSc/MSc) and level of expertise in the area.

Topics in this area will be co-advised with LRZ.

Contact:
In case of interest or any questions, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.

Published on 1.9.2023 (27)

GPUScout is a performance analysis tool developed at TUM that analyzes NVIDIA CUDA kernels with the aim of identifying common pitfalls related to data movements on a GPU. It combines static SASS code analysis, PC stall sampling, and NCU metrics collection to identify bottlenecks, assess their severity, and provide additional information about the identified code section. As of now, it presents its findings in textual form, printed in the terminal.

The goal of this thesis is to explore options for extending GPUScout's support to AMD GPUs. This includes evaluating a strategy for interacting with the APIs and debug interfaces available on AMD GPUs, and for using these to provide insight into the source code: static code analysis combined with some kind of metrics or sampling mechanism, with the goal of providing information about the bottlenecks in a kernel, their location in the source code, and their severity as indicated by the metrics. Depending on the profiling and debugging tools and interfaces available on the AMD side, the solution may be more or less similar to the current GPUScout logic. As a Guided Research project, this work can focus on researching the functionality and remain in the design/PoC phase.

Contact:
In case of interest, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.

Published on 25.1.2024 (30)

GPUScout is a performance analysis tool developed at TUM that analyzes NVIDIA CUDA kernels with the aim of identifying common pitfalls related to data movements on a GPU. It combines static SASS code analysis, PC stall sampling, and NCU metrics collection to identify bottlenecks, assess their severity, and provide additional information about the identified code section. As of now, it presents its findings in textual form, printed in the terminal. The output provides all necessary information; however, the information should be presented in a more user-friendly way.

The goal of this work is to design a user interface that offers users greater support in identifying GPU memory-related bottlenecks and that assists them in mitigating these bottlenecks. We will develop a concept of what information should be presented and in which form, and we will implement a prototype to verify the concept. An example optimization procedure will be conducted to showcase the effectiveness of the implemented frontend.

Contact:
In case of interest, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.

Updated on 25.1.2024 (24)

Background:

The DEEP-SEA project (https://www.deep-projects.eu) is a joint European effort of about a dozen leading universities and research institutions to develop software for coming exascale supercomputing architectures. CAPS TUM, as a member of the project, is responsible for developing an environment for analyzing application and system performance in terms of data movements. Data movements are very costly compared to computation. Therefore, suboptimal memory access patterns in an application can have a huge negative impact on the overall performance. Conversely, analyzing and optimizing the data movements can massively increase the overall performance of parallel applications.

We develop a toolchain with the goal of creating a full in-depth analysis of an application's memory-related behaviour. It consists of the tools Mitos (https://github.com/caps-tum/mitos), sys-sage (https://github.com/caps-tum/sys-sage), and MemAxes (https://github.com/caps-tum/MemAxes). Mitos collects the information about memory accesses; sys-sage captures the memory and compute topology and capabilities of a system and provides the link between the hardware and the performance data; finally, MemAxes analyzes and visualizes the outputs of the aforementioned projects.

There is an existing PoC of these tools, and we plan to extend and improve the projects massively to fit the needs of state-of-the-art and future HPC systems, which are expected to be the core of upcoming exascale supercomputers. Our work and research touch on modern heterogeneous architectures, patterns, and designs, and aim at enabling users to run extreme-scale applications while utilizing as much of the underlying hardware as possible.

Context:

  • The current implementation of Mitos/MemAxes collects PEBS samples of memory accesses (via perf), i.e. every n-th memory operation is measured and stored.
  • Collecting aggregate data alongside the PEBS samples could help increase the overall understanding of system and application behaviour.

Tasks/Goals: 

  • Analyse which aggregate data are meaningful and possible to collect (total traffic, bandwidth utilization, number of loads/stores, ...?) and how to collect them (PAPI? LIKWID? perf?) – see the sketch after this list
  • Ensure that these measurements don't interfere with the existing collection of PEBS samples.
  • Design and implement a low-overhead solution.
  • Find a way to visualise/present the data in the MemAxes tool (or a different visualisation tool if MemAxes is not suitable).
  • Finally, present how the newly collected data help users understand the system, or hint at whether/how to optimize.
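
A possible starting point for the collection side is wrapping perf stat in machine-readable mode around the measured region, as sketched below; the event names are examples and vary per CPU, and a real integration inside Mitos would use the perf_event_open/PAPI APIs directly so as not to disturb the PEBS sampling:

```python
# Sketch: collect aggregate counters with "perf stat" in CSV mode (-x,)
# around a measured command.
import subprocess

def perf_stat(cmd, events=("instructions", "cache-misses", "LLC-load-misses")):
    res = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd,
        capture_output=True, text=True,
    )
    counts = {}
    for line in res.stderr.splitlines():        # perf stat reports on stderr
        fields = line.split(",")                # CSV: value,unit,event,...
        if len(fields) > 2 and fields[0].strip("<>").isdigit():
            counts[fields[2]] = int(fields[0])
        # "<not supported>" / "<not counted>" lines are skipped
    return counts

print(perf_stat(["sleep", "1"]))
```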

Contact:

In case of interest, please contact Stepan Vanecek (stepan.vanecek@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz).

Updated on 12.09.2022

Various MPI-Related Topics

Please Note: MPI is a high-performance programming model and communication library designed for HPC applications. It is designed and standardised by the members of the MPI Forum, which includes various research, academic, and industrial institutions. The current chair of the MPI Forum is Prof. Dr. Martin Schulz. The following topics are all available as Master's Thesis and Guided Research. They will be advised and supervised by Prof. Dr. Martin Schulz himself, with the help of researchers from the chair. If you are very familiar with MPI and parallel programming, please don't hesitate to drop a mail to Prof. Dr. Martin Schulz. These topics are mostly related to current research and active discussions in the MPI Forum, which are subject to standardisation in the coming years. Contributions achieved in these topics may make you a contributor to the MPI standard, and your implementation may become part of the code base of Open MPI. Many of these topics require collaboration with other MPI research bodies, such as Lawrence Livermore National Laboratory and the Innovative Computing Laboratory. Some of these topics may require you to attend MPI Forum meetings, which take place in the late afternoon (to synchronise across time zones worldwide). Generally, these advanced topics may require more effort to understand and may be more time-consuming – but they are more prestigious, too.

LAIK is a new programming abstraction developed at LRR-TUM

  • Decouple data decomposition and computation, while hiding communication
  • Applications work on index spaces
  • Mapping of index spaces to nodes can be adaptive at runtime
  • Goal: dynamic process management and fault tolerance
  • Current status: works on standard MPI, but no dynamic support

Task 1: Port LAIK to Elastic MPI

  • New model developed locally that allows process additions and removal
  • Should be very straightforward

Task 2: Port LAIK to ULFM

  • Proposed MPI FT Standard for “shrinking” recovery, prototype available
  • Requires refactoring of code and evaluation of ULFM

Task 3: Compare performance with direct implementations of same models on MLEM

  • Medical image reconstruction code
  • Requires porting MLEM to both Elastic MPI and ULFM

Task 4: Comprehensive Evaluation

ULFM (User-Level Fault Mitigation) is the current proposal for MPI Fault Tolerance

  • Failures make communicators unusable
  • Once detected, communicators can be “shrunk”
  • Detection is active and synchronous by capturing error codes
  • Shrinking is collective, typically after a global agreement
  • Problem: can lead to deadlocks

Alternative idea

  • Make shrinking lazy and with that non-collective
  • New, smaller communicators are created on the fly

Tasks:

  • Formalize non-collective shrinking idea
  • Propose API modifications to ULFM
  • Implement prototype in Open MPI
  • Evaluate performance
  • Create proposal that can be discussed in the MPI forum

ULFM works on the classic MPI assumptions

  • Complete communicator must be working
  • No holes in the rank space are allowed
  • Collectives always work on all processes

Alternative: break these assumptions

  • A failure creates communicator with a hole
  • Point to point operations work as usual
  • Collectives work (after acknowledgement) on reduced process set

Tasks:

  • Formalize “hole-y” shrinking
  • Propose new API
  • Implement prototype in Open MPI
  • Evaluate performance
  • Create proposal that can be discussed in the MPI Forum

With MPI 3.1, MPI added a second tools interface: MPI_T

  • Access to internal variables 
  • Query, read, write
  • Performance and configuration information
  • Missing: event information using callbacks
  • New proposal in the MPI Forum (driven by RWTH Aachen)
  • Add event support to MPI_T
  • Proposal is rather complete

Tasks:

  • Implement prototype in either Open MPI or MVAPICH
  • Identify a series of events that are of interest
  • Message queuing, memory allocation, transient faults, …
  • Implement events for these through MPI_T
  • Develop tool using MPI_T to write events into a common trace format
  • Performance evaluation

Possible collaboration with RWTH Aachen

PMIx is a proposed resource management layer for runtimes (for Exascale)

  • Enables MPI runtime to communicate with resource managers
  • Came out of previous PMI efforts as well as the Open MPI community
  • Under active development / prototype available on Open MPI

Tasks: 

  • Implement PMIx on top of MPICH or MVAPICH
  • Integrate PMIx into SLURM
  • Evaluate implementation and compare to Open MPI implementation
  • Assess and possibly extend interfaces for tools 
  • Query process sets

MPI was originally intended as runtime support, not as an end-user API

  • Several other programming models use it that way
  • However, often not first choice due to performance reasons
  • Especially task/actor based models require more asynchrony

Question: can more asynchronous models be added to MPI?

  • Example: active messages

Tasks:

  • Understand communication modes in an asynchronous model
  • Examples: Charm++: actor-based (UIUC); Legion: task-based (Stanford, LANL)
  • Propose extensions to MPI that capture this model better
  • Implement prototype in Open MPI or MVAPICH
  • Evaluation and Documentation

Possible collaboration with LLNL and/or BSC

MPI can and should be used for more than Compute

  • Could be runtime system for any communication
  • Example: traffic to visualization / desktops

Problem:

  • Different network requirements and layers
  • May require different MPI implementations
  • Common protocol is unlikely to be accepted

Idea: can we use a bridge node with two MPIs linked to it

  • User should see only two communicators, but same API

Tasks:

  • Implement this concept coupling two MPIs
  • Open MPI on compute cluster and TCP MPICH to desktop
  • Demonstrate using on-line visualization streaming to front-end
  • Document and provide evaluation
  • Warning: likely requires good understanding of linkers and loaders

Field-Programmable Gate Arrays

Field Programmable Gate Arrays (FPGAs) are considered to be the next generation of accelerators. Their advantages range from improved energy efficiency for machine learning to faster routing decisions in network controllers. If you are interested in one of these topics, please send your CV and transcript of records to the specified email address.

Our chair offers various topics available in this area:

  • Machine Learning: Your task will be to implement different ML algorithms on FPGAs; our main focus is data distillation. (dirk.stober@tum.de)
  • Open-Source EDA tools: If you are interested in exploring open-source EDA tools, especially High-Level Synthesis, you can do an exploration of the available tools. (dirk.stober@tum.de)
  • Memory on FPGA: Exploration of memory on FPGAs and development of profiling tools for AXI interconnects. (dirk.stober@tum.de)
  • Quantum Computing: Your task will be to explore architectures that harness the power of traditional computer architecture to control quantum operations and flows. We currently focus on the control of superconducting qubits and neutral atoms. (xiaorang.guo@tum.de)
  • Direct network operations: Here, FPGAs are wired close to the networking hardware itself, which allows bypassing the network stack that regular CPU-style communication would be exposed to. Your task would be to investigate FPGAs that can interact with the network more closely than CPU-based approaches. (martin.schreiber@tum.de)
  • Linear algebra: Your task would be to explore strategies to accelerate existing linear algebra routines on FPGA systems, taking application requirements into account. (martin.schreiber@tum.de)
  • Varying accuracy of computations: The granularity of current floating-point computations is 16, 32, or 64 bit. Your work would be on tailoring the accuracy of computations to what is really required. (martin.schreiber@tum.de)
  • ODE solver: You would work on an automatic toolchain for solving ODEs originating from computational biology. (martin.schreiber@tum.de)

Background

The qubit control system bridges the quantum software stack and the physical backend. Typically, it includes a quantum control processor, the required memory, and a signal generator (with QICK, for example). So far, qubits are mainly controlled by radio-frequency waveforms, but current control technologies based on commercial AWGs (arbitrary waveform generators) or FPGA-based DACs/ADCs lack scalability and efficiency. With the newly published RFSoC FPGAs, we can develop the control logic and waveform generators on a single board. However, building new systems that integrate all of these components demands additional effort.

Task

1. Study the current system architecture of qubit control and identify which parts/interfaces we need to and can improve.

2. Optimize the current control processor in terms of frequency and ISA development.

3. Integrate the control processor with the signal generators; additional memory design may be necessary.

As this task is complex, we will split this general focus into multiple sub-tasks. They can be adjusted to the student's education status (BSc/MSc) and level of expertise in the area.

Topics in this area may be carried out in collaboration with Fraunhofer or MPQ.

Requirement

Experience in programming with VHDL/Verilog.

Contact:
In case of interest or any questions, please contact Xiaorang Guo (xiaorang.guo@tum.de) at the Chair for Computer Architecture and Parallel Systems (Prof. Schulz) and attach your CV & transcript of records.

Background: Neutral atoms show great promise as qubit candidates for fully controllable and scalable quantum computers. Yet the creation of a two-dimensional, defect-free atomic array (sorting) within a short time frame remains a significant challenge. This thesis explores the use of FPGA-based acceleration designs to reduce the time overhead of the sorting process and release the full potential of quantum advantages.

Thesis Goal: The objective of this thesis is to design a sorting unit on an FPGA board using Verilog or High-Level Synthesis (HLS) to achieve significant acceleration when compared to CPU/GPU implementations. The design should prioritize low latency and high parallelization to enhance the overall efficiency of the sorting process.

Contact: Xiaorang Guo (xiaorang.guo@tum.de), Jonas Winklmann (jonas.winklmann@tum.de), Prof. Martin Schulz

Various Thesis Topics in Collaboration with Leibniz Supercomputing Centre

We have a variety of open topics. Get in contact with Josef Weidendorfer or Amir Raoofy.

Description:
The versatility and popularity of the Python programming language make it one of the most used languages for scientific computing, on all scales from laptop to HPC cluster, and for a broad range of tasks: from quick code prototyping to extensive linear algebra computations (Python BLAS), to the processing of large real-world or simulated datasets (yt, astropy, ...), to the modeling of varied environments involving complex equation systems, and much more. Yet the high efficiency demands of HPC hardware require the adoption of specialized libraries and compilers in order to achieve parallel and scalable workloads and to leverage the advantages offered by heterogeneous computing (GPUs and accelerators).

In previous work [1,2], we evaluated the performance of the Intel Distribution for Python for the post-processing of astrophysical simulations with the yt package [3], against the baseline Python provided by Anaconda. Our study encompassed both interpreted Python scripts and binary code compiled with Cython, on Intel Xeon CPUs.

We now aim to accelerate Python workflows on Intel GPUs, targeting both individual workstations with integrated or discrete graphics cards and the HPC-grade GPUs of SuperMUC-NG Phase 2 (codename Ponte Vecchio, PVC). In particular, we plan to use the Intel Data Parallel Extensions for Python (DPEP), which include their own implementations of the numpy computing library (dpnp) and the numba JIT compiler (numba-dpex), in order to evaluate the speedup Intel GPUs provide to compute-intensive scientific workloads with minimal to no interface changes. The specific use cases will be defined together with the candidate. We will begin with official numpy or dpnp tutorials.
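
The intended "minimal interface change" usage looks roughly like the sketch below, which runs the same matmul once with numpy on the CPU and once with dpnp on the default SYCL device; exact device-selection and synchronization details depend on the installed DPEP versions:

```python
# Sketch of DPEP's "minimal interface change" promise: the same matmul once
# with numpy on the CPU and once with dpnp on the default SYCL device.
import time
import numpy
import dpnp  # Data Parallel Extension for Numpy

def timed_matmul(xp, n=4096):
    a = xp.ones((n, n), dtype=xp.float32)
    b = xp.ones((n, n), dtype=xp.float32)
    xp.matmul(a, b)                  # warm-up (may trigger JIT / data movement)
    t0 = time.perf_counter()
    c = xp.matmul(a, b)
    checksum = float(c[0, 0])        # forces completion on the device
    return time.perf_counter() - t0, checksum

for xp in (numpy, dpnp):
    dt, _ = timed_matmul(xp)
    print(f"{xp.__name__:>5}: {dt:.3f} s")
```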

As a bonus, the generated code can also be profiled with Intel tools such as APS and VTune for bottlenecks, inefficiencies, and hardware utilization. If further optimizations are necessary, they may be achieved through the dpctl library (short for data-parallel control, also part of DPEP) for manual allocations and data management. Among other things, dpctl enables explicit use of the Unified Shared Memory (USM) capabilities of Intel GPUs; the underlying programming model and back-end layers are based on SYCL, and thus aligned with research previously conducted by Dr. Cielo [4]. As with the previous projects, an open channel with Intel staff will be available for support and, if applicable, close collaborations and promotion of the results.

Given the speedups measured in preliminary tests (> 10x for single-precision matrix multiplication with numpy), we are confident that our results will unlock qualitatively new possibilities for Python users at LRZ, and thus new HPC use cases for SuperMUC-NG. The simplest among our findings will also be included in the best-practice Python courses offered at LRZ and/or otherwise published.

Bibliography

[1] Cielo et al., "Speeding Simulation Analysis up with yt and Intel Distribution for Python," Intel Parallel Universe Magazine 38, 2019.
[2] Cielo et al., "Honing and Proofing Astrophysical Codes on the Road to Exascale," Future Generation Computer Systems 112, 2020.
[3] Turk et al., "yt: A Multi-Code Analysis Toolkit for Astrophysical Simulation Data," ApJS 192, 2011.
[4] Cielo et al., "DPEcho: General Relativity with SYCL for the 2020s and Beyond," Intel Parallel Universe Magazine 51, 2023.

Contact

Dr. Salvatore Cielo, Leibniz Rechenzentrum - cielo@lrz.de

Description:

Synthetic benchmarks such as STREAM, HPL, and HPCG play a crucial role in assessing the performance of HPC systems. However, the accuracy and reliability of benchmark results heavily depend on the proper configuration and execution of the benchmarking process. This thesis proposes the design and development of a comprehensive Python package to detect and verify the different stages of the benchmarking process, including the verification of setup and execution. The following key components and functionalities are to be designed and developed as part of this work (a minimal sketch of the system-verification component follows the task list):

Tasks:

  • System Verification:
    • Assess and validate hardware components to ensure that the system meets the requirements specified by a concrete benchmark.
    • Check the CPU architecture, memory availability, and other relevant system parameters.
  • Operating System Level Knobs:
    • Inspect the presence of knobs, such as CPU frequency settings and the operating system version.
    • Verify the applicability of these knobs for benchmark execution.
  • Software Stack Validation:
    • Confirm the presence and compatibility of the software modules required by the benchmark.
    • Verify the loading of modules to ensure a consistent and reliable software environment.
  • Benchmarking Execution: Validate the execution of benchmarks using test cases.
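
A minimal sketch of the system-verification component, checking the host against a declared benchmark requirement; the requirement schema is an assumption about the package's eventual (e.g., YAML-based) input format:

```python
# Minimal sketch of the system-verification stage: compare the host (Linux)
# against requirements declared for a benchmark. The schema is an assumption;
# the real package would load it from YAML and report via pytest.
import os

def system_facts():
    mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {
        "cpus": os.cpu_count(),
        "mem_gib": mem_bytes / 2**30,
        "arch": os.uname().machine,
    }

def verify(requirements):
    facts = system_facts()
    failures = []
    if facts["cpus"] < requirements.get("min_cpus", 0):
        failures.append(f"need >= {requirements['min_cpus']} CPUs, have {facts['cpus']}")
    if facts["mem_gib"] < requirements.get("min_mem_gib", 0):
        failures.append(f"need >= {requirements['min_mem_gib']} GiB, have {facts['mem_gib']:.1f}")
    if "arch" in requirements and facts["arch"] != requirements["arch"]:
        failures.append(f"need arch {requirements['arch']}, have {facts['arch']}")
    return failures

# e.g., what an HPL run on a given test system might declare:
print(verify({"min_cpus": 16, "min_mem_gib": 64, "arch": "x86_64"}) or "system OK")
```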

Requirements:

  • Proficiency in Python is required; knowledge of packages like pyyaml and pytest is beneficial.
  • Prior knowledge of HPC benchmarks and HPC system architecture is beneficial.

Contact:
In case of interest or any questions, please contact Arjun Parab (parab@lrz.de) and Amir Raoofy (amir.raoofy@lrz.de).

Description:
Addressing the challenge of reproducibility in high-performance computing (HPC) environments, this research explores methods to enhance the reproducibility of job runs by capturing essential contextual elements such as job scripts, module configurations (using tools like Spack), and system configurations.
The aim is to develop a systematic approach that enables researchers to reproduce their computational experiments reliably even after a significant period of time, addressing issues related to software dependencies, environment changes, and hardware variations.

Tasks:

  • Literature Review:
    • Review existing literature on reproducibility in HPC environments.
    • Identify challenges and gaps in current approaches.
    • Analyze existing methods and tools used for capturing job scripts, module configurations, and system details.
  • Reproducibility Approaches:
    • Define a set of guidelines for capturing and documenting job scripts, module configurations, and system configurations.
    • Establish best practices for ensuring reproducibility in HPC workflows.
  • Prototype Implementation:
    • Design and implement a prototype tool or framework for capturing job scripts, module configurations, and system details automatically (a minimal sketch follows this list).
    • Ensure compatibility with common HPC environments and job schedulers.
    • Implement functionality for reproducibility by using captured information.
  • Evaluation and Validation:
    • Evaluate the effectiveness and usability of the developed framework through experiments and case studies
    • Validate the ability of the framework to accurately reproduce job runs over time.
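
A minimal sketch of such a capture step, recording the scheduler context, loaded modules, and (if available) Spack packages into a JSON manifest; the SLURM and environment-modules variable names are standard, everything else is an assumption:

```python
# Minimal sketch of the capture stage: snapshot the job script, scheduler
# context, and software environment into a JSON manifest next to the output.
import json
import os
import platform
import subprocess
import time

def capture(job_script="job.sh"):
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "host": platform.node(),
        "kernel": platform.release(),
        "slurm": {k: v for k, v in os.environ.items() if k.startswith("SLURM_")},
        "modules": os.environ.get("LOADEDMODULES", "").split(":"),
    }
    if os.path.exists(job_script):
        manifest["job_script"] = open(job_script).read()
    try:  # record installed Spack packages, if Spack is on the PATH
        manifest["spack"] = subprocess.check_output(
            ["spack", "find", "--format", "{name}@{version}"], text=True
        ).splitlines()
    except (FileNotFoundError, subprocess.CalledProcessError):
        manifest["spack"] = None
    with open("reproducibility_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

capture()
```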

Requirements:

  • Prior understanding of HPC environments
  • Understanding of software packaging and dependency management tools like Spack is beneficial
  • Proficiency in Python, and bash is beneficial

Contact:
In case of interest or any questions, please contact Arjun Parab (parab@lrz.de) and Amir Raoofy (amir.raoofy@lrz.de).

Description:
The communication framework employed within MPI runtime environments considers various communication modes for message transmission, such as fully asynchronous, eager, or synchronous. Switching between these modes can be represented by a piecewise-linear model with flexibility in the number of pieces. However, while this modeling approach is suitable for remote communication across distinct nodes, it does not cover local communication, where processes use shared memory rather than network cards. In such cases, significantly different performance and slightly different behaviour are common, owing to the different communication protocols used.
This open thesis gives students the opportunity to employ modeling and simulation methodologies to analyze, anticipate, and enhance the efficiency of MPI communication within shared memory.
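
To illustrate the modeling idea, the sketch below evaluates a piecewise-linear time model with protocol switch points; the breakpoints and coefficients are invented and would be calibrated separately for shared-memory communication in the thesis:

```python
# Illustrative piecewise-linear model T(s) = latency + s / bandwidth, with one
# piece per protocol regime. All breakpoints/coefficients are invented;
# calibrating them per protocol (shared memory vs. network) is the real task.
import bisect

# (max_message_size_bytes, latency_seconds, bandwidth_bytes_per_second)
PIECES = [
    (4096,         4.0e-7, 6.0e9),   # eager protocol: low startup cost
    (65536,        9.0e-7, 9.0e9),   # segmented/pipelined regime
    (float("inf"), 2.5e-6, 11.0e9),  # rendezvous: handshake, zero-copy
]

def predicted_time(size_bytes):
    idx = bisect.bisect_left([p[0] for p in PIECES], size_bytes)
    _, latency, bandwidth = PIECES[idx]
    return latency + size_bytes / bandwidth

for s in (8, 4096, 1 << 20):
    print(f"{s:>8} B -> {predicted_time(s) * 1e6:7.2f} us")
```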

Tasks:

  • Literature review: Conduct a literature review on high-performance computing (HPC) simulators, particularly those focused on studying the performance of MPI communication on complex platforms. Based on the review, identify the most suitable simulator.
  • Communication Analysis: Analyze MPI communication in shared memory architectures using benchmarking tools and performance profiling techniques.
  • Model Development: Develop mathematical models to represent the performance characteristics of MPI shared memory communication.
  • Validation and Evaluation: Validate the proposed models through simulation experiments and empirical studies on HPC clusters with shared memory configurations.

Requirements:

  • Good background in parallel computing and HPC systems.
  • Proficiency in C/C++ programming languages.
  • Experience with parallel computing frameworks (MPI).

Contact: 

ehab.saleh@lrz.de

Description:
It is essential to observe and evaluate the power efficiency of HPC systems in order to control their energy usage. Programs used in high-performance computing (HPC) systems, like MPI applications, contribute substantially to the total energy consumption of HPC systems. Simulations are widely used for examining how these applications perform under different conditions, and they can also give a prior estimate of power consumption trends at large scale. The primary goal of this thesis is to precisely predict the energy usage of MPI applications through simulation. To achieve this, it is essential to calibrate models for both computation and energy.


Tasks:

  • Literature review: Conduct a literature review on high-performance computing (HPC) simulators, particularly those focused on predicting the execution time of MPI applications on multi-core architectures. Based on the review, identify the most suitable simulator.
  • Model development: Develop a mathematical model to simulate CPU power consumption during the execution of MPI applications (a toy version follows this list). Factors such as CPU usage, memory access time, and communication overhead need to be taken into account during calibration.
  • Validation and verification: Validate the proposed models by comparing simulated power consumption outcomes with empirical measurements obtained from real-world MPI applications running on HPC systems.
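
A toy version of such a power model, integrating utilization-dependent CPU power over the simulated phases of one rank (all constants and timings invented):

```python
# Toy power/energy model: P(u) = P_idle + u * (P_max - P_idle), integrated
# over the simulated phases of one MPI rank. All constants and phase timings
# are invented; calibrating them against real measurements is the actual task.
P_IDLE, P_MAX = 40.0, 120.0           # watts per CPU

def power(utilization):
    return P_IDLE + utilization * (P_MAX - P_IDLE)

# (phase, duration in seconds, CPU utilization) of one simulated rank
phases = [
    ("compute", 12.0, 0.95),
    ("mpi_wait", 3.0, 0.10),          # blocked in a receive: mostly idle
    ("compute", 8.0, 0.90),
    ("collective", 2.0, 0.35),
]

energy_j = sum(dt * power(u) for _name, dt, u in phases)
total_s = sum(dt for _name, dt, _u in phases)
print(f"predicted energy per rank: {energy_j:.0f} J over {total_s:.0f} s")
```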


Requirements:

  • Good background in parallel computing and  HPC systems.
  • Proficiency in  C/C++  programming languages .
  • Experience with parallel computing frameworks (MPI).

 

Contact:

ehab.saleh@lrz.de

Benchmarking of (Industrial-) IoT & Message-Oriented Middleware

DDS (Data Distribution Service) is a message-oriented middleware standard that is being evaluated at the chair. We develop and maintain DDS-Perf, a cross-vendor benchmarking tool. As part of this work, several open theses regarding DDS and/or benchmarking in general are currently available. This work is part of an industry cooperation with Siemens.

Please see the following page for currently open positions.

Note: If you are conducting industry or academic research on DDS and are interested in collaborations, please check the open positions above or contact Vincent Bode directly.

Applied mathematics & high-performance computing

There are various topics available in the area bridging applied mathematics and high-performance computing. Please note that this will be supervised externally by Prof. Dr. Martin Schreiber (a former member of this chair, now at Université Grenoble Alpes).

This is just a selection of some topics to give some inspiration:

(MA=Master in Math/CS, CSE=Comput. Sc. and Engin.)

  • HPC tools:
    • Automated Application Performance Characteristics Extraction
    • Portable performance assessment for programs with flat performance profile, BA, MA, CSE
  • Projects targeting Weather (and climate) forecasting
    • Implementation and performance assessment of ML-SDC/PFASST in OpenIFS (collaboration with the European Center for Medium-Range Weather Forecast), CSE, MA
    • Efficient realization of fast Associated Legendre transformations on GPUs (collaboration with the European Center for Medium-Range Weather Forecast), CSE, MA
    • Fast exponential and implicit time integration, BA, MA, CSE
    • MPI parallelization for the SWEET research software, MA, CSE
    • Semi-Lagrangian methods with Parareal, CSE, MA
    • Non-interpolating Semi-Lagrangian Schemes, CSE, MA
    • Time-splitting methods for exponential integrators, CSE, MA
    • Machine learning for non-linear time integration, CSE, MA
    • Exponential integrators and higher-order Semi-Lagrangian methods

  • Ocean simulations:
    • Porting the NEMO ocean simulation framework to GPUs with a source-to-source compiler
    • Porting the Croco ocean simulation framework to GPUs with a source-to-source compiler
       
  • Health science project: Biological parameter optimization
    • Extending a domain-specific language with time integration methods
    • Performance assessment and improvements for different hardware backends (GPUs / FPGAs / CPUs)

If you're interested in any of these projects, or if you are searching for other projects in this area, please drop me an email for further information.

BA/MA/IDP: Designing, developing and evaluating tests for automatic assessment of low-level programming projects

We would like to improve and partially automate the evaluation of project submissions for system-oriented programming in C and performance optimization. For this, we want to use an existing system that is already used for a similar task and extend it to our projects.
The focus can be adjusted more toward didactics or toward low-level programming, according to personal interests and expertise.

Tasks:

  • Analysis of existing project assignments, correction schemes, and learning objectives, and derivation of requirements for automated tests
  • Design of tests for a group of project tasks
  • Implementation of the tests (a minimal sketch of one such check follows this list)
  • Evaluation of the implemented system
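
A minimal sketch of one such automated check: compile a C submission, run it on a test case with a timeout, and compare the output; paths, inputs, and the expected output are placeholders:

```python
# Minimal sketch of an automated check for a C submission: compile, run one
# test case with a timeout, compare stdout. A real harness would add
# sanitizers, resource limits, and per-criterion feedback for grading.
import subprocess

def check_submission(src="submission.c", stdin_data="3 4\n", expected="7\n"):
    build = subprocess.run(
        ["gcc", "-O2", "-Wall", "-o", "submission", src],
        capture_output=True, text=True,
    )
    if build.returncode != 0:
        return f"compile error:\n{build.stderr}"
    try:
        run = subprocess.run(
            ["./submission"], input=stdin_data, capture_output=True,
            text=True, timeout=5,
        )
    except subprocess.TimeoutExpired:
        return "timeout (possible endless loop)"
    if run.stdout != expected:
        return f"wrong output: got {run.stdout!r}, expected {expected!r}"
    return "passed"

print(check_submission())
```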

Contact:
Anna Mittermair (anna.mittermair@tum.de)