Multimodal Robot Learning from Demonstration for Laboratory Automation
Robotics, Robot Learning, Multimodal Sensing, Manipulation, Lab Automation
Description
Motivation
Chemical laboratories rely on a range of automated machines for tasks like liquid handling or centrifugation. Yet many repetitive procedures are still performed by human chemists, particularly when dexterous manipulation or adaptation to changing conditions is required. Fully automating these tasks through traditional programming is impractical, since lab setups or execution protocols can vary significantly over time or between sites.
Learning from demonstration offers a compelling alternative. Instead of explicitly programming every motion and decision, an expert demonstrates a task and the robot learns a generalizable policy from a set of demonstrations. Combined with multimodal sensing, where data from cameras, tactile sensors and the robot's own proprioception are fused to understand the scene and guide execution, this enables a system that can adapt to variations and recover from disturbances.
The approach is not limited to laboratory tasks and can generalize to other domains such as manufacturing and assembly.
System
The setup consists of a 4-DOF SCARA robot arm equipped with a parallel gripper, camera(s), tactile sensors and laboratory equipment. A demonstration setup already exists and is capable of showcasing the robot's manipulation capabilities in a laboratory environment.
Research Project
In this project, you will build on the existing demonstration setup and advance it towards robust learning from demonstration. This includes improving the physical setup, defining meaningful laboratory tasks to be learned and developing a multimodal sensing and learning pipeline. You will work with state-of-the-art models and algorithms in robot learning, adapt/tune them to the SCARA platform and evaluate how reliably new tasks can be acquired from a limited number of expert demonstrations. The work combines elements of robotics, machine learning, sensor fusion, and mechatronics.
Goals
Improve the existing demonstration setup and define a set of representative laboratory tasks for learning.
Improve and further develop a multimodal sensing pipeline that fuses data from cameras, robot proprioception and (possibly) tactile sensors to inform task execution.
Implement and tune learning-from-demonstration algorithms that allow the robot to acquire new tasks from expert demonstrations.
Evaluate robustness: the learned policies should handle (small) variations in conditions and recover from disturbances during execution.
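As a minimal illustration of the sensing side of these goals, a fused observation can be as simple as concatenating per-modality feature vectors before they are passed to a learned policy. This is a sketch only; the feature dimensions and function names below are placeholders, not part of the existing setup:

```python
import numpy as np

def fuse_observation(image_feat, joint_pos, joint_vel, tactile=None):
    """Concatenate per-modality features into one observation vector.
    Missing modalities (e.g. tactile) are simply skipped."""
    parts = [image_feat, joint_pos, joint_vel]
    if tactile is not None:
        parts.append(tactile)
    return np.concatenate(parts)

# Hypothetical dimensions: 64-d image embedding, 4 joints (SCARA), 8 tactile taxels.
obs = fuse_observation(np.zeros(64), np.zeros(4), np.zeros(4), np.zeros(8))
print(obs.shape)  # (80,)
```

In practice the image features would come from a pretrained visual encoder and the proprioceptive signals from the robot driver; the fusion step itself can stay this simple for early-concatenation baselines.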
Prerequisites
Interest in robotics, robot learning, sensors and a mechatronics-oriented approach to problem solving.
Good programming skills.
Excited to work on both hardware and software.
Prior experience with any of the following is a plus: ROS, robot learning, manipulation, 3D printing and CAD, deep learning frameworks.
If you are excited about the topic but don't check every box, feel free to reach out anyway!
Contact
valdrin.aslani@tum.de
Supervisor:
Immersive Remote Inspection with a Legged Robot Using Learning-Based Stereo Vision
Robotics, Computer Vision, Machine Learning, Human-Robot Interaction, Communication Networks
Description
Motivation
Legged robotic platforms (robot dogs) enable remote inspections in environments that may be unstructured or hazardous to humans. Their ability to navigate rough terrain makes them more versatile than wheeled alternatives. Equipping a highly mobile quadruped with a stereoscopic vision system and linking it to an operator's head-mounted display (HMD) creates a telepresence system that offers a natural first-person view of the remote site, while keeping the human operator safe and preserving their situational awareness.
One central challenge of this approach is latency. Any delay between the operator's head movement and the corresponding visual update on the HMD breaks the sense of presence and can lead to motion sickness. In practice, this can heavily limit uninterrupted teleoperation time. Addressing this requires intelligent compensation strategies that anticipate and mask transmission delays.
System
The platform consists of a quadruped robot carrying a mechanically actuated stereoscopic camera system, which is wirelessly linked to the operator's HMD. The camera system mirrors the operator's head orientation in real time, providing a natural first-person perspective of the remote environment.
Research Project
In this project, you will develop and evaluate latency compensation techniques for the stereoscopic vision system. By exploiting optical and geometric properties of the camera system, combined with learning-based networks that predict the operator's gaze and steering intent from sensor and control data, the system should proactively prepare visual outputs before they are needed. You will integrate your approach into the existing teleoperation platform and evaluate it through user studies. The work combines elements of computer vision, machine learning, robotics, communication engineering and human-robot interaction.
Goals
Design and implement a latency compensation approach that exploits the vision system's design and develop learning-based intention prediction algorithms to maintain immersion under realistic network conditions. Ideally, you will be able to demonstrate a measurable reduction of motion sickness and visual discomfort compared to an uncompensated baseline. A central goal is to validate that operators can more comfortably perform continuous remote inspections for extended periods of time.
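A common baseline against which learned intention prediction is measured is simple motion extrapolation: predict where the operator's head will be one latency interval ahead under a constant-velocity assumption, and prepare the view for that pose. This is a generic baseline sketch, not the project's actual compensation method:

```python
import numpy as np

def predict_orientation(curr, prev, dt, latency):
    """Extrapolate head yaw/pitch (degrees) forward by `latency` seconds,
    assuming constant angular velocity between the last two samples."""
    velocity = (curr - prev) / dt   # deg/s
    return curr + velocity * latency

# Head yawed 2 degrees over the last 20 ms sample; predict 100 ms ahead.
pred = predict_orientation(np.array([10.0, 0.0]), np.array([8.0, 0.0]),
                           dt=0.02, latency=0.1)
# pred == array([20., 0.])
```

Learning-based predictors would replace the constant-velocity term with a model conditioned on gaze, control inputs and task context, but any such model should at least beat this baseline under the same network conditions.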
Prerequisites
Interest in robotics, computer vision, and machine learning.
Good programming skills.
Excited to work with real hardware (robot platform, cameras, HMD).
Prior experience with the following is a plus: VR systems, ROS, stereo vision, deep learning frameworks.
If you are excited about the topic but don't check every box, feel free to reach out anyway!
Contact
valdrin.aslani@tum.de
Supervisor:
Tutor for Software Engineering Lab Course
Description
We're looking for tutors for the Software Engineering lab this summer semester. The position is for 6 hours per week. Responsibilities include:
- Assisting students with their homework during lab and tutorial sessions
- Answering students' questions on Moodle
Prerequisites
Ideally, we're looking for a student who passed the course with a good grade. However, any application is welcome if you have experience with software engineering. If you're interested, please send me an email. We expect the tutor to attend the following lab sessions in person:
Date | Time
-----------|-------------------
22.04.2026 | 14:00–16:15
29.04.2026 | 11:30–13:00, 14:00–16:15
06.05.2026 | 14:00–16:15
13.05.2026 | 11:30–13:00, 14:00–16:15
20.05.2026 | 14:00–16:15
27.05.2026 | 11:30–13:00, 14:00–16:15
03.06.2026 | 14:00–16:15
10.06.2026 | 11:30–13:00, 14:00–16:15
17.06.2026 | 14:00–16:15
24.06.2026 | 11:30–13:00, 14:00–16:15
01.07.2026 | 14:00–16:15
Supervisor:
Measurement and Modeling of Energy Consumption in Neural Image and Video Compression
Description
In today's digital era, image and video content dominate online traffic, accounting for the majority of global data transmission [1]. Efficient compression is therefore essential for delivering high-quality content under limited bandwidth and storage constraints. Recently, Deep Neural Networks (DNNs) have emerged as a powerful alternative to traditional compression methods, leveraging nonlinear representations to improve compression efficiency and visual quality [1].
However, beyond compression performance, practical deployment requires careful consideration of computational cost and energy consumption, especially on resource-constrained devices such as smartphones. Importantly, neither the number of parameters nor the number of operations in a DNN directly reflects its true energy consumption. Models with fewer parameters or operations may still consume more energy due to hardware-specific factors such as memory access patterns [2].
This project aims to systematically measure and model the runtime and energy consumption of DNNs across heterogeneous devices, including desktop GPUs, laptops, and/or smartphones. The objectives include refining and extending an existing energy measurement setup to ensure reliability and reproducibility, analyzing the relationship between model structure, runtime, and energy consumption, and designing predictive models for runtime and energy estimation across devices.
Existing approaches in the literature include device-specific [3] and device-adaptive [4] runtime estimation methods. Students are encouraged to explore and extend these approaches and to propose novel runtime and energy modeling techniques for the generalization across devices.
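As a toy illustration of feature-based runtime modeling (in the spirit of, but much simpler than, the predictors in [3] and [4]), one can fit a least-squares model from per-model features such as FLOPs and memory accesses to measured runtimes. All numbers below are made up for the sketch:

```python
import numpy as np

# Hypothetical per-model features [GFLOPs, GB of memory accesses]
# and corresponding measured runtimes in milliseconds.
X = np.array([[1.0, 0.5], [2.0, 0.8], [4.0, 2.0], [8.0, 3.5]])
y = np.array([1.2, 2.0, 4.5, 8.8])

# Least-squares fit: runtime ~ a*flops + b*mem + c
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_runtime(flops, mem):
    """Predict runtime (ms) for an unseen model from its features."""
    return coef[0] * flops + coef[1] * mem + coef[2]
```

Real predictors need far richer features (operator types, memory access patterns, device state) and per-device calibration, which is exactly the gap this project targets; the same fitting idea extends from runtime to energy once reliable energy measurements are available.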
[1] J. S. Gomes, M. Grellert, F. L. L. Ramos and S. Bampi, "End-to-End Neural Video Compression: A Review," IEEE Open Journal of Circuits and Systems, vol. 6, pp. 120-134, 2025.
[2] X. Yang, J. Kwon, Y. Li and Y. Chen, "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning," CVPR, 2017.
[3] L. L. Zhang, S. Han, J. Wei, N. Zheng, T. Cao, Y. Yang and Y. Liu, "nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices," MobiSys, pp. 81-93, 2021.
[4] C. Feng, L. L. Zhang, Y. Liu, J. Xu, C. Zhang, Z. Wang, T. Cao, M. Yang and H. Tan, "LitePred: Transferable and Scalable Latency Prediction for Hardware-Aware Neural Architecture Search," NSDI, 2024.
Prerequisites
Strong coding skills in Python and common ML libraries; background in machine learning; motivation for research and experimentation. Experience with mobile or embedded platforms is a plus.
Contact
serdar.caglar@tum.de
Supervisor:
Working Student for 3D human motion capture, character generation and rigging
Description
The work involves setting up and operating a human motion (body-hands-face) capture system (https://www.rokoko.com/products/smartsuit-pro). The student will build a dataset of a large variety of full-body animations. This involves generating a large list of daily animations, planning and "acting" them, and running the MoCap software in parallel to save the animation data.
Furthermore, the student will use text-to-3D asset generation models to build a library of humanoid characters. These characters then need to be rigged using e.g., Blender. The animations recorded above should then be re-targeted to the rigged characters.
The student employment will be 8-16 hours/week for a couple of months, with the possibility for extension.
Requirements: Prior experience with 3D topics, i.e., 3D software (Blender) and character rigging; experience with motion capture hardware and software is optional but nice to have.
Supervisor:
Geometric 3D Gaussian Splatting Compression
Description
3D Gaussian Splatting (3DGS) has demonstrated strong performance in high-quality novel view synthesis and real-time rendering by representing scenes as dense sets of Gaussians with associated attributes [1]. This representation captures fine geometric detail and view-dependent appearance efficiently at render time, contributing to its practical success. However, trained 3DGS models often contain millions of Gaussians, resulting in substantial storage and memory requirements that limit scalability and deployment on resource-constrained systems.
This project focuses on exploiting geometric relationships among Gaussians (i.e., primitives) to reduce redundancy in the attribute space. The goal is to apply sampling and sparsification techniques to decrease the number of stored primitives while preserving perceptual and structural fidelity. The proposed methodologies will be evaluated against existing compression approaches [2]. The project aims to disseminate the results in highly recognized scientific venues; hence, the applicants should be motivated to do research.
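The simplest sparsification baseline the proposed methods would be compared against is importance-based pruning, e.g. keeping only the most opaque Gaussians and dropping their attributes consistently. The array layout below is an assumption for the sketch, not the actual 3DGS storage format:

```python
import numpy as np

def prune_gaussians(means, opacities, attrs, keep_ratio=0.5):
    """Keep the `keep_ratio` fraction of Gaussians with the highest opacity,
    dropping means, opacities and remaining attributes consistently."""
    k = max(1, int(len(opacities) * keep_ratio))
    idx = np.argsort(opacities)[-k:]   # indices of the k most opaque primitives
    return means[idx], opacities[idx], attrs[idx]

# Toy scene: 1000 Gaussians with 3-d means and 48-d appearance attributes.
rng = np.random.default_rng(0)
means = rng.normal(size=(1000, 3))
ops = rng.random(1000)
attrs = rng.normal(size=(1000, 48))
m, o, a = prune_gaussians(means, ops, attrs, keep_ratio=0.1)
```

Geometry-aware methods would replace the opacity criterion with scores that account for inter-primitive redundancy (e.g. neighborhood coverage), which is where this project's contribution lies.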
[1] Kerbl, Bernhard, et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM Trans. Graph. 42.4. 2023.
[2] Bagdasarian, Milena T., et al. "3DGS.zip: A Survey on 3D Gaussian Splatting Compression Methods." Computer Graphics Forum. Vol. 44. No. 2. 2025.
Prerequisites
Python, PyTorch, graphs, Gaussian Splatting, motivation for research
Contact
cem.eteke@tum.de
Supervisor:
Content Safety for Generative Multimedia: Automated Evaluation and Re-Prompting for Age-Appropriate AI-Generated Content
Description
Generative AI models (e.g., LLMs, diffusion models) are increasingly used to create multimodal content (text + images) for applications like interactive storytelling, educational tools, and digital media. However, ensuring that generated content is safe, unbiased, and age-appropriate remains a critical challenge. Manual moderation is unscalable, and existing automated filters often lack contextual understanding or multimodal reasoning.
This thesis explores the development of automated pipelines to evaluate and refine AI-generated content, with a focus on:
- Real-time safety assessment of text-image pairs.
- Automatic re-prompting to guide models toward compliant outputs.
- Adaptability to diverse use cases (e.g., children’s toys, educational platforms).
Objectives
- Multimodal Safety Evaluation:
  - Investigate state-of-the-art metrics (e.g., CLIP-based similarity, toxicity scores, emotional valence) to detect unsafe or age-inappropriate content in text-image pairs.
  - Develop a lightweight, interpretable scoring system combining:
    - Text analysis (e.g., Perspective API, custom fine-tuned classifiers).
    - Image analysis (e.g., NSFW detectors, aesthetic/emotional classifiers).
    - Cross-modal alignment (e.g., does the image match the text’s intent and safety constraints?).
- Re-Prompting Strategies:
  - Design adaptive prompting techniques to iteratively refine outputs (e.g., using reinforcement learning or constrained decoding).
  - Explore few-shot learning to generalize safety rules across domains (e.g., fairy tales vs. scientific explanations).
- Benchmarking and Evaluation:
  - Curate a multimodal dataset of edge cases (e.g., subtle biases, ambiguous contexts).
  - Compare against human annotations and existing tools (e.g., Google’s Perspective API, LAION filters).
  - Optimize for latency and computational efficiency (critical for embedded/edge devices).
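The lightweight, interpretable scoring system mentioned above could start from something as simple as a weighted combination of per-modality risk scores. The score names and weights here are placeholders for illustration, not a proposed design:

```python
def safety_score(text_toxicity, image_nsfw, alignment, weights=(0.4, 0.4, 0.2)):
    """Combine per-modality risk scores (each in [0, 1]) into one overall
    safety score in [0, 1]; higher means safer. Low text-image alignment
    is treated as a risk, since mismatched pairs can evade unimodal filters."""
    w_text, w_img, w_align = weights
    risk = w_text * text_toxicity + w_img * image_nsfw + w_align * (1.0 - alignment)
    return 1.0 - risk

# Benign, well-aligned pair scores 1.0; maximally risky pair scores 0.0.
safe = safety_score(text_toxicity=0.0, image_nsfw=0.0, alignment=1.0)
unsafe = safety_score(text_toxicity=1.0, image_nsfw=1.0, alignment=0.0)
```

A real pipeline would replace each input with a classifier output (e.g. a toxicity model for text, an NSFW detector for images, CLIP similarity for alignment) and could trigger re-prompting whenever the combined score falls below an application-specific threshold.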