Wissenschaftliches Seminar Medientechnik (Scientific Seminar Media Technology)

Lecturer (contributors)
Scope: 3 SWS (semester hours per week)
Semester: Summer semester 2024
Position in curricula: See TUMonline
Dates: See TUMonline



Participants deepen their knowledge in the area of media technology. After completing the course, students are able to work scientifically on a topic in the area of media technology, write a scientific paper, and give a scientific talk.


Changing focus topics from the field of media technology, with an emphasis on current research topics and new technologies. Each participant independently prepares a scientific talk (30 min.), based on provided entry points into the literature, and writes a short summary of the most important content (see the guidelines for written reports, "Leitfaden Ausarbeitungen"). The learning objectives are to become familiar with the fundamentals of scientific methodology and to engage with talk and presentation techniques. A special feature of the media technology seminar is a dedicated workshop on presentation technique: besides general rhetorical tools and tips on preparing a talk and using media, participants above all get plenty of opportunity to practice realistic presentation situations.

Prerequisites

No specific requirements

Teaching and Learning Methods

Every participant works on his or her own topic. The goal of the seminar is to train and enhance the ability to work independently on a scientific topic. Every participant is supervised individually by an experienced researcher. The supervisor helps the student get started, provides links to the relevant literature, and gives feedback and advice on draft versions of the paper and the presentation slides. The main teaching methods are:
- Computer-based presentations by the students
- Independent work with high-quality, recent scientific publications

Assessment

- Scientific paper (30%)
- Interaction with the supervisor and working attitude (20%)
- Presentation (30 minutes) and discussion (15 minutes) (50%)
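As a quick arithmetic illustration of how the three weighted components above combine, the sketch below computes a final score on a 0-100 scale; the individual component scores are invented for illustration only.

```python
# Hypothetical component scores on a 0-100 scale (invented for illustration).
paper = 80.0         # scientific paper, weight 30%
interaction = 90.0   # supervisor interaction and working attitude, weight 20%
presentation = 70.0  # presentation and discussion, weight 50%

# Weighted sum of the three assessment components.
final_score = 0.30 * paper + 0.20 * interaction + 0.50 * presentation
print(final_score)  # 77.0
```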

Recommended Literature

Recommended literature will be announced in the seminar.


Umbrella topic for SS 2024: "Efficiency Matters: Innovations in Deep Learning Optimization"

The kick-off meeting for the seminar takes place on 17.04.2024 at 13:15 in Seminar Room 0406.

Attendance is mandatory to secure a fixed place in the course!


The media technology scientific seminar this semester is focused on Efficient Deep Learning.

The aim is to explore novel strategies and advances in efficient deep learning that improve model performance while reducing computation, memory, and energy requirements across various application domains.

More details will be provided during the kick-off meeting.

In recent years, 3D sensors like LiDAR and RGB-D cameras have been widely applied to capture 3D point clouds of the surrounding world. The goal of semantic segmentation on a point cloud is to assign a semantic label (e.g., car, pedestrian) to each point, thereby enabling machines to understand and interpret the 3D environment. This is a fundamental task for many real-world applications, such as autonomous driving, robotics, and augmented reality. However, a large-scale 3D scan can contain hundreds of millions of points. In addition, 3D point clouds are unstructured and non-uniform, which prevents the direct application of standard convolutional neural networks (CNNs). Thus, an accurate 3D semantic segmentation algorithm with high computational and memory efficiency is essential for time-sensitive applications.

To overcome the above-mentioned challenges, RandLA-Net [1] introduces random sampling and proposes an efficient feature aggregation module to optimize both time and memory consumption. PointNeXt [2] revisits the earlier PointNet models and proposes multiple modules and training tricks to improve performance and speed; it additionally provides different model sizes, enabling a trade-off between accuracy and efficiency. SPT [3] introduces a fast algorithm to partition a large-scale point cloud into multi-scale patches, after which a self-attention network aggregates features between neighboring patches, achieving a significant boost in inference speed. Most recently, Zhu et al. [4] propose a non-parametric network for few-shot 3D semantic segmentation, which additionally optimizes data efficiency and training speed.
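The random sampling idea at the core of RandLA-Net [1] can be sketched in a few lines. The snippet below (plain NumPy, synthetic data) illustrates only uniform point downsampling, not the paper's full feature aggregation module:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic point cloud: N points with (x, y, z) coordinates.
points = rng.uniform(-10.0, 10.0, size=(100_000, 3)).astype(np.float32)

def random_downsample(pts: np.ndarray, n_out: int) -> np.ndarray:
    """Uniformly sample n_out points without replacement.

    Unlike farthest-point sampling (roughly O(N^2)), this runs in O(N),
    which is what makes it attractive for very large scans.
    """
    idx = rng.choice(pts.shape[0], size=n_out, replace=False)
    return pts[idx]

subsampled = random_downsample(points, 4_096)
print(subsampled.shape)  # (4096, 3)
```

The trade-off discussed in [1] is that random sampling may discard important points, which is why it is paired with a local feature aggregation module that enlarges the receptive field of the retained points.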

During the seminar, we will review state-of-the-art algorithms for efficient point cloud segmentation. You will analyze and compare different models in terms of model architecture, training strategy, loss function, and related design choices. Furthermore, you will explore future directions for improving model efficiency.


Supervision: Zhifan Ni (zhifan.ni@tum.de)


[1] Q. Hu et al., “RandLA-Net: Efficient Semantic Segmentation of Large-Scale Point Clouds,” in CVPR 2020.

[2] G. Qian et al., “PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies,” in NeurIPS 2022.

[3] D. Robert et al., “Efficient 3D Semantic Segmentation with Superpoint Transformer,” in ICCV 2023.

[4] X. Zhu et al., “No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation,” in CVPR 2024.

Most deep learning approaches to vision-based 6D pose estimation and tracking are not real-time capable on account of complex neural architectures. Such architectures are also usually difficult and time-consuming to train.

This seminar contribution should review works that avoid these drawbacks and achieve efficiency in both training and inference. Also of interest are works that require relatively little data to train, or that rely on synthetic data instead of real data, which must be laboriously annotated.

Supervision: Dr. Rahul Chaudhari (rahul.chaudhari@tum.de)


[1] Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., & Birchfield, S. (2018). Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790.

[2] Beedu, A., Alamri, H., & Essa, I. (2022). Video based object 6D pose estimation using transformers. arXiv preprint arXiv:2210.13540.

[3] Periyasamy, A. S., Amini, A., Tsaturyan, V., & Behnke, S. (2023). YOLOPose V2: Understanding and improving transformer-based 6D pose estimation. Robotics and Autonomous Systems, 168, 104490.

Wearable devices require small and energy-efficient hardware for computation-intensive tasks like neural network inference. They often have to run on batteries, which drastically limits their runtime if the hardware is not adapted. Optimizing or shrinking the neural network alone is often not enough.

In this topic, the student will survey recent hardware developments in this context. The works in [1] and [2] serve as starting points: [1] presents hardware for classification tasks, while [2] presents a hardware accelerator for feature extraction.
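One principle behind energy-efficient classifiers such as [1] is exploiting temporal sparsity: if consecutive input samples barely change, the hardware can skip most updates. The hedged sketch below illustrates only this general idea on a synthetic, spike-like signal; it does not reproduce the architecture of [1].

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic spike-like signal: mostly flat, with occasional deflections,
# so consecutive samples are usually (nearly) identical.
signal = np.zeros(1000, dtype=np.float32)
spike_pos = rng.choice(1000, size=30, replace=False)
signal[spike_pos] = rng.normal(0.0, 1.0, size=30).astype(np.float32)

# Temporal-delta view: an event-driven classifier only needs to update
# its state when the input actually changes.
deltas = np.diff(signal, prepend=signal[:1])
active = np.count_nonzero(np.abs(deltas) > 1e-3)

print(f"updates needed: {active} of {signal.size}")
```

Real ECG data is similarly dominated by near-constant segments between heartbeats, which is what makes this kind of sparsity-aware hardware attractive.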

Supervision: Lars Nockenberg (lars.nockenberg@tum.de)


[1] M. Jobst et al., "ZEN: A flexible energy-efficient hardware classifier exploiting temporal sparsity in ECG data," 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, Korea, Republic of, 2022, pp. 214-217, doi: 10.1109/AICAS54282.2022.9869958.

[2] L. Guo et al., "A Low-Power Hardware Accelerator of MFCC Extraction for Keyword Spotting in 22nm FDSOI," 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hangzhou, China, 2023, pp. 1-5, doi: 10.1109/AICAS57966.2023.10168587.

In today's interconnected world, the demand for precise indoor localization has surged across various domains, including smart buildings, healthcare, retail, and transportation. However, traditional localization methods often struggle to provide accurate results in complex indoor environments.

Deep learning techniques have emerged as a popular solution due to their ability to learn complex patterns from data. Nonetheless, the widespread adoption of deep learning for indoor localization is hindered by efficiency challenges, including computational complexity, energy consumption, and deployment constraints.

In this seminar, we will explore recent developments in the field of indoor localization leveraging deep learning methods, with a specific focus on efficiency. The discussion will cover topics such as novel neural network architectures, model compression techniques, hardware optimizations, and real-world applications.
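As a minimal, hedged sketch of one such model compression technique, the snippet below applies symmetric post-training quantization of a weight matrix to int8; the weights are synthetic and do not come from any real localization model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic float32 weight matrix standing in for one layer of a
# localization network (illustration only).
weights = rng.normal(0.0, 0.1, size=(256, 128)).astype(np.float32)

# Symmetric linear quantization to int8: map [-max|w|, +max|w|] to [-127, 127].
scale = np.abs(weights).max() / 127.0
q_weights = np.round(weights / scale).astype(np.int8)

# Dequantize to estimate the approximation error.
deq = q_weights.astype(np.float32) * scale
max_err = np.abs(weights - deq).max()

print(weights.nbytes // q_weights.nbytes)  # 4: int8 storage is 4x smaller
print(max_err <= scale / 2 + 1e-7)         # True: error bounded by half a step
```

The 4x storage reduction (and the corresponding drop in memory bandwidth) is what makes quantization attractive for the resource-constrained deployments discussed here; papers such as [4] build on compression-aware training of this kind.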

We will examine the challenges of deploying efficient deep learning models for indoor localization, including scalability and resource constraints, and discuss potential future directions and research opportunities in this rapidly evolving field.


Supervision:    Majdi Abdmoulah (Majdi.abdmoulah@tum.de)


[1] C. -H. Hsieh, J. -Y. Chen and B. -H. Nien, "Deep Learning-Based Indoor Localization Using Received Signal Strength and Channel State Information," in IEEE Access, vol. 7, pp. 33256-33267, 2019.

[2] Z. Chen, M. I. AlHajri, M. Wu, N. T. Ali and R. M. Shubair, "A Novel Real-Time Deep Learning Approach for Indoor Localization Based on RF Environment Identification," in IEEE Sensors Letters, vol. 4, no. 6, pp. 1-4, June 2020, Art no. 7002504.

[3] L. Zhang, Y. Li, Y. Gu and W. Yang, "An efficient machine learning approach for indoor localization," in China Communications, vol. 14, no. 11, pp. 141-150, Nov. 2017

[4] L. Wang, S. Tiku and S. Pasricha, "CHISEL: Compression-Aware High-Accuracy Embedded Indoor Localization With Deep Learning," in IEEE Embedded Systems Letters, vol. 14, no. 1, pp. 23-26, March 2022.


Human-Object Interaction (HOI) detection focuses on discerning how individuals interact with objects, a crucial capability for collaborative robots [1]. Existing HOI detectors, however, frequently suffer from model inefficiencies and unreliable predictions, constraining their effectiveness in real-world applications.

This seminar topic aims to explore how an effective HOI detector can provide task planning solutions for collaborative robots, and where model optimization can improve detection results [2]. In addition, large language models can serve as directly accessible tools to assist in task formulation for collaborative robots [3]. How to combine large language models with current HOI detection algorithms for more reliable prediction is also a focus of exploration.

During the seminar, your tasks will include researching, gathering, and conducting a comparative analysis of the above-mentioned methods. Furthermore, you will be expected to comprehensively explain the fundamental principles underpinning these methods, assess and contrast their effectiveness, and ultimately deliver a state-of-the-art summary. Additionally, your presentation should cover future prospects and potential areas for further research within this domain.


Supervision:  Yuankai Wu (yuankai.wu@tum.de)  


[1] Y. Wu, R. Messaoud, X. Chen, A.-C. Hildebrandt, M. Baldini, C. Patsch, H. Sadeghian, S. Haddadin, and E. Steinbach, "Vision-driven Collaborative Mobile Robotic Human Assistant System for Daily Living Activities," 22nd IFAC World Congress, 2023.

[2]  J. Lim, V. M. Baskaran, J. M. -Y. Lim, K. Wong, J. See and M. Tistarelli, "ERNet: An Efficient and Reliable Human-Object Interaction Detection Network," in IEEE Transactions on Image Processing, vol. 32, pp. 964-979, 2023.

[3] J. Gao, K. -H. Yap, K. Wu, D. T. Phan, K. Garg and B. S. Han, "Contextual Human Object Interaction Understanding from Pre-Trained Large Language Model," ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 13436-13440.