Media Technology Scientific Seminar
| Lecturer (assistant) | |
|---|---|
| Number | 0820906570 |
| Type | advanced seminar |
| Duration | 3 SWS |
| Term | Summer semester 2026 |
| Language of instruction | German |
| Position within curricula | See TUMonline |
| Dates | See TUMonline |
- 15.04.2026 13:15-14:45 0406, Seminarraum
- 22.04.2026 13:15-14:45 0406, Seminarraum
- 06.05.2026 13:15-14:45 0406, Seminarraum
- 13.05.2026 13:15-14:45 0406, Seminarraum
- 20.05.2026 13:15-14:45 0406, Seminarraum
- 27.05.2026 13:15-14:45 0406, Seminarraum
- 03.06.2026 13:15-14:45 0406, Seminarraum
- 10.06.2026 13:15-14:45 0406, Seminarraum
- 17.06.2026 13:15-14:45 0406, Seminarraum
- 24.06.2026 13:15-14:45 0406, Seminarraum
- 01.07.2026 13:15-14:45 0406, Seminarraum
- 08.07.2026 13:15-14:45 0406, Seminarraum
- 15.07.2026 13:15-14:45 0406, Seminarraum
Admission information
Objectives
Description
Prerequisites
Teaching and learning methods
The main teaching methods are:
- Computer-based presentations by the student
- The students mainly work with high-quality, recent scientific publications
Examination
- Interaction with the supervisor and working attitude (20%)
- Presentation (30 minutes) and discussion (15 minutes) (50%)
Recommended literature
Links
Embodied world models are emerging as a central paradigm for robotic learning, enabling agents to acquire predictive, action-conditioned representations of their environment that support planning and control [1]. In the context of learning from demonstration (LfD), these models offer a principled way to infer latent dynamics and behavioral structure directly from expert trajectories, reducing reliance on explicit reward design and large-scale interaction. By integrating perception and action within a shared predictive framework, embodied world models allow robots to generalize demonstrated behaviors to novel situations and to reason about the consequences of their actions in both real and simulated environments.
Recent advances have focused on combining self-supervised representation learning with generative world modeling, particularly through video-based approaches that leverage scalable visual data. These models learn from offline demonstrations by predicting future observations conditioned on actions, enabling imitation, trajectory synthesis, and planning in latent space. Key challenges include extracting actionable structure from passive demonstrations, maintaining temporal consistency and causal reasoning over long horizons, inferring dense rewards from sparse signals, embodiment-agnostic learning, and achieving efficient inference suitable for real-time robotic deployment. Approaches based on joint embedding predictive architectures and self-supervised visual pretraining have shown promise in addressing these challenges by learning compact, transferable representations of environment dynamics [2-4].
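To make the training setup concrete, the following is a minimal PyTorch sketch of an action-conditioned latent world model fit to offline demonstration transitions; the encoder, dynamics network, and dimensions are illustrative placeholders, not the architectures of [2-4].

```python
# Minimal sketch of an action-conditioned latent world model trained on
# offline demonstrations (illustrative; not the architecture of [2-4]).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps raw observations (here flattened features) to a compact latent."""
    def __init__(self, obs_dim=512, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent and the action."""
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

encoder, dynamics = Encoder(), LatentDynamics()
optim = torch.optim.Adam(list(encoder.parameters()) + list(dynamics.parameters()), lr=1e-3)

# One gradient step on a batch of demonstration transitions (o_t, a_t, o_t+1).
obs_t, act_t, obs_tp1 = torch.randn(32, 512), torch.randn(32, 7), torch.randn(32, 512)
z_t, z_tp1 = encoder(obs_t), encoder(obs_tp1)
optim.zero_grad()
# Predict the future latent from the current latent and action; the target is
# detached here for simplicity (JEPA-style methods avoid collapse with, e.g.,
# EMA target encoders and additional regularization).
loss = nn.functional.mse_loss(dynamics(z_t, act_t), z_tp1.detach())
loss.backward()
optim.step()
```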
During the seminar, your tasks will include surveying and critically evaluating state-of-the-art methods in embodied world models for robotic learning from demonstration. You will be expected to explain how action-conditioned predictive models can be trained from or conditioned on demonstration data, analyze the role of latent representations for policy generalization and planning, and assess how these approaches compare to classical imitation learning and reinforcement learning baselines. Your presentation should further examine applications in robotics and autonomous systems, and conclude with an outlook on open challenges and future research directions in scalable, data-driven robot learning.
Supervision: Cem Eteke (cem.eteke@tum.de)
References:
[1] Dong, Jiahua, et al. "Learning to model the world: A survey of world models in artificial intelligence." (2027).
[2] Zhou, Gaoyue, et al. "DINO-WM: World models on pre-trained visual features enable zero-shot planning." arXiv preprint arXiv:2411.04983 (2024).
[3] Assran, Mido, et al. "V-JEPA 2: Self-supervised video models enable understanding, prediction and planning." arXiv preprint arXiv:2506.09985 (2025).
[4] Zheng, Ruijie, et al. "FLARE: Robot learning with implicit world modeling." arXiv preprint arXiv:2505.15659 (2025).
Traditional 3D computer vision models are restricted to closed-set vocabularies. They can only recognize predefined categories (e.g., car, pedestrian, chair) and require massive amounts of manually annotated 3D data. However, the real physical world is infinitely diverse. Open-vocabulary 3D scene understanding addresses this bottleneck by extracting generalized knowledge from 2D foundation models (such as CLIP [1] or SAM [2], trained on billions of internet images) and mathematically lifting or distilling it into 3D space. This capability is a major leap forward for embodied AI and autonomous systems. It allows a household robot to locate a "spilled cup of coffee" or an autonomous vehicle to safely navigate around an "unusual construction debris object," even if the system has never been explicitly trained on those specific 3D items.
While OpenScene [3] established the classic distillation paradigm by projecting and fusing features from 2D foundation models into 3D point clouds, recent works [4][5] focus on transferring open-vocabulary understanding to modern Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) representations. Further works [6] integrate object functional knowledge into the 3D scene, allowing robots to interact with objects more intelligently.
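The core lifting step can be illustrated with a short sketch: per-pixel features from a 2D foundation model are back-projected onto 3D points via known camera geometry and then queried with a text embedding. This is a simplified, hypothetical version of the distillation idea, not the exact OpenScene [3] pipeline.

```python
# Minimal sketch of 2D-to-3D feature lifting and open-vocabulary querying
# (illustrative; not the exact OpenScene pipeline).
import numpy as np

def lift_features(points, pixel_feats, K, T_world_to_cam):
    """points: (N,3) world coords; pixel_feats: (H,W,D) per-pixel 2D features;
    K: (3,3) camera intrinsics; T_world_to_cam: (4,4) extrinsics.
    Returns per-point features of shape (N,D)."""
    N, (H, W, D) = len(points), pixel_feats.shape
    homog = np.concatenate([points, np.ones((N, 1))], axis=1)           # (N,4)
    cam = (T_world_to_cam @ homog.T).T[:, :3]                           # camera frame
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)                    # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)   # in front of camera & inside image
    point_feats = np.zeros((N, D), dtype=pixel_feats.dtype)
    point_feats[valid] = pixel_feats[v[valid], u[valid]]                # copy 2D feature onto 3D point
    return point_feats

def open_vocab_query(point_feats, text_emb):
    """Cosine similarity between lifted point features and a CLIP-style text embedding,
    e.g. for the query 'spilled cup of coffee'."""
    pf = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    te = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return pf @ te   # (N,) relevance score per 3D point
```

In practice, features from many views are fused (e.g., averaged) per point, but the projection-and-query structure stays the same.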
In this seminar, we will review the latest state-of-the-art papers in this specific domain. We will compare their core contributions and explore potential architectural improvements or novel application pipelines for future intelligent systems.
Supervision: Zhifan Ni (zhifan.ni@tum.de)
References:
[1] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in PMLR 139:8748-8763, 2021.
[2] N. Carion et al., “SAM 3: Segment Anything with Concepts,” in ICLR 2027.
[3] S. Peng et al., “OpenScene: 3D Scene Understanding with Open Vocabularies,” in CVPR 2023.
[4] J. Kerr et al., “LERF: Language Embedded Radiance Fields,” in ICCV 2023.
[5] M. Qin et al., “LangSplat: 3D Language Gaussian Splatting,” in CVPR 2024.
[6] C. Zhang, “Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces,” in CVPR 2026.
Recent progress in diffusion models and large foundation models has opened a promising new direction for image and video compression, especially at very low bitrates. Instead of relying only on compact latent representations, these methods exploit strong generative priors to reconstruct visually plausible content from limited transmitted information [1,2,3,4,5].
This seminar topic focuses on recent compression approaches that combine latent coding with diffusion-based or foundation-model-based reconstruction. Conditional diffusion models can use compressed latent variables to preserve image content while synthesizing texture at the decoder [1]. Large foundation models can further improve ultra-low bitrate compression by injecting multimodal knowledge, such as visual and language priors, into the reconstruction process [2]. In video compression, diffusion models can also exploit temporal context from previous frames to improve perceptual quality and decoding efficiency [3]. Recent work additionally shows that one-step diffusion can make such methods much more practical for real-time use [4].
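A schematic sketch of the decoder side illustrates the idea: starting from noise, a pretrained denoiser is run in a reverse-diffusion loop, conditioned at every step on the transmitted compressed latent. The linear noise schedule and the `eps_model(x_t, t, latent)` interface below are assumptions for illustration, not the exact samplers of [1] or [4].

```python
# Schematic sketch of decoder-side reconstruction in latent-conditioned
# diffusion compression (illustrative; not the exact samplers of [1] or [4]).
import torch

def decode_with_diffusion(eps_model, compressed_latent, steps=50, shape=(1, 3, 256, 256)):
    """Reverse-diffusion loop: the transmitted compressed latent conditions every
    denoising step, so the decoder synthesizes texture consistent with the content."""
    betas = torch.linspace(1e-4, 0.02, steps)              # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, torch.tensor([t]), compressed_latent)   # noise prediction, conditioned on latent
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise             # DDPM-style ancestral update
    return x  # reconstructed image
```

The one-step variants mentioned in [4] collapse this loop into a single denoiser call, trading some generative flexibility for decoding speed.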
During the seminar, your tasks will include studying and comparing these recent approaches, explaining their main principles, and discussing the trade-off between bitrate, distortion, perceptual quality, and decoding complexity. You should also identify open research directions, such as embedding compression, token pruning, efficient attention mechanisms, and foundation-model-driven video compression. The goal is to provide an overview of this emerging research area and its future potential.
Supervision: Serdar Caglar (serdar.caglar@tum.de)
References:
[1] Yang, R., & Mandt, S. (2023). Lossy image compression with conditional diffusion models. Advances in Neural Information Processing Systems, 36, 64971-64995.
[2] Gao, J., Huang, Z., Mao, Q., Ma, S., & Jia, C. (2026). Exploring multimodal knowledge for image compression via large foundation models. IEEE Transactions on Image Processing.
[3] Ma, W., & Chen, Z. (2026). Diffusion-based perceptual neural video compression with temporal diffusion information reuse. ACM Transactions on Multimedia Computing, Communications and Applications, 21(12), 1-22.
[4] Zhang, T., Luo, X., Li, L., & Liu, D. (2025). StableCodec: Taming one-step diffusion for extreme image compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 17379-17389).
[5] Yang, Y., & Mandt, S. (2026). Advances in Diffusion-Based Generative Compression. arXiv preprint arXiv:2601.18932.
Learned image compression has become a powerful alternative to traditional codecs by leveraging neural networks to optimize rate–distortion performance. Most existing approaches are based on autoencoder architectures combined with entropy models, which achieve strong compression efficiency but are typically tailored to specific data distributions and lack generalization capability [1].
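The optimization target behind these autoencoder-based codecs is the rate-distortion Lagrangian L = R + λD. The sketch below shows how it is typically estimated during training, with the analysis/synthesis transforms and the entropy model left as placeholders; this is illustrative, not the exact formulation of [1].

```python
# Minimal sketch of the rate-distortion objective behind autoencoder-based
# learned compression (illustrative; transforms and entropy model are placeholders).
import torch
import torch.nn.functional as F

def rate_distortion_loss(x, encoder, decoder, entropy_model, lmbda=0.01):
    """L = R + lambda * D: estimated bits of the quantized latent plus
    weighted reconstruction distortion."""
    y = encoder(x)                                    # analysis transform
    y_hat = y + (torch.round(y) - y).detach()         # straight-through quantization
    likelihoods = entropy_model(y_hat)                # P(y_hat) under a learned prior
    rate = -torch.log2(likelihoods).sum() / x.numel() # average bits per input element
    x_hat = decoder(y_hat)                            # synthesis transform
    distortion = F.mse_loss(x_hat, x)                 # distortion term (MSE here)
    return rate + lmbda * distortion
```

Sweeping `lmbda` traces out the rate-distortion curve; perceptual codecs replace or augment the MSE term with learned perceptual or adversarial losses.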
Foundation models, trained on large-scale image datasets using self-supervised objectives, offer a promising direction for more generalizable compression. By learning rich and transferable visual representations, they can be adapted to downstream compression tasks and improve robustness under distribution shifts while reducing the need for extensive retraining [2, 3].
In parallel, world models based on generative approaches, such as diffusion or autoregressive models, reinterpret compression as a conditional generation problem. Instead of transmitting all visual details, compact representations are used to guide reconstruction, where the model generates perceptually plausible content based on learned data priors [4].
During the seminar, you will survey recent advances in learned image compression, analyze the role of foundation and generative models, and compare them with classical codecs (e.g., JPEG, VVC). The seminar will conclude with a discussion of open challenges, including computational complexity, perceptual quality, and practical deployment.
Supervision: Zongxie Chen (zongxie.chen@tum.de)
References:
[1] Minnen, D., Ballé, J., & Toderici, G. D. (2018). Joint autoregressive and hierarchical priors for learned image compression. Advances in neural information processing systems, 31.
[2] Gao, J., Huang, Z., Mao, Q., Ma, S., & Jia, C. (2026). Exploring multimodal knowledge for image compression via large foundation models. IEEE Transactions on Image Processing.
[3] Shen, R., Wu, H., Zhang, W., Hu, J., & Gunduz, D. (2026, August). Compression beyond pixels: Semantic compression with multimodal foundation models. In 2026 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.
[4] Relic, L., Azevedo, R., Gross, M., & Schroers, C. (2024, September). Lossy image compression with foundation diffusion models. In European Conference on Computer Vision (pp. 303-319). Cham: Springer Nature Switzerland.
Tactile sensing is an essential yet underutilized modality in intelligent systems, providing information about physical interactions such as force, texture, and contact dynamics. Because it captures direct contact information during manipulation, it is particularly valuable for contact-rich robotics tasks. As AI systems move toward embodied intelligence, integrating tactile perception into learning frameworks is becoming increasingly important [1].
Recent foundation-model approaches to robot learning, in particular Vision-Language-Action models (VLAs), aim to learn policies that generalize across diverse tasks, objects, embodiments, and environments [2]. Tactile-informed foundation models offer a promising pathway to tactile-aware robotic manipulation, particularly for contact-rich and dexterous tasks. Integrating tactile sensing into multimodal foundation models also shows promise for physical grounding and robustness [1, 3, 4].
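One common integration pattern is to project tactile readings into the same token space as vision and language and let a transformer policy attend over all modalities jointly. The following is a hedged sketch of that pattern with made-up dimensions; it is not the architecture of [1], [3], or [4].

```python
# Hedged sketch of fusing tactile readings into a VLA-style transformer policy
# as extra input tokens (illustrative; dimensions are arbitrary).
import torch
import torch.nn as nn

class TactileAwarePolicy(nn.Module):
    def __init__(self, d_model=256, action_dim=7, tactile_dim=48):
        super().__init__()
        self.tactile_proj = nn.Linear(tactile_dim, d_model)     # tactile readings -> tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vision_tokens, language_tokens, tactile_signal):
        # vision_tokens: (B, Nv, d_model), language_tokens: (B, Nl, d_model),
        # tactile_signal: (B, Nt, tactile_dim)
        tactile_tokens = self.tactile_proj(tactile_signal)
        tokens = torch.cat([vision_tokens, language_tokens, tactile_tokens], dim=1)
        fused = self.backbone(tokens)                           # cross-modal self-attention
        return self.action_head(fused.mean(dim=1))              # pooled features -> action

policy = TactileAwarePolicy()
action = policy(torch.randn(1, 16, 256), torch.randn(1, 8, 256), torch.randn(1, 4, 48))
```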
During the seminar, you will survey recent foundation-model approaches that incorporate tactile information, analyze how tactile sensing contributes to learning contact-rich manipulation policies, and compare these approaches both to earlier vision-tactile robot learning methods (to isolate the impact of foundation models) and to robot learning without tactile input (to isolate the impact of tactile information). Your survey should also explore challenges such as multimodal fusion, data scarcity, and sensor and actuator variability, and discuss future research directions toward fully embodied, tactile-aware intelligent systems.
Supervision: Emre Faik Gökçe (emrefaik.goekce@tum.de)
References:
[1] Huang, Jialei, et al. "Tactile-VLA: Unlocking vision-language-action model's physical knowledge for tactile generalization." arXiv preprint arXiv:2507.09160 (2025).
[2] Kawaharazuka, Kento, et al., "Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications," in IEEE Access, vol. 13, pp. 162467-162504, 2025.
[3] Cheng, Zhengxue, et al. "OmniVTLA: Vision-tactile-language-action model with semantic-aligned tactile sensing." arXiv preprint arXiv:2508.08706 (2025).
[4] Zhang, Kaidi, et al. "TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation." arXiv preprint arXiv:2603.12665 (2026).
Indoor localization using radio frequency (RF) signals is a key research direction for emerging 6G intelligent systems, enabling precise positioning in GPS-denied environments such as factories, hospitals, and smart buildings. Traditional RF-based approaches treat multipath propagation — the phenomenon whereby signals reach the receiver via multiple reflected and scattered paths — as a source of interference degrading accuracy. A growing body of research instead reframes multipath components (MPCs) as rich geometric information carriers that implicitly encode the structure of the propagation environment [1].
Foundation models (large, pre-trained neural networks that can be fine-tuned for specific downstream tasks) have recently emerged as a powerful paradigm for representing wireless channels. Self-supervised pre-training on unlabeled channel impulse responses (CIRs) or channel state information (CSI) enables models to learn generalizable representations of the radio environment, substantially reducing the need for expensive labeled measurement campaigns [1]. Complementarily, digital twins (DTs) — physics-consistent virtual replicas of the radio environment generated via ray tracing — serve as world models that capture how electromagnetic waves interact with the geometry of indoor spaces and can dramatically reduce real-world data collection overhead for positioning systems [2,3].
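As a concrete illustration of such pre-training, the sketch below masks a fraction of the taps of an unlabeled channel impulse response and trains a transformer to reconstruct them; the masking ratio, dimensions, and backbone are illustrative assumptions rather than the exact recipe of [2].

```python
# Minimal sketch of self-supervised masked-reconstruction pre-training on
# channel impulse responses (in the spirit of [2]; all settings illustrative).
import torch
import torch.nn as nn

class ChannelMAE(nn.Module):
    """Mask a fraction of CIR taps and learn to reconstruct them from context."""
    def __init__(self, d_model=128, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(2, d_model)            # (real, imag) per tap -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)             # reconstruct (real, imag)

    def forward(self, cir):                           # cir: (B, n_taps, 2)
        tokens = self.embed(cir)
        mask = torch.rand(cir.shape[:2], device=cir.device) < self.mask_ratio
        tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked taps
        recon = self.head(self.encoder(tokens))
        return ((recon - cir) ** 2)[mask].mean()      # loss only on masked positions

model = ChannelMAE()
loss = model(torch.randn(16, 128, 2))                 # one batch of unlabeled CIRs
loss.backward()
```

The pre-trained encoder can then be fine-tuned with a small labeled set (or with digital-twin-generated data) for the downstream positioning task.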
During the seminar, your tasks will include surveying and critically evaluating the state of the art in foundation models and world models applied to RF-based indoor localization. You will be expected to explain the key principles underlying self-supervised pre-training on radio channel data, analyze the role of digital twin-based environment representations for training data generation and system adaptation, and assess how these approaches compare to traditional fingerprinting baselines. Your presentation should conclude with an outlook on open challenges and promising research directions in the context of 6G positioning.
Supervision: Majdi Abdmoulah (Majdi.abdmoulah@tum.de)
References:
[1] K. Witrisal et al., “High-accuracy localization for assisted living: 5G systems will turn multipath channels from foe to friend,” IEEE Signal Process. Mag., vol. 33, no. 1, pp. 59–70, Jan. 2016.
[2] J. Ott, G. Pirkl, and P. Lukowicz, “Radio Foundation Models: Pre-training Transformers for 5G-based Indoor Localization,” in Proc. IPIN, Oct. 2024. arXiv:2410.00617.
[3] L. U. Khan et al., “Digital Twin of Wireless Systems: Overview, Taxonomy, Challenges, and Opportunities,” IEEE Commun. Surveys Tuts., vol. 24, no. 4, pp. 2230–2254, 2022.
[4] A. Alkhateeb, S. Jiang, and G. Charan, “Real-time digital twins: Vision and research directions for 6G and beyond,” IEEE Commun. Mag., vol. 61, no. 11, pp. 128–134, Nov. 2023.
Eye gaze, i.e., where a person is looking, is a rich behavioral signal with applications spanning human-computer interaction, assistive technology, and extended reality. State-of-the-art wearable trackers rely on infrared cameras and corneal reflection geometry, an approach that is sensitive to illumination conditions and computationally demanding in a constrained form factor [1]. Ultrasonic sensing, in which acoustic pulses are transmitted toward the eye and echoes are collected across a sparse array of MEMS transducers, offers a compelling alternative that is illumination-robust and inherently suited to wearable integration [2].
The dominant paradigm, appearance-based deep learning, has matured considerably, with architectures evolving from single-eye CNNs to multi-stream networks that fuse face and eye features via attention mechanisms and transformer-based designs. Foundation models have recently entered this space, e.g., Gaze-LLE [3], which estimates gaze targets on top of large-scale pretrained visual encoders. A parallel line of model-based research constructs explicit world models of eye geometry: 3D morphable eye region models, built from high-quality head scans and combined with anatomy-based eyeball geometry, provide physically interpretable representations that generalize across illumination conditions and head poses [4].
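The model-based idea can be reduced to a small geometric sketch: once an anatomical eye model has been fit, the gaze ray follows from the estimated eyeball and pupil centers and can be intersected with a screen plane. The functions and numbers below are illustrative only (e.g., the angular offset between optical and visual axis is ignored).

```python
# Minimal geometric sketch of model-based gaze read-out (illustrative;
# kappa offset between optical and visual axis is ignored, values are arbitrary).
import numpy as np

def gaze_direction(eyeball_center, pupil_center):
    """Unit vector of the optical axis, given both centers in the same 3D frame."""
    d = np.asarray(pupil_center, float) - np.asarray(eyeball_center, float)
    return d / np.linalg.norm(d)

def gaze_point_on_plane(origin, direction, plane_z):
    """Intersect the gaze ray with a fronto-parallel screen plane at depth plane_z."""
    t = (plane_z - origin[2]) / direction[2]
    return origin + t * direction

g = gaze_direction(eyeball_center=[0.0, 0.0, 0.0], pupil_center=[0.002, 0.001, 0.012])
print(gaze_point_on_plane(np.zeros(3), g, plane_z=0.5))  # point of regard on a screen 0.5 m away
```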
During the seminar, you will survey and critically evaluate deep learning architectures for gaze estimation, from CNNs to foundation models. You will be expected to explain the principles underlying 3D morphable eye models and their role in physics-informed world models for data synthesis and domain adaptation. A particular focus should be a critical analysis of which architectural choices, pre-training strategies, and geometric priors transfer across sensing modalities, and which must be redesigned when the input is no longer an image but an acoustic transfer function [5]. Your presentation should conclude with an outlook on open challenges and promising research directions in the context of audio-visual gaze estimation.
Supervision: Gautam Vishwapriya (vishwapriyagautam@tum.de)
References:
[1] S. Ghosh et al., “Automatic Gaze Analysis: A Survey of Deep Learning Based Approaches,” IEEE TPAMI, vol. 46, no. 1, pp. 61-84, 2024.
[2] A. Golard and S. S. Talathi, "Ultrasound for Gaze Estimation — A Modelling and Empirical Study," Sensors, vol. 21, no. 13, Art. no. 4502, 2021.
[3] F. Ryan et al., “Gaze-LLE: Gaze Target Estimation via Large Scale Learned Encoders,” CVPR, pp. 28874-28884, 2026.
[4] E. Wood et al., "A 3D Morphable Eye Region Model for Gaze Estimation," in Proc. ECCV, pp. 297-313, 2016.
[5] R. Liu et al., "PnP-GA+: Plug-and-Play Domain Adaptation for Gaze Estimation Using Model Variants," IEEE TPAMI, vol. 46, no. 5, pp. 3707-3721, 2024.