Media Technology Scientific Seminar

Lecturer (assistant)
Type: Advanced seminar
Duration: 3 SWS
Term: Summer semester 2022
Language of instruction: German
Position within curricula: See TUMonline
Dates: See TUMonline

Admission information


Participants deepen their knowledge in the area of media technology. After completing the course, students are able to work scientifically on a topic in the area of media technology, write a scientific paper, and give a scientific talk.


All participants will give a scientific talk (30 min) on a given topic. They will receive references to related literature and further assistance if required. In addition, they have to summarize the essentials in writing. The main aim of attending this seminar is to familiarize oneself with scientific working methods and to gain experience with modern techniques for speech and presentation. A special characteristic of the Media Technology Seminar is its focus on presentation techniques. In addition to general rules of rhetoric, the use of different media for presentations will be taught. The students will undergo special training to improve their presentation skills.


No specific requirements

Teaching and learning methods

Every participant works on his/her own topic. The goal of the seminar is to train and enhance the ability to work independently on a scientific topic. Every participant is supervised individually by an experienced researcher. This supervisor helps the student to get started, provides links to the relevant literature, and gives feedback and advice on draft versions of the paper and the presentation slides. The main teaching methods are:
- Computer-based presentations by the student
- Work with high-quality, recent scientific publications


- Scientific paper (30%)
- Interaction with the supervisor and working attitude (20%)
- Presentation (30 minutes) and discussion (15 minutes) (50%)

Recommended literature

Recommended literature will be announced in the seminar.


Main subject for SS22: Human-centric Computing for Indoor Environments

  • Our topics have not been assigned to any students yet.
  • Only students who attend the kick-off meeting on 27.04.2022 (April 27th) at 13:15 in Seminar Room 0406 will be given a fixed place (Fixplatz); priority is given according to registration time.

The available topics are given below with further details:

In recent years, assistive robots have been used to help people achieve a better daily life [1]. This makes the question of how to involve robots in humans' daily tasks a very interesting research problem. To address it, we are investigating human activity anticipation for assistive robots: using current or previous information to predict the next human activity and assign it to a robot engaged in an ongoing task.

Duarte et al. [2] predict human activity with a data-driven method and embed the model in the controller of a humanoid robot. Roy et al. [3] introduce an action anticipation method using pairwise human-object interactions and transformers. Girdhar et al. [4] propose an anticipative video transformer to predict the next human action, whose output could be assigned to an assistive robot as a potential task.

You need to explore state-of-the-art approaches by understanding the techniques mentioned above for predicting human activity and evaluating the possibility of applying them to the assistive robot [1].
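To make the anticipation setting concrete, the following sketch predicts the next activity from the current one using simple transition counts. This is only a hypothetical first-order Markov baseline for illustration, not the learned models of [2]-[4]; the activity labels are invented.

```python
from collections import Counter, defaultdict

class NextActivityPredictor:
    """Toy first-order Markov model: predict the next human activity
    from the current one via observed transition counts.
    (Illustrative stand-in for the learned models in [2]-[4].)"""

    def __init__(self):
        self.transitions = defaultdict(Counter)

    def observe(self, sequence):
        # Count transitions between consecutive activities.
        for current, nxt in zip(sequence, sequence[1:]):
            self.transitions[current][nxt] += 1

    def predict(self, current):
        # Return the most frequent successor of the current activity.
        if not self.transitions[current]:
            return None
        return self.transitions[current].most_common(1)[0][0]

predictor = NextActivityPredictor()
predictor.observe(["sit", "reach", "grasp", "drink", "place", "sit"])
predictor.observe(["sit", "reach", "grasp", "drink"])
print(predictor.predict("grasp"))  # prints "drink"
```

The papers above replace these counts with learned sequence models, but the input/output structure (past observations in, next activity out) is the same.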

Supervision: Yuankai Wu (


[1] M. Tröbinger et al., "Introducing GARMI - A Service Robotics Platform to Support the Elderly at Home: Design Philosophy, System Overview and First Results," in IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5857-5864, July 2021, doi: 10.1109/LRA.2021.3082012.

[2] N. F. Duarte, M. Raković, J. Tasevski, M. I. Coco, A. Billard and J. Santos-Victor, "Action Anticipation: Reading the Intentions of Humans and Robots," in IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4132-4139, Oct. 2018, doi: 10.1109/LRA.2018.2861569.

[3] D. Roy and B. Fernando, "Action Anticipation Using Pairwise Human-Object Interactions and Transformers," in IEEE Transactions on Image Processing, vol. 30, pp. 8116-8129, 2021, doi: 10.1109/TIP.2021.3113114.

[4] R. Girdhar and K. Grauman, "Anticipative Video Transformer," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13485-13495, doi: 10.1109/ICCV48922.2021.01325.

In recent years, multiple technologies have been developed to assist or entertain humans in an indoor environment. Such technologies include but are not necessarily limited to assistive robotics and augmented reality. A common denominator between those research areas is a need to understand the 3D indoor environment.

For this, multiple approaches exist, but in this seminar topic the focus will be on using a CAD database to reconstruct the 3D environment. Scan2CAD [1] is the first work that used point cloud data alone to retrieve similar-looking CAD models from the ShapeNet [2] database. IM2CAD [3] and Mask2CAD [4] are similar works that leverage a single RGB image to retrieve a model and reconstruct the scene for a single frame. FroDO [5] extends this to a video sequence, thus reconstructing a complete scene. Lastly, MCSS [6] generates multiple model proposals and optimizes the model choice in an unsupervised manner to reconstruct the scene.

In this topic, you will need to create a comprehensive survey that evaluates multiple state-of-the-art image-to-CAD and video-to-CAD methods in terms of quality and efficiency. In addition, a short quality comparison of video-to-CAD with scan-to-CAD methods is appreciated.

Supervision: Driton Salihu (


[1] Avetisyan, Armen et al. “Scan2CAD: Learning CAD Model Alignment in RGB-D Scans.” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019): 2609-2618.

[2] Chang, A.X., Funkhouser, T.A., Guibas, L.J., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). ShapeNet: An Information-Rich 3D Model Repository. ArXiv, abs/1512.03012.

[3] Izadinia, H., Shan, Q., & Seitz, S.M. (2017). IM2CAD. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2422-2431.

[4] Kuo, W., Angelova, A., Lin, T., & Dai, A. (2020). Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve. ECCV.

[5] Li, K., Rünz, M., Tang, M., Ma, L., Kong, C., Schmidt, T., Reid, I.D., Agapito, L.D., Straub, J., Lovegrove, S., & Newcombe, R.A. (2020). FroDO: From Detections to 3D Objects. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 14708-14717.

[6] Hampali, S., Stekovic, S., Sarkar, S.D., Kumar, C.S., Fraundorfer, F., & Lepetit, V. (2021). Monte Carlo Scene Search for 3D Scene Understanding. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 13799-13808.

Summary: Human activity recognition is an important problem, with applications ranging from assistive driving to elderly care. For activity recognition pipelines to become an integral part of our everyday lives, the recognition should be done remotely, i.e., non-invasively, and preserve the user's privacy [1]. However, conventional non-invasive sensors such as RGB cameras cannot be used in privacy-sensitive environments such as our homes. That is where mmWave radars come into play: they capture valuable data such as motion while being dense enough that informed decisions can be made from the raw radar data.

In this project we will investigate how mmWave radar data can be used for human activity recognition with machine learning methods. The student needs to investigate how to circumvent the problems induced by the sensors, such as noisy data, and how to develop tailor-made models for inference on radar data. Approaches can be either fully data-driven, for example with Graph Neural Networks [2], or a mixture of signal processing and machine learning [3]. The student should compare and contrast different approaches and suggest what would work best in the setting of indoor scenes.

Supervision: Cem Eteke (


[1] Li, Xinyu, Yuan He, and Xiaojun Jing. "A survey of deep learning-based human activity recognition in radar." Remote Sensing 11.9 (2019): 1068.

[2] Gong, Peixian, Chunyu Wang, and Lihua Zhang. "Mmpoint-GNN: graph neural network with dynamic edges for human activity recognition through a millimeter-wave radar." 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021.

[3] Wang, Yuheng, et al. "m-Activity: Accurate and Real-Time Human Activity Recognition Via Millimeter Wave Radar." ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.


In the field of computer vision, scene understanding has been of high interest for executing tasks in indoor environments that involve humans and a variety of objects. A scene graph is a powerful tool that can clearly express the objects, their attributes, and the relationships between objects in the scene: the nodes represent the detected target objects, whereas the edges denote the detected pairwise relationships. Compared with a 2D scene graph, a 3D scene graph provides more complete high-level semantic information, which is very helpful for an in-depth understanding of human-object or human-human interaction in a complex real-world 3D indoor environment.
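The node/edge structure described above can be sketched as a small data structure. This is a hypothetical minimal representation for illustration only; the systems in [1]-[4] predict such graphs from sensor data, and the object names and relations here are invented.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Minimal scene-graph sketch: nodes are detected objects with
    attributes, edges are directed pairwise relationships.
    Illustrative only; real systems [1]-[4] learn this from data."""
    nodes: dict = field(default_factory=dict)   # id -> (label, attributes)
    edges: list = field(default_factory=list)   # (subject_id, predicate, object_id)

    def add_object(self, obj_id, label, **attributes):
        self.nodes[obj_id] = (label, attributes)

    def add_relation(self, subj, predicate, obj):
        self.edges.append((subj, predicate, obj))

    def relations_of(self, obj_id):
        # All relationships in which the given object is the subject.
        return [(s, p, o) for s, p, o in self.edges if s == obj_id]

g = SceneGraph()
g.add_object("chair_1", "chair", color="red")
g.add_object("table_1", "table", material="wood")
g.add_relation("chair_1", "standing_next_to", "table_1")
print(g.relations_of("chair_1"))  # [('chair_1', 'standing_next_to', 'table_1')]
```

Downstream tasks (navigation, interaction understanding) can then be phrased as queries over such a graph.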

In this topic, the student needs to review and understand the state-of-the-art approaches for the 3D scene graph generation [1][2][3][4], demonstrate the acquired 3D scene graph from the indoor environment, and discuss the possible downstream tasks with the help of the 3D scene graph.

Supervision: Dong Yang ( 


[1] Wald, Johanna, et al. "Learning 3d semantic scene graphs from 3d indoor reconstructions." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[2] Armeni, Iro, et al. "3d scene graph: A structure for unified semantics, 3d space, and camera." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
[3] Hughes, Nathan, Yun Chang, and Luca Carlone. "Hydra: A Real-time Spatial Perception Engine for 3D Scene Graph Construction and Optimization." arXiv preprint arXiv:2201.13360 (2022).
[4] Kim, Ue-Hwan, et al. "3-d scene graph: A sparse and semantic representation of physical environments for intelligent agents." IEEE transactions on cybernetics 50.12 (2019): 4921-4933.


Precisely estimating the 6D pose of objects is a crucial task for human activity recognition and robotic manipulation. For example, the way in which a tool is grasped, or the angle at which it is held, can lead to completely different activities being performed by the user. The precise pose can also be very relevant in an industrial environment when assembling delicate pieces. This is a long-standing problem in computer vision that still faces severe issues when dealing, for example, with occlusions or symmetrical objects.

Recently, innovative solutions to this problem have been proposed, for example, training on realistic synthetic data [1], new convolution operations [2], and adversarial examples [3].

The student is expected to research the state of the art in 6D object pose estimation and analyze the drawbacks and advantages of the existing methods.

Supervision: Diego Prado (


[1] Wang, Gu, et al. "Occlusion-Aware Self-Supervised Monocular 6D Object Pose Estimation." IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).

[2] Lin, Jiehong, et al. "Sparse Steerable Convolutions: An Efficient Learning of SE (3)-Equivariant Features for Estimation and Tracking of Object Poses in 3D Space." Advances in Neural Information Processing Systems 34 (2021).

[3] Zhang, Jinlai, et al. "Adversarial samples for deep monocular 6D object pose estimation." arXiv preprint arXiv:2203.00302 (2022).


In recent years, attention-based models such as Transformers have come to dominate natural language processing tasks and have further shown state-of-the-art performance in vision-based tasks such as video classification and action recognition [1].

Arnab et al. [1] introduce a video classification transformer that factorizes the input video along its spatial and temporal dimensions, outperforming prior work based on 3D convolutional neural networks. Zha et al. [2] employ a Transformer-based model for spatio-temporal representation learning that considers intra- and inter-frame dependencies within a video. Neimark et al. [3] introduce a video transformer network that extracts features from video frames with a 2D convolutional network to account for spatial information and utilizes a transformer to model temporal relationships.
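The spatial/temporal factorization can be illustrated with a stripped-down NumPy sketch: attend spatially within each frame, then temporally across frames for each spatial location. This is schematic only (no learned projections, no multi-head attention, random tokens), not the actual ViViT [1] implementation.

```python
import numpy as np

def attention(x):
    """Plain single-head self-attention over the first axis of x
    (queries = keys = values = x, no learned projections, for brevity)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_spatiotemporal_attention(x):
    """x: (T frames, S patch tokens, d channels).
    First attend spatially within each frame, then temporally across
    frames per spatial location, as in factorized video transformers
    such as ViViT [1] (schematic sketch)."""
    T, S, d = x.shape
    spatial = np.stack([attention(x[t]) for t in range(T)])               # (T, S, d)
    temporal = np.stack([attention(spatial[:, s]) for s in range(S)], axis=1)
    return temporal                                                        # (T, S, d)

video_tokens = np.random.randn(8, 16, 32)   # 8 frames, 16 patches, 32-dim tokens
out = factorized_spatiotemporal_attention(video_tokens)
print(out.shape)  # prints (8, 16, 32)
```

The point of the factorization is cost: two attentions of size S and T replace one joint attention over S·T tokens.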

You need to explore state-of-the-art approaches by understanding the fundamentals of attention-based models (in particular Transformers) and evaluating their applicability in the field of action recognition.

Supervision:  Constantin Patsch (


[1] Arnab, Anurag, et al. "Vivit: A video vision transformer." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

[2] Zha, Xuefan, et al. "Shifted chunk transformer for spatio-temporal representational learning." Advances in Neural Information Processing Systems 34 (2021).

[3] Neimark, Daniel, et al. "Video transformer network." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

Sharing the control of a robotic system with an autonomous controller allows a human to reduce his/her cognitive and physical workload during the execution of a task. In recent years, the development of inference and learning techniques has widened the spectrum of applications of shared control (SC) approaches, leading to robotic systems that are capable of seamlessly adapting their autonomy level. From this perspective, shared autonomy (SA) can be defined as the design paradigm that enables this adaptive behaviour of the robotic system. This topic collects the latest results achieved by the research community in the fields of SC and SA, with special emphasis on physical human-robot interaction (pHRI). Architectures and methods developed for SC and SA are discussed throughout the seminar, highlighting the key aspects of each methodology [1].

Supervision: Edwin Babaians  (


[1] Selvaggio, Mario, Marco Cognetti, Stefanos Nikolaidis, Serena Ivaldi, and Bruno Siciliano. "Autonomy in physical human-robot interaction: A brief survey." IEEE Robotics and Automation Letters (2021).


Robot intelligence has been evolving for decades, yet it is not mature enough for robots to operate fully autonomously. Various teleoperation interfaces have been developed for a human operator to control a robot directly or semi-autonomously. In this seminar topic, we will focus on user interfaces based on human hand pose recognition that are designed for intuitive robotic teleoperation. We will investigate different computer vision algorithms that recognize the human hand pose and map it to a robot frame, e.g. to the end-effector. Ideally, such systems should also allow non-experts to remotely control a complicated robotic system safely and intuitively.
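The mapping step mentioned above (hand pose in the camera frame to a robot frame) boils down to a frame transformation once the camera is calibrated against the robot. The sketch below shows this for a single wrist position; the transform values are hypothetical, and full retargeting systems [2]-[5] also map orientation and finger joints.

```python
import numpy as np

def map_hand_to_robot(wrist_position_cam, T_robot_cam):
    """Map a tracked wrist position from the camera frame into the
    robot base frame with a fixed homogeneous transform: a common
    first step when retargeting a hand pose to an end-effector.
    (Illustrative sketch; calibration of T_robot_cam is assumed.)"""
    p = np.append(wrist_position_cam, 1.0)   # homogeneous coordinates
    return (T_robot_cam @ p)[:3]

# Hypothetical calibration: camera 1 m in front of the robot base,
# axes aligned (identity rotation).
T_robot_cam = np.eye(4)
T_robot_cam[:3, 3] = [1.0, 0.0, 0.0]

wrist_cam = np.array([0.2, -0.1, 0.5])
target = map_hand_to_robot(wrist_cam, T_robot_cam)
print(target)  # wrist translated into the robot base frame: [1.2, -0.1, 0.5]
```

In a real interface, `target` would be fed to the robot's inverse kinematics as the desired end-effector position.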

Supervision: Furkan Kaynar (   


[1] Oudah, Munir, Ali Al-Naji, and Javaan Chahl. "Hand gesture recognition based on computer vision: a review of techniques." journal of Imaging 6.8 (2020): 73.

[2] Li, Rui, Hongyu Wang, and Zhenyu Liu. "Survey on mapping human hand motion to robotic hands for teleoperation." IEEE Transactions on Circuits and Systems for Video Technology (2021).

[3] Meeker, Cassie, Thomas Rasmussen, and Matei Ciocarlie. "Intuitive hand teleoperation by novice operators using a continuous teleoperation subspace." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.

[4] Li, Shuang, et al. "Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network." 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019.

[5] Handa, Ankur, et al. "Dexpilot: Vision-based teleoperation of dexterous robotic hand-arm system." 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.

Hand-object pose estimation is the task of jointly estimating the poses of a model-based hand and an object from different input modalities [1].

Multiple modalities, such as color/depth [3], and single [1][4] or multiple views [2][3], can be used to overcome hand-hand [4] and hand-object [4] occlusions. Furthermore, synthetic data generation can be used to generate or enrich datasets [4][5].

The goal is to conduct a literature review, compare different approaches on common datasets [2][3][4], and summarize how the above-mentioned challenges are overcome.

Supervision: Marsil Zakour (


[1] B. Doosti, S. Naha, M. Mirbagheri and D. J. Crandall, "HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6607-6616, doi: 10.1109/CVPR42600.2020.00664.

[2] Hampali, Shreyas, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. “HOnnotate: A Method for 3D Annotation of Hand and Object Poses.” ArXiv:1907.01481 [Cs], May 30, 2020.

[3] Chao, Yu-Wei, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S Narang, et al. “DexYCB: A Benchmark for Capturing Hand Grasping of Objects,” n.d., 10.

[4] Moon, Gyeongsik, Shoou-i Yu, He Wen, Takaaki Shiratori, and Kyoung Mu Lee. “InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image.” ArXiv:2008.09309 [Cs], August 21, 2020.

[5] Hasson, Yana, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. “Learning Joint Reconstruction of Hands and Manipulated Objects.” ArXiv:1904.05767 [Cs], April 11, 2019.

Bilateral teleoperation with haptic feedback provides its users with a new dimension of immersion in virtual or remote environments. This technology enables a great variety of applications in robotics and virtual reality, such as remote surgery and industrial digital twins [1].

Subjective quality of experience (QoE) reflects the operator's impression of the force interaction with the remote environment and is fundamentally determined by the force feedback signal rendered by the haptic device [2]. However, unlike the subjective assessment of visual signals such as video and images, the subjective evaluation of haptic signals is remarkably more complicated and time-consuming to conduct, because subjects need specific haptic hardware to experience the force signal [3]. Subjective QoE assessment models learn from existing subjective experimental results and predict the users' evaluations of feedback signals. To achieve real-time evaluation performance, reduced-reference or no-reference evaluation methods are more favorable.

Your quest is to explore state-of-the-art reduced-reference or no-reference methods for subjective QoE evaluation, and to combine this knowledge with real-time teleoperation systems using different control algorithms to discover new methods for online subjective QoE assessment of haptic signals.

Supervision: Zican Wang (


[1] P. F. Hokayem, M. W. Spong, “Bilateral teleoperation: An historical survey,” Automatica, vol. 42, no. 12, pp. 2035-2057, December 2006.

[2] C. Passenberg, A. Peer, M. Buss, “A survey of environment-, operator-, and task-adapted controllers for teleoperation systems,” Mechatronics, vol. 20, no. 7, pp. 787-801, October 2010.

[3] K. Antonakoglou, et al., “Toward Haptic Communications Over the 5G Tactile Internet,” IEEE Comm. Surveys & Tutorials, vol. 20, no. 4, pp. 3034-3059, June 2018.

State-of-the-art object detection models achieve great results but require an expensive annotation process for training. An emerging concept consists in leveraging natural language (e.g. captions, speech, ...) to alleviate the annotation burden. In this seminar topic, we investigate recent advances in object detection using both image and language as input modalities.

Ye et al. Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection [1].

Maaz et al. Multi-modal Transformers Excel at Class-agnostic Object Detection [2].

Zereian et al. Open-Vocabulary Object Detection Using Captions [3].

Zhong et al. Learning to Generate Scene Graph from Natural Language Supervision [4].

Supervision: Piccolrovazzi Martin (  


[1] Ye et al. "Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection", IEEE/CVF International Conference on Computer Vision. 2019

[2] Maaz et al. "Multi-modal Transformers Excel at Class-agnostic Object Detection"

[3] Zereian et al. "Open-Vocabulary Object Detection Using Captions" , Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021

[4] Zhong et al. "Learning to Generate Scene Graph from Natural Language Supervision", Proceedings of the IEEE/CVF International Conference on Computer Vision.

In shared control teleoperation, robot intelligence and human input are blended to achieve improved task performance and reduce the human workload. Shared control gains particular importance when the network quality during teleoperation is poor, and when the robot's intelligence is not sufficient to perform the task autonomously. However, for a successful shared control approach, three very important questions need to be addressed:

1) What is the intent of the user?

2) How can the robot gain intelligence about the task, and how confident is it?

3) How should the user input and the robot intelligence be blended?

In this topic, we will investigate state-of-the-art approaches that address the above questions and discuss their usability and shortcomings.
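Question 3 is often answered with linear arbitration in the spirit of the policy-blending formalism [1]. The sketch below is a simplified illustration, not the formalism itself: the arbitration weight is taken directly from a confidence score, whereas [1] derives it from the inferred goal distribution.

```python
import numpy as np

def blend_commands(u_human, u_robot, confidence):
    """Linear policy blending (simplified, after [1]): the arbitration
    factor alpha grows with the robot's confidence in its intent
    prediction; alpha = 0 yields pure teleoperation, alpha = 1 full
    autonomy."""
    alpha = np.clip(confidence, 0.0, 1.0)
    return alpha * u_robot + (1.0 - alpha) * u_human

u_h = np.array([0.0, 1.0])   # operator velocity command
u_r = np.array([1.0, 0.0])   # autonomous command toward the predicted goal
print(blend_commands(u_h, u_r, 0.25))  # low confidence: mostly follows the human
```

The research questions above then reduce to estimating the intent behind `u_r` and choosing a principled schedule for the confidence term.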

Supervision: Basak Gülecyüz (    


[1] Dragan AD, Srinivasa SS. A policy-blending formalism for shared control. The International Journal of Robotics Research. 2013

[2] A. K. Tanwani and S. Calinon, "A generative model for intention recognition and manipulation assistance in teleoperation," 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017

[3] F. Abi-Farraj, T. Osa, N. P. J. Peters, G. Neumann and P. R. Giordano, "A learning-based shared control architecture for interactive task execution," 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017

[4] K. T. Ly, M. Poozhiyil, H. Pandya, G. Neumann and A. Kucukyilmaz, "Intent-Aware Predictive Haptic Guidance and its Application to Shared Control Teleoperation," 2021 30th IEEE International Conference on Robot & Human Interactive Communication (RO-MAN), 2021

[5] M. Zurek, A. Bobu, D. S. Brown and A. D. Dragan, "Situational Confidence Assistance for Lifelong Shared Autonomy," 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021

[6] Oh, Yoojin, et al. “Learning arbitration for shared autonomy by hindsight data aggregation.” ArXiv Preprint ArXiv:1906.12280, 2019

A key factor for safe human-robot interaction is the reliability of the robot's behaviour. For example, it must be guaranteed that a robotic arm will not hit its human operator and that an autonomous vehicle will not crash into pedestrians [1]. Especially in indoor environments, where many robots and humans share a constrained space [3], accurate and stable localization is essential. Many other systems, such as motion planning, rely on a precise pose estimate.

However, an incorrect pose estimate can have dramatic consequences. For example, an Automated Guided Vehicle (AGV) could crash into another AGV, or even into a person, if it has erroneous knowledge of its localization and the motion planning fails. Yet it is very likely that the AGV is observed by another AGV nearby, which would be able to detect the failed localization and communicate it. For example, Murai et al. [2] recently proposed the Robot Web, which allows a large number of robots to communicate with each other to optimize their pose estimates, even with faulty sensor readings.
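The core operation behind combining a robot's own estimate with observations from neighbors can be illustrated with inverse-variance fusion of Gaussian estimates. This toy 1-D example is only meant to convey the idea of information-form fusion used in message-passing schemes such as the Robot Web [2]; it is not the paper's full distributed algorithm, and the numbers are invented.

```python
import numpy as np

def fuse_estimates(means, variances):
    """Inverse-variance (information-form) fusion of independent 1-D
    Gaussian pose estimates: each estimate contributes in proportion
    to its precision. Toy illustration of the fusion step underlying
    distributed many-device localization [2]."""
    info = 1.0 / np.asarray(variances)              # precision of each estimate
    fused_var = 1.0 / info.sum()                    # combined uncertainty shrinks
    fused_mean = fused_var * (info * np.asarray(means)).sum()
    return fused_mean, fused_var

# An AGV's own (noisy) position estimate plus a neighbor's observation:
mean, var = fuse_estimates([2.0, 1.0], [1.0, 1.0])
print(mean, var)  # prints 1.5 0.5 -- fused estimate is more certain than either input
```

Note how the fused variance is smaller than either input variance: this is why communicating with nearby robots can catch and correct a failed localization.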

For this topic, the student should investigate the proposed Robot Web [2] and conduct an extensive literature review. In the end, the student should be able to present the key aspects of many-device localization and show its advantages for human-centric computing.

Supervision: Sebastian Eger (


[1] Lasota, Przemyslaw A., Terrence Fong, and Julie A. Shah. 2017. “A Survey of Methods for Safe Human-Robot Interaction.” Foundations and Trends in Robotics 5 (3): 261–349.

[2] Murai, Riku, Joseph Ortiz, Sajad Saeedi, Paul H. J. Kelly, and Andrew J. Davison. 2022. “A Robot Web for Distributed Many-Device Localisation.” ArXiv:2202.03314 [Cs], February.

[3] Bechtsis, Dimitrios, Naoum Tsolakis, Dimitrios Vlachos, and Eleftherios Iakovou. 2017. “Sustainable Supply Chain Management in the Digitalisation Era: The Impact of Automated Guided Vehicles.” Journal of Cleaner Production 142 (January): 3970–84.