Seminar Media Technology

Lecturer (assistant)
Type: Advanced seminar
Duration: 3 SWS
Term: Winter semester 2023/24
Language of instruction: German
Position within curricula: See TUMonline
Dates: See TUMonline

Admission information

See TUMonline
Note: registration via TUMonline


Participants deepen their knowledge in the area of media technology. After completing the course, students are able to scientifically work on a topic in the area of media technology, write a scientific paper and give a scientific talk.


Selected topics in media technology. The focus is on current research topics and new technologies. The participants study recent publications, prepare a summary in the form of a scientific paper and present the topic to the audience.


No specific requirements

Teaching and learning methods

Every participant works on their own topic. The goal of the seminar is to train and enhance the ability to work independently on a scientific topic. Every participant is supervised individually by an experienced researcher. This supervisor helps the student to get started, provides links to the relevant literature and gives feedback and advice on draft versions of the paper and the presentation slides.

The main teaching methods are:
- Computer-based presentations by the student
- Independent work with recent, high-quality scientific publications


- Scientific paper (50%)
- Presentation (30 minutes) and discussion (15 minutes) (50%)

Recommended literature

Recommended literature will be announced in the seminar.


Umbrella topic for WS23/24: "Generative AI: Redefining Possibilities in Science and Technology"

The kick-off meeting for the seminar is on 18.10.2023 at 13:15 in Seminar Room 0406.

Attendance is mandatory to get a fixed place in the course!


This semester's media technology scientific seminar focuses on Generative AI. The aim is to investigate its potential, advancements and future directions in various application domains. More details will be provided during the kick-off meeting.

The available specific topics are listed below:

Diffusion models show strong performance in high-quality image generation [1,2,3]. This has led to their adoption in a wide variety of generative tasks. However, these models usually require immense computational resources: some algorithms need to run the model more than 1000 times to generate a single image.

To alleviate this, some researchers have proposed changing the sampling scheme by making it deterministic [4] or by incorporating better differential equation solvers [5,6]. These approaches have been shown to reduce the number of model calls to the range of 10-50. Other techniques distill the learned dynamics of the diffusion model into a one-shot image generator by forcing the model to generate its own synthetic dataset [7,8,9]. Some distillation techniques skip this costly step and extract the knowledge directly from the model [10]. Finally, a recent algorithm can either distill knowledge from a pretrained diffusion model or directly train a one-shot image generator with a few modifications to the standard diffusion model training pipeline [11]. This technique also allows generating better-quality images by running the model multiple times.
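The deterministic sampling idea from [4] can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the method of any cited paper's code: the linear noise schedule, the scalar "image", and `toy_eps_model`, a hypothetical stand-in for a trained network that returns the exact noise for data concentrated at a single value `mu`. The point is only that the sampler visits 10 of the 1000 training timesteps and calls the model once per visited step.

```python
import math
import random


def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    prod = 1.0
    for t in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_train_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def toy_eps_model(x_t, alpha_bar, mu=2.0):
    """Hypothetical noise predictor: exact when all data equals mu."""
    return (x_t - math.sqrt(alpha_bar) * mu) / math.sqrt(1.0 - alpha_bar)


def ddim_sample(eps_model, alpha_bars, num_steps=10, seed=0):
    """Deterministic (eta = 0) sampling over an evenly spaced timestep subsequence."""
    rng = random.Random(seed)
    last = len(alpha_bars) - 1
    # Visit only num_steps timesteps, from the noisiest down to t = 0.
    timesteps = [round(i * last / (num_steps - 1)) for i in range(num_steps)][::-1]
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    calls = 0
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        # After the final step the sample is treated as noise-free (alpha_bar = 1).
        ab_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, ab_t)
        calls += 1
        # Predict the clean sample, then move it to the next (lower) noise level.
        x0_pred = (x - math.sqrt(1.0 - ab_t) * eps) / math.sqrt(ab_t)
        x = math.sqrt(ab_prev) * x0_pred + math.sqrt(1.0 - ab_prev) * eps
    return x, calls
```

Calling `ddim_sample(toy_eps_model, make_alpha_bars(), num_steps=10)` returns a sample together with the number of model calls; with a real network, the quality reached at a given step count is what the cited papers study.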


In this seminar, the student is expected to learn the working mechanisms of diffusion models, find state-of-the-art approaches to optimize them for image generation tasks, produce a well-written summary of the techniques in this research domain and present their findings to an audience at the end of the semester.


Supervision: Burak Dogaroglu (


[1] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” arXiv, 2015. doi: 10.48550/ARXIV.1503.03585.

[2] Y. Song and S. Ermon, “Generative Modeling by Estimating Gradients of the Data Distribution.” arXiv, 2019. doi: 10.48550/ARXIV.1907.05600.

[3] J. Ho, A. Jain, and P. Abbeel, “Denoising Diffusion Probabilistic Models.” arXiv, 2020. doi: 10.48550/ARXIV.2006.11239.

[4] J. Song, C. Meng, and S. Ermon, “Denoising Diffusion Implicit Models.” arXiv, 2020. doi: 10.48550/ARXIV.2010.02502.

[5] C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu, “DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps.” arXiv, 2022. doi: 10.48550/ARXIV.2206.00927.

[6] Q. Zhang and Y. Chen, “Fast Sampling of Diffusion Models with Exponential Integrator.” arXiv, 2022. doi: 10.48550/ARXIV.2204.13902.

[7] E. Luhman and T. Luhman, “Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed.” arXiv, 2021. doi: 10.48550/ARXIV.2101.02388.

[8] H. Zheng, W. Nie, A. Vahdat, K. Azizzadenesheli, and A. Anandkumar, “Fast Sampling of Diffusion Models via Operator Learning.” arXiv, 2022. doi: 10.48550/ARXIV.2211.13449.

[9] X. Liu, C. Gong, and Q. Liu, “Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow.” arXiv, 2022. doi: 10.48550/ARXIV.2209.03003.

[10] T. Salimans and J. Ho, “Progressive Distillation for Fast Sampling of Diffusion Models.” arXiv, 2022. doi: 10.48550/ARXIV.2202.00512.

[11] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models.” arXiv, 2023. doi: 10.48550/ARXIV.2303.01469.

In haptics, we aim at displaying sensations of touch to the user. One possible application can be to represent realistic textures via vibrations. These vibrations (or tactile signals) could be generated from images, which has the advantage that a correspondence of visual and haptic feedback is achieved. Generative Adversarial Networks (GANs) are a tool that has been used in the past to process texture images and create artificial tactile signals from their spectrogram.

The student is asked to comprehend and summarize some papers in the research area described above. Sources [1] and [2] as well as [3] and [4] are connected. They aim at generating tactile signals for tablet-like screens.

Supervision: Lars Nockenberg (


[1] S. Cai, Y. Ban, T. Narumi, and K. Zhu, “FrictGAN: Frictional Signal Generation from Fabric Texture Images using Generative Adversarial Network,” ICAT-EGVE 2020 - International Conference on Artificial Reality and Telexistence and Eurographics Symposium on Virtual Environments, 5 pages, 2020, doi: 10.2312/EGVE.20201254.

[2] S. Cai, L. Zhao, Y. Ban, T. Narumi, Y. Liu, and K. Zhu, “GAN-based image-to-friction generation for tactile simulation of fabric material,” Computers & Graphics, vol. 102, pp. 460–473, Feb. 2022, doi: 10.1016/j.cag.2021.09.007.

[3] Y. Ujitoko and Y. Ban, “Vibrotactile Signal Generation from Texture Images or Attributes Using Generative Adversarial Network,” in Haptics: Science, Technology, and Applications, D. Prattichizzo, H. Shinoda, H. Z. Tan, E. Ruffaldi, and A. Frisoli, Eds., in Lecture Notes in Computer Science, vol. 10894. Cham: Springer International Publishing, 2018, pp. 25–36. doi: 10.1007/978-3-319-93399-3_3.

[4] Y. Ujitoko, Y. Ban, and K. Hirota, “GAN-Based Fine-Tuning of Vibrotactile Signals to Render Material Surfaces,” IEEE Access, vol. 8, pp. 16656–16661, 2020, doi: 10.1109/ACCESS.2020.2968185.

Indoor localization has gained significant importance in recent years, with applications ranging from asset tracking in industrial environments to providing location-based services in shopping malls and museums. Fingerprinting-based indoor localization is a popular technique that relies on creating a database of signal fingerprints to estimate a user's location. However, maintaining and updating these fingerprint databases can be cumbersome and time-consuming and is considered one of the major drawbacks of fingerprinting [1].
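As a rough illustration of how a fingerprint database is used, the sketch below matches a new measurement against stored fingerprints with a k-nearest-neighbour lookup in signal space. The radio map values, the three access points, and the function name `locate` are hypothetical and chosen only for this example; real systems differ in their features and matching method.

```python
import math

# Hypothetical radio map: (x, y) position in metres -> RSSI in dBm from three access points.
RADIO_MAP = [
    ((0.0, 0.0), (-40.0, -70.0, -80.0)),
    ((5.0, 0.0), (-55.0, -50.0, -75.0)),
    ((0.0, 5.0), (-60.0, -72.0, -52.0)),
    ((5.0, 5.0), (-68.0, -58.0, -50.0)),
]


def locate(rssi, radio_map=RADIO_MAP, k=1):
    """Average the positions of the k fingerprints closest to the measurement."""
    # Rank stored fingerprints by Euclidean distance in signal space.
    ranked = sorted(radio_map, key=lambda entry: math.dist(entry[1], rssi))
    nearest = ranked[:k]
    x = sum(pos[0] for pos, _ in nearest) / k
    y = sum(pos[1] for pos, _ in nearest) / k
    return (x, y)
```

Every entry in such a map has to be measured on site and re-measured when the environment changes, which is the maintenance burden described above.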

This seminar topic aims to explore the integration of generative AI techniques into fingerprinting-based indoor localization systems. Generative AI, particularly deep learning models like Generative Adversarial Networks (GANs) [1] and Variational Autoencoders (VAEs), have shown remarkable capabilities in generating data and enhancing various localization tasks. By leveraging generative AI, we can potentially address the challenges associated with traditional fingerprinting methods and improve the accuracy and efficiency of indoor localization systems [2,3,4].

During the seminar, your tasks will include researching, gathering, and conducting a comparative analysis of the current generative AI-based approaches to improve fingerprinting-based indoor localization. Furthermore, you will be expected to comprehensively explain the fundamental principles underpinning these methods, assess and contrast their effectiveness, and ultimately deliver a state-of-the-art summary. Additionally, your presentation should encompass future prospects and potential areas for further research within this domain.

Supervision:  Majdi Abdmoulah (  


[1] W. Njima, A. Bazzi, and M. Chafii, “DNN-Based Indoor Localization Under Limited Dataset Using GANs and Semi-Supervised Learning,” IEEE Access, vol. 10, pp. 69896–69909, 2022.

[2] D. Quezada-Gaibor, J. Torres-Sospedra, J. Nurmi, Y. Koucheryavy, and J. Huerta, “SURIMI: Supervised Radio Map Augmentation with Deep Learning and a Generative Adversarial Network for Fingerprint-based Indoor Positioning,” in 2022 IEEE 12th International Conference on Indoor Positioning and Indoor Navigation (IPIN), Sep. 2022, pp. 1–8.

[3] W. Liu, Y. Zhang, Z. Deng, and H. Zhou, “Low-Cost Indoor Wireless Fingerprint Location Database Construction Methods: A Review,” IEEE Access, vol. 11, pp. 37535–37545, 2023.

[4] X.-Y. Liu and X. Wang, “Real-Time Indoor Localization for Smartphones Using Tensor-Generative Adversarial Nets,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 8, pp. 3433–3443, Aug. 2021.

In recent years, high-resolution image generation models such as the Latent Diffusion Model (LDM) [1] and DALL-E 2 [2] have achieved remarkable performance and have been successfully applied in many areas of daily life. However, video generation is still a challenging task. Especially when generating a long photorealistic video, the long-term connections between frames must be properly modeled, and the computation is often very intensive.

Several diffusion-based models have been proposed to tackle these difficulties. For example, Ho et al. [3] propose a 3D U-Net to extract temporal relations between frames. Blattmann et al. [4] insert temporal layers into a pre-trained image LDM to maintain temporal coherence. Luo et al. [5] introduce a decomposed diffusion process, which shares the same base noise across all frames in a video. Its image backbone is a pre-trained DALL-E 2 generator.
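The shared base noise idea can be sketched in a few lines. This is an illustrative assumption, not the exact formulation of [5]: each frame's noise is a mix of one base noise vector shared by all frames and a frame-specific residual, weighted by a hypothetical mixing factor `lam` so that the per-frame noise keeps unit variance.

```python
import math
import random


def frame_noises(num_frames, dim, lam, seed=0):
    """Per-frame noise = sqrt(lam) * shared base + sqrt(1 - lam) * residual."""
    rng = random.Random(seed)
    # One base noise vector shared by every frame of the video.
    base = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    frames = []
    for _ in range(num_frames):
        # Residual noise drawn independently for each frame.
        residual = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append([
            math.sqrt(lam) * b + math.sqrt(1.0 - lam) * r
            for b, r in zip(base, residual)
        ])
    return frames
```

With `lam=1.0` all frames receive identical noise, with `lam=0.0` the frames are independent, and values in between make the noise of different frames correlated, which is one way to encourage temporally coherent content.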

In this seminar, we will explore the state-of-the-art models for video generation. The student is expected to review and compare at least two recent approaches. If the source code of a model is publicly available, the student is encouraged to try the model and generate their own videos.

Supervision:  Zhifan Ni (


[1] Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models”, CVPR 2022.

[2] Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv:2204.06125.

[3] Ho et al., “Video Diffusion Models”, NeurIPS 2022.

[4] Blattmann et al., “Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models”, CVPR 2023.

[5] Luo et al., “VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation”, CVPR 2023.


Radar sensing is an old technology, developed in the early 20th century. Radar is currently experiencing a renaissance thanks to the easy and cost-effective availability of sensors with sufficient performance for most applications.

The application of AI-based algorithms is currently a hot topic in radar research, since all current processing and detection approaches are based on conventional, non-learning-based algorithms. This opens up a large field of research for radar in the upcoming years.


While radar is currently used mostly in the automotive sector to increase safety and reliability in driving, there are also applications of this kind of sensor in other fields of research. One example is imaging and detection of (static) objects in indoor environments. Since imaging radars provide the horizontal and vertical detection angles as well as depth information of the scene, this information can be used to generate an estimate of the environment in the form of a generated natural image, similar to [1].

The goal of this seminar work is to research and summarize multiple possible generative approaches for image generation that are applicable to radar data. The input could be raw radar data in the form of time samples, or processed radar images (Range-Doppler map, Range-Angle map), which amounts to a kind of image-to-image translation, similar to [2]. Further, the work should also include an outlook on [3] as a generator instance for the given input.

The student is required to understand and summarise the state-of-the-art approaches of HDR image enhancement and tone mapping. This will include the explanation of different network structures, optimisation goals, metrics in evaluation and comparison to previous work. The student is encouraged to read and include more related research papers before presenting their findings in a seminar paper and presentation.

Supervision:  Stefan Hägele (


[1] Kato, Sorachi et al. "CSI2Image: Image Reconstruction from Channel State Information Using Generative Adversarial Networks."  IEEE Access (Volume 9), (2021).

[2] Isola, Phillip et al. "Image-to-Image Translation with Conditional Adversarial Networks"  IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017).

[3] Rombach, Robin et al. "High-Resolution Image Synthesis with Latent Diffusion Models"  IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2022).

Recent advancements in learned image compression have surpassed traditional hand-engineered algorithms such as JPEG [1], and have even achieved rate-distortion performance comparable to modern video coding standards such as VVC [2]. Among the successful approaches, autoencoders with an entropy-constrained bottleneck based on [3, 4] have shown promising results, where the entropy of the latent elements is modeled and minimized jointly with an image distortion metric. The mean squared error (MSE) is commonly used as a distortion metric to obtain pixel-wise high-fidelity reconstructions. However, networks trained with MSE yield blurry reconstructions in low-bitrate regimes. Perceptually optimized metrics such as MS-SSIM [5] and LPIPS [6] have also been investigated in many studies to generate more “pleasant” reconstructions by better preserving textures even at low bitrates. However, those metrics suffer from other artifacts, such as checkerboard artifacts and poor reconstruction of text content. More recent studies [7, 8] demonstrated that generative adversarial network (GAN)-based training schemes can provide higher subjective quality at extremely low bitrates. Furthermore, researchers [9, 10] have also explored diffusion-based models that outperform their GAN-based counterparts.
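The joint minimization of entropy and distortion is commonly written as a rate-distortion objective. The sketch below is a simplified, assumed form for a single image: the rate is the ideal code length of the latent elements under their modeled probabilities, the distortion is the MSE, and `lam` is a hypothetical trade-off weight. The function name and the toy inputs are illustrative and not taken from [3, 4].

```python
import math


def rate_distortion_loss(latent_probs, x, x_hat, lam):
    """Return rate (bits per pixel) + lam * MSE for one image."""
    num_pixels = len(x)
    # Rate: ideal code length of the latents under the entropy model, in bits.
    rate_bits = -sum(math.log2(p) for p in latent_probs)
    # Distortion: mean squared error between original and reconstruction.
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / num_pixels
    return rate_bits / num_pixels + lam * mse
```

A larger `lam` favours fidelity at the cost of bitrate. The perceptual and adversarial approaches discussed above replace or extend the MSE term while keeping the rate term.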

The student is tasked with comprehending and summarizing the current state-of-the-art approaches in generative learned image compression algorithms. This entails explaining various network structures, training techniques, metrics for evaluation, and comparisons with previous works. The student is encouraged to conduct extensive research and incorporate relevant research papers to augment their findings, which will be presented in a seminar paper and presentation.

Supervision:  A. Burakhan Koyuncu (


[1] Wallace, Gregory K. "The JPEG still picture compression standard." IEEE transactions on consumer electronics 38.1 (1992): xviii-xxxiv.

[2] “Versatile Video Coding,” Standard, Rec. ITU-T H.266 and ISO/IEC 23090-3, Aug. 2020.

[3] Ballé, Johannes, et al. "Variational image compression with a scale hyperprior." ICLR 2018.

[4] Minnen, David, Johannes Ballé, and George D. Toderici. "Joint autoregressive and hierarchical priors for learned image compression." NeurIPS 2018.

[5] Wang, Zhou, Eero P. Simoncelli, and Alan C. Bovik. "Multiscale structural similarity for image quality assessment." The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003. Vol. 2. IEEE, 2003.

[6] Zhang, Richard, et al. "The unreasonable effectiveness of deep features as a perceptual metric." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[7] Mentzer, Fabian, et al. "High-fidelity generative image compression." Advances in Neural Information Processing Systems 33 (2020): 11913-11924.

[8] Agustsson, Eirikur, et al. "Multi-realism image compression with a conditional generator." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[9] Theis, Lucas, et al. "Lossy compression with gaussian diffusion." arXiv preprint arXiv:2206.08889 (2022).

[10] Yang, Ruihan, and Stephan Mandt. "Lossy image compression with conditional diffusion models." arXiv preprint arXiv:2209.06950 (2022).

Semantic communication, as a novel paradigm, centers its focus on the transmission of encoded semantic symbols rather than the raw message. Within this innovative framework, the symbols can encompass a broad spectrum of semantics, ranging from the sentiment conveyed in a sentence to the semantic segmentation map of an image. The recipient then interprets perceivable data from the decoded semantic symbols. This process aims to meet a certain level of Quality of Experience (QoE). In essence, semantic communication can be defined as a goal-oriented, semantic-based, real-time communication system [1].

A substantial portion of Internet traffic is dedicated to real-time visual applications. Therefore, semantic communication plays a crucial role in this domain. In general, the semantic communication literature treats the latent space of a DNN as a notion of implicit semantics [2]. Nevertheless, the latent space as-is demands a high bit rate. A more efficient strategy is to use explicit semantic information rather than relying solely on the implicit latent space. This strategy can potentially lower the bit rate requirements significantly. Jiang et al. transmitted only facial keypoints for semantic video communication in a head-and-shoulder teleconferencing setting [3]. However, keypoint-based approaches become infeasible for large-scale and general scenes. Huang et al. treated this as a semantic image synthesis problem and employed a spatially adaptive generative model to synthesize still frames from losslessly coded semantic maps [4]. The authors showed that visual semantic communication is possible in a more general setting.

The student is expected to take the information provided in this description as a basis and extend it. The literature survey does not have to be limited to communication applications, but it should include semantic frame synthesis methodologies, which will be compared and contrasted from the point of view of semantic video communication. Example points to examine include (but are not limited to) the potential bandwidth requirements and real-time capabilities of the methods.

Supervision:  Cem Eteke (  


[1] Strinati, Emilio Calvanese, and Sergio Barbarossa. "6G networks: Beyond Shannon towards semantic and goal-oriented communications." Computer Networks 190 (2021): 107930.

[2] Grassucci, Eleonora, et al. "Enhancing Semantic Communication with Deep Generative Models--An ICASSP Special Session Overview." arXiv preprint arXiv:2309.02478 (2023).

[3] Jiang, Peiwen, et al. "Wireless semantic communications for video conferencing." IEEE Journal on Selected Areas in Communications 41.1 (2022): 230-244.

[4] Huang, Danlan, et al. "Toward semantic communications: Deep learning-based image semantic coding." IEEE Journal on Selected Areas in Communications 41.1 (2022): 55-71.


Visual SLAM aims to simultaneously track the camera and reconstruct the surrounding environment from visual input. It is an extensively researched area in robotic perception. Recent works in visual SLAM benefit from generative neural networks [1], [2]. In this seminar, we plan to investigate cutting-edge works that leverage generative models for visual SLAM. We will then explore the applications of generative models, especially object removal and background inpainting [3]. A key assumption in visual SLAM is a static environment, which is often not satisfied in scenarios with dynamic elements (humans, pets). We will therefore evaluate how well these methods can be incorporated into visual SLAM, with a strong focus on efficiency and robustness.

Supervision:  Xin Su (


[1] Almalioglu, Y., Saputra, M. R. U., De Gusmao, P. P., Markham, A., & Trigoni, N. (2019, May). GANVO: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks. In 2019 International conference on robotics and automation (ICRA) (pp. 5474-5480). IEEE.

[2]  Chakravarty, P., Narayanan, P., & Roussel, T. (2019, May). GEN-SLAM: Generative modeling for monocular simultaneous localization and mapping. In 2019 International conference on robotics and automation (ICRA) (pp. 147-153). IEEE.

[3] Max, Lenord Melvix Joseph Stephen, "Visual Odometry Using Generative Artificial Intelligence", Technical Disclosure Commons, (August 25, 2023).


Recent advancements in diffusion-based models for image generation show impressive qualitative results and have been frequently used in artistic applications [1].

Recent works extend their success to other domains and tasks while considering additional aspects such as time. These include efforts to generate future video sequences or to generate feasible future procedures or action sequences [2,3].

Additionally, there are works that investigate the extension of generative models to a classification setup, which includes classifying and segmenting video sequences for action understanding [4,5].

The student is expected to understand and summarize the current state-of-the-art approaches in generative models, with a focus on diffusion-based models, that are used for action and video understanding. This entails explaining various network structures, training techniques, metrics for evaluation, and comparisons with previous works.

Supervision: Constantin Patsch (



[2] Wang, Hanlin, et al. "PDPP: Projected diffusion for procedure planning in instructional videos." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[3] Harvey, William, et al. "Flexible diffusion modeling of long videos." Advances in Neural Information Processing Systems 35 (2022): 27953-27965.

[4] Li, Alexander C., et al. "Your diffusion model is secretly a zero-shot classifier." arXiv preprint arXiv:2303.16203 (2023).

[5] Liu, Daochang, et al. "Diffusion action segmentation." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Recently, there has been a great leap in generative AI models due to the development of image diffusion models like Stable Diffusion [1]. This has led to the development of generative models for 3D human motion in tasks like action-to-motion [2] and text-to-motion [3]. While older methods used variational autoencoders [2], newer approaches use motion diffusion models [4], which have useful properties such as natural support for motion inpainting at test time. Another advancement was the use of a quantized feature space, which is an efficient and robust training method for auto-regressive models [5].

On the other hand, generative object interaction models have seen success in areas like human grasp action generation [6] and object-shape-based grasp generation [6].

The student is tasked with comprehending and summarizing the current state-of-the-art approaches in human and human-object motion generation tasks (action-to-motion, text-to-motion, and optionally image-to-motion). This will include comparing the different approaches on common datasets while understanding the quantitative metrics and qualitative user studies. The student can optionally run these models on similar inputs and compare them qualitatively. Finally, the student has to identify the best-performing models and justify this conclusion.

Supervision: Marsil Zakour (   


[1] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models.

[2] Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., & Cheng, L. (2020). Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia (pp. 2021–2029).

[3] Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., & Cheng, L. (2022). Generating Diverse and Natural 3D Human Motions From Text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 5152-5161).

[4] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., & Bermano, A. H. (2023). Human Motion Diffusion Model. In The Eleventh International Conference on Learning Representations.

[5] Chen, X., Jiang, B., Liu, W., Huang, Z., Fu, B., Chen, T., & Yu, G. (2023). Executing your Commands via Motion Diffusion in Latent Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18000–18010).

[6] Taheri, O., Choutas, V., Black, M., & Tzionas, D. (2022). GOAL: Generating 4D Whole-Body Motion for Hand-Object Grasping. In Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Taheri, O., Ghorbani, N., Black, M., & Tzionas, D. (2020). GRAB: A Dataset of Whole-Body Human Grasping of Objects. In Computer Vision – ECCV 2020 (pp. 581–600). Springer.