Media Technology Scientific Seminar
| Lecturer (assistant) | |
|---|---|
| Number | 0820906570 |
| Type | advanced seminar |
| Duration | 3 SWS |
| Term | Summer semester 2026 |
| Language of instruction | German |
| Position within curricula | See TUMonline |
| Dates | See TUMonline |
- 15.04.2026 13:15-14:45 0406, Seminarraum
- 22.04.2026 13:15-14:45 0406, Seminarraum
- 06.05.2026 13:15-14:45 0406, Seminarraum
- 13.05.2026 13:15-14:45 0406, Seminarraum
- 20.05.2026 13:15-14:45 0406, Seminarraum
- 27.05.2026 13:15-14:45 0406, Seminarraum
- 03.06.2026 13:15-14:45 0406, Seminarraum
- 10.06.2026 13:15-14:45 0406, Seminarraum
- 17.06.2026 13:15-14:45 0406, Seminarraum
- 24.06.2026 13:15-14:45 0406, Seminarraum
- 01.07.2026 13:15-14:45 0406, Seminarraum
- 08.07.2026 13:15-14:45 0406, Seminarraum
- 15.07.2026 13:15-14:45 0406, Seminarraum
Admission information
Objectives
Description
Prerequisites
Teaching and learning methods
The main teaching methods are:
- Computer-based presentations by the student
- The students mainly work with high-quality, recent scientific publications
Examination
- Interaction with the supervisor and working attitude (20%)
- Presentation (30 minutes) and discussion (15 minutes) (50%)
Recommended literature
Links
Embodied world models are emerging as a central paradigm for robotic learning, enabling agents to acquire predictive, action-conditioned representations of their environment that support planning and control [1]. In the context of learning from demonstration (LfD), these models offer a principled way to infer latent dynamics and behavioral structure directly from expert trajectories, reducing reliance on explicit reward design and large-scale interaction. By integrating perception and action within a shared predictive framework, embodied world models allow robots to generalize demonstrated behaviors to novel situations and to reason about the consequences of their actions in both real and simulated environments.
Recent advances have focused on combining self-supervised representation learning with generative world modeling, particularly through video-based approaches that leverage scalable visual data. These models learn from offline demonstrations by predicting future observations conditioned on actions, enabling imitation, trajectory synthesis, and planning in latent space. Key challenges include extracting actionable structure from passive demonstrations, maintaining temporal consistency and causal reasoning over long horizons, inferring dense rewards from sparse signals, embodiment-agnostic learning, and achieving efficient inference suitable for real-time robotic deployment. Approaches based on joint embedding predictive architectures and self-supervised visual pretraining have shown promise in addressing these challenges by learning compact, transferable representations of environment dynamics [2-4].
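To make the training setup concrete, the following is a minimal PyTorch sketch of an action-conditioned latent world model fit to offline demonstration transitions; the encoder, dynamics network, and dimensions are illustrative placeholders, not the architectures of [2-4].

```python
# Minimal sketch of an action-conditioned latent world model trained on
# offline demonstrations (illustrative; not the architecture of [2-4]).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps raw observations (here flattened features) to a compact latent."""
    def __init__(self, obs_dim=512, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class LatentDynamics(nn.Module):
    """Predicts the next latent state from the current latent and the action."""
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

encoder, dynamics = Encoder(), LatentDynamics()
optim = torch.optim.Adam(list(encoder.parameters()) + list(dynamics.parameters()), lr=1e-3)

# One gradient step on a batch of demonstration transitions (o_t, a_t, o_t+1).
obs_t, act_t, obs_tp1 = torch.randn(32, 512), torch.randn(32, 7), torch.randn(32, 512)
z_t, z_tp1 = encoder(obs_t), encoder(obs_tp1)
optim.zero_grad()
# Predict the future latent from the current latent and action; the target is
# detached here for simplicity (JEPA-style methods avoid collapse with, e.g.,
# EMA target encoders and additional regularization).
loss = nn.functional.mse_loss(dynamics(z_t, act_t), z_tp1.detach())
loss.backward()
optim.step()
```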
During the seminar, your tasks will include surveying and critically evaluating state-of-the-art methods in embodied world models for robotic learning from demonstration. You will be expected to explain how action-conditioned predictive models can be trained from or conditioned on demonstration data, analyze the role of latent representations for policy generalization and planning, and assess how these approaches compare to classical imitation learning and reinforcement learning baselines. Your presentation should further examine applications in robotics and autonomous systems, and conclude with an outlook on open challenges and future research directions in scalable, data-driven robot learning.
Supervision: Cem Eteke (cem.eteke@tum.de)
References:
[1] Dong, Jiahua, et al. "Learning to model the world: A survey of world models in artificial intelligence." (2027).
[2] Zhou, Gaoyue, et al. "DINO-WM: World models on pre-trained visual features enable zero-shot planning." arXiv preprint arXiv:2411.04983 (2024).
[3] Assran, Mido, et al. "V-JEPA 2: Self-supervised video models enable understanding, prediction and planning." arXiv preprint arXiv:2506.09985 (2025).
[4] Zheng, Ruijie, et al. "FLARE: Robot learning with implicit world modeling." arXiv preprint arXiv:2505.15659 (2025).
Traditional 3D computer vision models are restricted to closed-set vocabularies. They can only recognize predefined categories (e.g., car, pedestrian, chair) and require massive amounts of manually annotated 3D data. However, the real physical world is infinitely diverse. Open-vocabulary 3D scene understanding addresses this bottleneck by extracting generalized knowledge from 2D foundation models (such as CLIP [1] or SAM [2], trained on billions of internet images) and mathematically lifting or distilling it into 3D space. This capability is a major leap forward for embodied AI and autonomous systems. It allows a household robot to locate a "spilled cup of coffee" or an autonomous vehicle to safely navigate around an "unusual construction debris object," even if the system has never been explicitly trained on those specific 3D items.
While OpenScene [3] established the classic distillation paradigm by projecting and fusing features from 2D foundation models into 3D point clouds, recent works [4][5] focus on transferring open-vocabulary understanding to modern Neural Radiance Field (NeRF) and 3D Gaussian Splatting (3DGS) representations. Further works [6] integrate object functional knowledge into the 3D scene, allowing robots to interact with objects more intelligently.
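The core lifting step can be illustrated with a short sketch: per-pixel features from a 2D foundation model are back-projected onto 3D points via known camera geometry and then queried with a text embedding. This is a simplified, hypothetical version of the distillation idea, not the exact OpenScene [3] pipeline.

```python
# Minimal sketch of 2D-to-3D feature lifting and open-vocabulary querying
# (illustrative; not the exact OpenScene pipeline).
import numpy as np

def lift_features(points, pixel_feats, K, T_world_to_cam):
    """points: (N,3) world coords; pixel_feats: (H,W,D) per-pixel 2D features;
    K: (3,3) camera intrinsics; T_world_to_cam: (4,4) extrinsics.
    Returns per-point features of shape (N,D)."""
    N, (H, W, D) = len(points), pixel_feats.shape
    homog = np.concatenate([points, np.ones((N, 1))], axis=1)           # (N,4)
    cam = (T_world_to_cam @ homog.T).T[:, :3]                           # camera frame
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)                    # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)   # in front of camera & inside image
    point_feats = np.zeros((N, D), dtype=pixel_feats.dtype)
    point_feats[valid] = pixel_feats[v[valid], u[valid]]                # copy 2D feature onto 3D point
    return point_feats

def open_vocab_query(point_feats, text_emb):
    """Cosine similarity between lifted point features and a CLIP-style text embedding,
    e.g. for the query 'spilled cup of coffee'."""
    pf = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + 1e-8)
    te = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    return pf @ te   # (N,) relevance score per 3D point
```

In practice, features from many views are fused (e.g., averaged) per point, but the projection-and-query structure stays the same.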
In this seminar, we will review the latest state-of-the-art papers in this specific domain. We will compare their core contributions and explore potential architectural improvements or novel application pipelines for future intelligent systems.
Supervision: Zhifan Ni (zhifan.ni@tum.de)
References:
[1] A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in PMLR 139:8748-8763, 2021.
[2] N. Carion et al., “SAM 3: Segment Anything with Concepts,” in ICLR 2027.
[3] S. Peng et al., “OpenScene: 3D Scene Understanding with Open Vocabularies,” in CVPR 2023.
[4] J. Kerr et al., “LERF: Language Embedded Radiance Fields,” in ICCV 2023.
[5] M. Qin et al., “LangSplat: 3D Language Gaussian Splatting,” in CVPR 2024.
[6] C. Zhang, “Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces,” in CVPR 2026.
Recent progress in diffusion models and large foundation models has opened a promising new direction for image and video compression, especially at very low bitrates. Instead of relying only on compact latent representations, these methods exploit strong generative priors to reconstruct visually plausible content from limited transmitted information [1,2,3,4,5].
This seminar topic focuses on recent compression approaches that combine latent coding with diffusion-based or foundation-model-based reconstruction. Conditional diffusion models can use compressed latent variables to preserve image content while synthesizing texture at the decoder [1]. Large foundation models can further improve ultra-low bitrate compression by injecting multimodal knowledge, such as visual and language priors, into the reconstruction process [2]. In video compression, diffusion models can also exploit temporal context from previous frames to improve perceptual quality and decoding efficiency [3]. Recent work additionally shows that one-step diffusion can make such methods much more practical for real-time use [4].
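A schematic sketch of the decoder side illustrates the idea: starting from noise, a pretrained denoiser is run in a reverse-diffusion loop, conditioned at every step on the transmitted compressed latent. The linear noise schedule and the `eps_model(x_t, t, latent)` interface below are assumptions for illustration, not the exact samplers of [1] or [4].

```python
# Schematic sketch of decoder-side reconstruction in latent-conditioned
# diffusion compression (illustrative; not the exact samplers of [1] or [4]).
import torch

def decode_with_diffusion(eps_model, compressed_latent, steps=50, shape=(1, 3, 256, 256)):
    """Reverse-diffusion loop: the transmitted compressed latent conditions every
    denoising step, so the decoder synthesizes texture consistent with the content."""
    betas = torch.linspace(1e-4, 0.02, steps)              # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                  # start from pure noise
    for t in reversed(range(steps)):
        eps = eps_model(x, torch.tensor([t]), compressed_latent)   # noise prediction, conditioned on latent
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise             # DDPM-style ancestral update
    return x  # reconstructed image
```

The one-step variants mentioned in [4] collapse this loop into a single denoiser call, trading some generative flexibility for decoding speed.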
During the seminar, your tasks will include studying and comparing these recent approaches, explaining their main principles, and discussing the trade-off between bitrate, distortion, perceptual quality, and decoding complexity. You should also identify open research directions, such as embedding compression, token pruning, efficient attention mechanisms, and foundation-model-driven video compression. The goal is to provide an overview of this emerging research area and its future potential.
Supervision: Serdar Caglar (serdar.caglar@tum.de)
References:
[1] Yang, R., & Mandt, S. (2023). Lossy image compression with conditional diffusion models. Advances in Neural Information Processing Systems, 36, 64971-64995.
[2] Gao, J., Huang, Z., Mao, Q., Ma, S., & Jia, C. (2026). Exploring multimodal knowledge for image compression via large foundation models. IEEE Transactions on Image Processing.
[3] Ma, W., & Chen, Z. (2026). Diffusion-based perceptual neural video compression with temporal diffusion information reuse. ACM Transactions on Multimedia Computing, Communications and Applications, 21(12), 1-22.
[4] Zhang, T., Luo, X., Li, L., & Liu, D. (2025). StableCodec: Taming one-step diffusion for extreme image compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 17379-17389).
[5] Yang, Y., & Mandt, S. (2026). Advances in Diffusion-Based Generative Compression. arXiv preprint arXiv:2601.18932.
Learned image compression has become a powerful alternative to traditional codecs by leveraging neural networks to optimize rate–distortion performance. Most existing approaches are based on autoencoder architectures combined with entropy models, which achieve strong compression efficiency but are typically tailored to specific data distributions and lack generalization capability [1].
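The optimization target behind these autoencoder-based codecs is the rate-distortion Lagrangian L = R + λD. The sketch below shows how it is typically estimated during training, with the analysis/synthesis transforms and the entropy model left as placeholders; this is illustrative, not the exact formulation of [1].

```python
# Minimal sketch of the rate-distortion objective behind autoencoder-based
# learned compression (illustrative; transforms and entropy model are placeholders).
import torch
import torch.nn.functional as F

def rate_distortion_loss(x, encoder, decoder, entropy_model, lmbda=0.01):
    """L = R + lambda * D: estimated bits of the quantized latent plus
    weighted reconstruction distortion."""
    y = encoder(x)                                    # analysis transform
    y_hat = y + (torch.round(y) - y).detach()         # straight-through quantization
    likelihoods = entropy_model(y_hat)                # P(y_hat) under a learned prior
    rate = -torch.log2(likelihoods).sum() / x.numel() # average bits per input element
    x_hat = decoder(y_hat)                            # synthesis transform
    distortion = F.mse_loss(x_hat, x)                 # distortion term (MSE here)
    return rate + lmbda * distortion
```

Sweeping `lmbda` traces out the rate-distortion curve; perceptual codecs replace or augment the MSE term with learned perceptual or adversarial losses.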
Foundation models, trained on large-scale image datasets using self-supervised objectives, offer a promising direction for more generalizable compression. By learning rich and transferable visual representations, they can be adapted to downstream compression tasks and improve robustness under distribution shifts while reducing the need for extensive retraining [2, 3].
In parallel, world models based on generative approaches, such as diffusion or autoregressive models, reinterpret compression as a conditional generation problem. Instead of transmitting all visual details, compact representations are used to guide reconstruction, where the model generates perceptually plausible content based on learned data priors [4].
During the seminar, you will survey recent advances in learned image compression, analyze the role of foundation and generative models, and compare them with classical codecs (e.g., JPEG, VVC). The seminar will conclude with a discussion of open challenges, including computational complexity, perceptual quality, and practical deployment.
Supervision: Zongxie Chen (zongxie.chen@tum.de)
References:
[1] Minnen, D., Ballé, J., & Toderici, G. D. (2018). Joint autoregressive and hierarchical priors for learned image compression. Advances in neural information processing systems, 31.
[2] Gao, J., Huang, Z., Mao, Q., Ma, S., & Jia, C. (2026). Exploring multimodal knowledge for image compression via large foundation models. IEEE Transactions on Image Processing.
[3] Shen, R., Wu, H., Zhang, W., Hu, J., & Gunduz, D. (2026, August). Compression beyond pixels: Semantic compression with multimodal foundation models. In 2026 IEEE 35th International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.
[4] Relic, L., Azevedo, R., Gross, M., & Schroers, C. (2024, September). Lossy image compression with foundation diffusion models. In European Conference on Computer Vision (pp. 303-319). Cham: Springer Nature Switzerland.
Tactile sensing is an essential yet underutilized modality in intelligent systems, providing information about physical interactions such as force, texture, and contact dynamics. Because it captures direct contact information during manipulation, it is particularly valuable for contact-rich robotics tasks. As AI systems move toward embodied intelligence, integrating tactile perception into learning frameworks is becoming increasingly important [1].
Recent foundation-model approaches to robot learning, in particular Vision-Language-Action models (VLAs), aim to learn policies that generalize across diverse tasks, objects, embodiments, and environments [2]. Tactile-informed foundation models offer a promising pathway to tactile-aware robotic manipulation, particularly for contact-rich and dexterous tasks. Integrating tactile sensing into multimodal foundation models also shows promise for physical grounding and robustness [1, 3, 4].
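One common integration pattern is to project tactile readings into the same token space as vision and language and let a transformer policy attend over all modalities jointly. The following is a hedged sketch of that pattern with made-up dimensions; it is not the architecture of [1], [3], or [4].

```python
# Hedged sketch of fusing tactile readings into a VLA-style transformer policy
# as extra input tokens (illustrative; dimensions are arbitrary).
import torch
import torch.nn as nn

class TactileAwarePolicy(nn.Module):
    def __init__(self, d_model=256, action_dim=7, tactile_dim=48):
        super().__init__()
        self.tactile_proj = nn.Linear(tactile_dim, d_model)     # tactile readings -> tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, vision_tokens, language_tokens, tactile_signal):
        # vision_tokens: (B, Nv, d_model), language_tokens: (B, Nl, d_model),
        # tactile_signal: (B, Nt, tactile_dim)
        tactile_tokens = self.tactile_proj(tactile_signal)
        tokens = torch.cat([vision_tokens, language_tokens, tactile_tokens], dim=1)
        fused = self.backbone(tokens)                           # cross-modal self-attention
        return self.action_head(fused.mean(dim=1))              # pooled features -> action

policy = TactileAwarePolicy()
action = policy(torch.randn(1, 16, 256), torch.randn(1, 8, 256), torch.randn(1, 4, 48))
```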
During the seminar, you will survey recent foundation-model approaches that incorporate tactile information, analyze how tactile sensing contributes to learning contact-rich manipulation policies, and compare these approaches both to earlier vision-tactile robot learning methods (to isolate the impact of foundation models) and to robot learning without tactile input (to isolate the impact of tactile information). Your survey should also explore challenges such as multimodal fusion, data scarcity, and sensor and actuator variability, and discuss future research directions toward fully embodied, tactile-aware intelligent systems.
Supervision: Emre Faik Gökçe (emrefaik.goekce@tum.de)
References:
[1] Huang, Jialei, et al. "Tactile-VLA: Unlocking vision-language-action model's physical knowledge for tactile generalization." arXiv preprint arXiv:2507.09160 (2025).
[2] Kawaharazuka, Kento, et al., "Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications," in IEEE Access, vol. 13, pp. 162467-162504, 2025.
[3] Cheng, Zhengxue, et al. "OmniVTLA: Vision-tactile-language-action model with semantic-aligned tactile sensing." arXiv preprint arXiv:2508.08706 (2025).
[4] Zhang, Kaidi, et al. "TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation." arXiv preprint arXiv:2603.12665 (2026).
Indoor localization using radio frequency (RF) signals is a key research direction for emerging 6G intelligent systems, enabling precise positioning in GPS-denied environments such as factories, hospitals, and smart buildings. Traditional RF-based approaches treat multipath propagation — the phenomenon whereby signals reach the receiver via multiple reflected and scattered paths — as a source of interference degrading accuracy. A growing body of research instead reframes multipath components (MPCs) as rich geometric information carriers that implicitly encode the structure of the propagation environment [1].
Foundation models (large, pre-trained neural networks that can be fine-tuned for specific downstream tasks) have recently emerged as a powerful paradigm for representing wireless channels. Self-supervised pre-training on unlabeled channel impulse responses (CIRs) or channel state information (CSI) enables models to learn generalizable representations of the radio environment, substantially reducing the need for expensive labeled measurement campaigns [1]. Complementarily, digital twins (DTs) — physics-consistent virtual replicas of the radio environment generated via ray tracing — serve as world models that capture how electromagnetic waves interact with the geometry of indoor spaces and can dramatically reduce real-world data collection overhead for positioning systems [2,3].
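As a concrete illustration of such pre-training, the sketch below masks a fraction of the taps of an unlabeled channel impulse response and trains a transformer to reconstruct them; the masking ratio, dimensions, and backbone are illustrative assumptions rather than the exact recipe of [2].

```python
# Minimal sketch of self-supervised masked-reconstruction pre-training on
# channel impulse responses (in the spirit of [2]; all settings illustrative).
import torch
import torch.nn as nn

class ChannelMAE(nn.Module):
    """Mask a fraction of CIR taps and learn to reconstruct them from context."""
    def __init__(self, d_model=128, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(2, d_model)            # (real, imag) per tap -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, 2)             # reconstruct (real, imag)

    def forward(self, cir):                           # cir: (B, n_taps, 2)
        tokens = self.embed(cir)
        mask = torch.rand(cir.shape[:2], device=cir.device) < self.mask_ratio
        tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked taps
        recon = self.head(self.encoder(tokens))
        return ((recon - cir) ** 2)[mask].mean()      # loss only on masked positions

model = ChannelMAE()
loss = model(torch.randn(16, 128, 2))                 # one batch of unlabeled CIRs
loss.backward()
```

The pre-trained encoder can then be fine-tuned with a small labeled set (or with digital-twin-generated data) for the downstream positioning task.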
During the seminar, your tasks will include surveying and critically evaluating the state of the art in foundation models and world models applied to RF-based indoor localization. You will be expected to explain the key principles underlying self-supervised pre-training on radio channel data, analyze the role of digital twin-based environment representations for training data generation and system adaptation, and assess how these approaches compare to traditional fingerprinting baselines. Your presentation should conclude with an outlook on open challenges and promising research directions in the context of 6G positioning.
Supervision: Majdi Abdmoulah (Majdi.abdmoulah@tum.de)
References:
[1] K. Witrisal et al., “High-accuracy localization for assisted living: 5G systems will turn multipath channels from foe to friend,” IEEE Signal Process. Mag., vol. 33, no. 1, pp. 59–70, Jan. 2016.
[2] J. Ott, G. Pirkl, and P. Lukowicz, “Radio Foundation Models: Pre-training Transformers for 5G-based Indoor Localization,” in Proc. IPIN, Oct. 2024. arXiv:2410.00617.
[3] L. U. Khan et al., “Digital Twin of Wireless Systems: Overview, Taxonomy, Challenges, and Opportunities,” IEEE Commun. Surveys Tuts., vol. 24, no. 4, pp. 2230–2254, 2022.
[4] A. Alkhateeb, S. Jiang, and G. Charan, “Real-time digital twins: Vision and research directions for 6G and beyond,” IEEE Commun. Mag., vol. 61, no. 11, pp. 128–134, Nov. 2023.
Eye gaze, i.e., where a person is looking, is a rich behavioral signal with applications spanning human-computer interaction, assistive technology, and extended reality. State-of-the-art wearable trackers rely on infrared cameras and corneal reflection geometry, an approach that is sensitive to illumination conditions and computationally demanding in a constrained form factor [1]. Ultrasonic sensing, in which acoustic pulses are transmitted toward the eye and echoes are collected across a sparse array of MEMS transducers, offers a compelling alternative that is illumination-robust and inherently suited to wearable integration [2].
The dominant paradigm, appearance-based deep learning, has matured considerably, with architectures evolving from single-eye CNNs to multi-stream networks that fuse face and eye features via attention mechanisms and transformer-based designs. Foundation models have recently entered this space, e.g., Gaze-LLE [3], which estimates gaze targets on top of large-scale pretrained visual encoders. A parallel line of model-based research constructs explicit world models of eye geometry: 3D morphable eye region models, built from high-quality head scans and combined with anatomy-based eyeball geometry, provide physically interpretable representations that generalize across illumination conditions and head poses [4].
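The model-based idea can be reduced to a small geometric sketch: once an anatomical eye model has been fit, the gaze ray follows from the estimated eyeball and pupil centers and can be intersected with a screen plane. The functions and numbers below are illustrative only (e.g., the angular offset between optical and visual axis is ignored).

```python
# Minimal geometric sketch of model-based gaze read-out (illustrative;
# kappa offset between optical and visual axis is ignored, values are arbitrary).
import numpy as np

def gaze_direction(eyeball_center, pupil_center):
    """Unit vector of the optical axis, given both centers in the same 3D frame."""
    d = np.asarray(pupil_center, float) - np.asarray(eyeball_center, float)
    return d / np.linalg.norm(d)

def gaze_point_on_plane(origin, direction, plane_z):
    """Intersect the gaze ray with a fronto-parallel screen plane at depth plane_z."""
    t = (plane_z - origin[2]) / direction[2]
    return origin + t * direction

g = gaze_direction(eyeball_center=[0.0, 0.0, 0.0], pupil_center=[0.002, 0.001, 0.012])
print(gaze_point_on_plane(np.zeros(3), g, plane_z=0.5))  # point of regard on a screen 0.5 m away
```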
During the seminar, you will survey and critically evaluate deep learning architectures for gaze estimation, from CNNs to foundation models. You will be expected to explain the principles underlying 3D morphable eye models and their role in physics-informed world models for data synthesis and domain adaptation. A particular focus should be a critical analysis of which architectural choices, pre-training strategies, and geometric priors transfer across sensing modalities, and which must be redesigned when the input is no longer an image but an acoustic transfer function [5]. Your presentation should conclude with an outlook on open challenges and promising research directions in the context of audio-visual gaze estimation.
Supervision: Gautam Vishwapriya (vishwapriyagautam@tum.de)
References:
[1] S. Ghosh et al., “Automatic Gaze Analysis: A Survey of Deep Learning Based Approaches,” IEEE TPAMI, vol. 46, no. 1, pp. 61-84, 2024.
[2] A. Golard and S. S. Talathi, "Ultrasound for Gaze Estimation — A Modelling and Empirical Study," Sensors, vol. 21, no. 13, Art. no. 4502, 2021.
[3] F. Ryan et al., “Gaze-LLE: Gaze Target Estimation via Large Scale Learned Encoders,” CVPR, pp. 28874-28884, 2026.
[4] E. Wood et al., "A 3D Morphable Eye Region Model for Gaze Estimation," in Proc. ECCV, pp. 297-313, 2016.
[5] R. Liu et al., "PnP-GA+: Plug-and-Play Domain Adaptation for Gaze Estimation Using Model Variants," IEEE TPAMI, vol. 46, no. 5, pp. 3707-3721, 2024.