Media Technology Scientific Seminar
| Lecturer (assistant) | |
| --- | --- |
| Number | 0820906570 |
| Type | Advanced seminar |
| Duration | 3 SWS |
| Term | Summer semester 2023 |
| Language of instruction | German |
| Position within curricula | See TUMonline |
| Dates | See TUMonline |
Umbrella topic for SS23: "From Attention to Transformers: A Journey through Sequence Models."
The kick-off meeting for the seminar takes place on 19.04.2023 at 13:15 in Seminar Room 0406.
Attendance is mandatory to get a fixed place in the course!
This semester, the media technology scientific seminar focuses on Transformers. The aim is to investigate their potential, recent advancements, and future directions in various application domains. More details will be provided during the kick-off meeting.
This topic explores pre-training techniques for 3D transformers with the goal of improving their performance on various 3D vision tasks. Specifically, it surveys a set of pre-training techniques and evaluates their effectiveness across a range of datasets and downstream tasks. It also investigates how factors such as the size of the pre-training dataset, the choice of pre-training task, and the architecture of the 3D transformer affect the final performance. The results are expected to provide insights into best practices for pre-training 3D transformers, which can be leveraged to improve state-of-the-art performance on various 3D vision tasks.
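To make the idea more concrete, below is a minimal sketch of a BERT-style masked point modeling objective in the spirit of Point-BERT [3], assuming a PyTorch setup: local point patches are embedded into tokens, a random subset is replaced by a mask token, and a small transformer encoder is trained to reconstruct the masked patches. All module names, hyperparameters, and the plain L2 reconstruction loss are illustrative simplifications, not the actual method of [3].

```python
import torch
import torch.nn as nn

class MaskedPointPretrainer(nn.Module):
    """Toy masked-point-modeling objective (illustrative, not Point-BERT itself)."""

    def __init__(self, num_patches=64, patch_points=32, dim=256, mask_ratio=0.4):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Embed each local point patch (patch_points x 3 coordinates) into a token.
        self.patch_embed = nn.Linear(patch_points * 3, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Reconstruct the raw coordinates of each (masked) patch.
        self.head = nn.Linear(dim, patch_points * 3)

    def forward(self, patches):                      # patches: (B, N, P*3)
        B, N, _ = patches.shape
        tokens = self.patch_embed(patches) + self.pos_embed
        mask = torch.rand(B, N, device=patches.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), tokens)
        recon = self.head(self.encoder(tokens))
        # Loss only on masked patches (a simple L2 stand-in for a Chamfer-type loss).
        return ((recon - patches) ** 2).mean(-1)[mask].mean()

# Usage: pre-train on unlabeled point clouds, then fine-tune the encoder downstream.
loss = MaskedPointPretrainer()(torch.randn(2, 64, 32 * 3))
loss.backward()
```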
Supervision: Adam Misik (adam.misik@tum.de)
References:
[1] Lu, Dening, et al. "Transformers in 3d point clouds: A survey." arXiv preprint arXiv:2205.07417 (2022).
[2] Hou, Ji, et al. "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors." arXiv preprint arXiv:2302.14746 (2023).
[3] Yu, Xumin, et al. "Point-bert: Pre-training 3d point cloud transformers with masked point modeling." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
In recent years, many different approaches have been proposed that apply deep learning to image compression. One architecture that is particularly interesting for this application is the transformer.
For this topic, you will look into these transformer-based architectures, starting with those presented in [1] and [2], and compare them to other deep learning approaches.
Supervision: Lars Nockenberg (lars.nockenberg@tum.de)
References:
[1] A. A. Jeny, M. Shah Junayed, and M. B. Islam, “An Efficient End-To-End Image Compression Transformer,” in 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France: IEEE, Oct. 2022, pp. 1786–1790. doi: 10.1109/ICIP46576.2022.9897663.
[2] R. Zou, C. Song, and Z. Zhang, “The Devil Is in the Details: Window-based Attention for Image Compression,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA: IEEE, Jun. 2022, pp. 17471–17480. doi: 10.1109/CVPR52688.2022.01697.
[3] C. Christopoulos, A. Skodras and T. Ebrahimi, "The JPEG2000 still image coding system: an overview," in IEEE Transactions on Consumer Electronics, vol. 46, no. 4, pp. 1103-1127, Nov. 2000, doi: 10.1109/30.920468.
Image generation involves many sub-tasks, including image in-/outpainting, image editing, and the currently very popular text-to-image generation. Previous works often rely on Generative Adversarial Networks (GANs), which are, however, difficult to train. Recently, several Transformer-based models have shown remarkable performance. For instance, Stable Diffusion [1] combines a diffusion model with cross-attention layers to generate high-resolution images. MaskGIT [2] and Muse [3] introduce a bidirectional Transformer for image generation that is trained with a mask-prediction strategy (similar to BERT). ViTGAN [4] improves GANs by using a Vision Transformer (ViT) as both the generator and the discriminator.
In this seminar, we will explore the use of Transformers and the attention mechanism in the most recent works on image generation. The student is expected to review and compare at least two different approaches.
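As a small illustration of the cross-attention conditioning used in latent diffusion models such as [1], the sketch below shows how image (latent) tokens can attend to text-prompt embeddings. It is a generic, simplified example built on a standard PyTorch attention module, not the actual Stable Diffusion code, and the dimensions are merely plausible placeholders.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Image tokens (queries) attend to text embeddings (keys/values)."""

    def __init__(self, dim=320, context_dim=768, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads,
                                          kdim=context_dim, vdim=context_dim,
                                          batch_first=True)

    def forward(self, image_tokens, text_embeddings):
        # Queries come from the image, keys/values from the text prompt,
        # so every spatial location can "look up" relevant prompt tokens.
        attended, _ = self.attn(self.norm(image_tokens), text_embeddings, text_embeddings)
        return image_tokens + attended     # residual connection

# Usage: a 64x64 latent flattened to 4096 tokens, conditioned on 77 text tokens
# (e.g. the output shape of a CLIP-style text encoder).
block = CrossAttentionBlock()
latent = torch.randn(1, 4096, 320)
prompt = torch.randn(1, 77, 768)
out = block(latent, prompt)
```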
Supervision: Zhifan Ni (zhifan.ni@tum.de)
References:
[1] Rombach et al, High-Resolution Image Synthesis with Latent Diffusion Models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10684-10695.
[2] Chang et al, MaskGIT: Masked Generative Image Transformer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11305-11315.
[3] Chang et al, Muse: Text-To-Image Generation via Masked Generative Transformers. ArXiv preprint. URL: https://muse-model.github.io/
[4] Lee et al, ViTGAN: Training GANs with Vision Transformers. 2022 International Conference on Learning Representations (ICLR).
Since the inception of transformers and the underlying attention architecture [1], their use in computer vision problems has increased rapidly, most prominently in the form of the Vision Transformer (ViT) [2]. As a natural extension, ViT architectures have found their use in image synthesis [3]. Approaches that tackle image synthesis with transformers can be formulated in both GAN and diffusion settings. On top of that, thanks to the attention mechanism, these models have also been shown to be quite powerful in conditional image synthesis [4]. In this topic, the student will be expected to dive into these models and compare and contrast how transformers can be used for conditional image synthesis.
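Since this topic builds directly on the attention operation from [1], here is the scaled dot-product attention at the core of every transformer block, written out as a minimal sketch; the tensor shapes are illustrative only.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as defined in [1]."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarity of queries and keys
    weights = torch.softmax(scores, dim=-1)         # each query distributes attention over all keys
    return weights @ v                              # values aggregated according to those weights

# Self-attention over 16 tokens of dimension 64; for conditional synthesis (cross-attention),
# q would come from one modality (e.g. image tokens) and k, v from another (e.g. text tokens).
x = torch.randn(1, 16, 64)
out = scaled_dot_product_attention(x, x, x)         # shape (1, 16, 64)
```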
Supervision: Cem Eteke (cem.eteke@tum.de)
References:
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[2] Han, Kai, et al. "A survey on vision transformer." IEEE transactions on pattern analysis and machine intelligence 45.1 (2022): 87-110.
[3] Esser, Patrick, Robin Rombach, and Bjorn Ommer. "Taming transformers for high-resolution image synthesis." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.
[4] Rombach, Robin, et al. "High-resolution image synthesis with latent diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
Neural-network-based methods have been widely explored in recent work on image enhancement, especially transformer architectures, whose attention mechanism can leverage the local neighbourhood relations of regions of interest in images. High dynamic range (HDR) imaging enables better perceptual quality, but at the same time requires tone mapping for display and colour reproduction on consumer devices. The need to show HDR imagery has led to the development of various transformer-based approaches to image enhancement and HDR tone mapping [1]. Lately, many researchers have explored transformers for HDR image enhancement: some propose lightweight image enhancement models [2][3], some optimise for local details and high-level features [4], and some explore bidirectional information exchange between global and local features [5]. Other researchers have explored HDR tone mapping and inverse tone mapping with different network structures [6].
The student is required to understand and summarise the state-of-the-art approaches to HDR image enhancement and tone mapping.
This includes explaining the different network structures, optimisation goals, and evaluation metrics, and comparing them to previous work. The student is encouraged to read and include further related research papers before presenting their findings in a seminar paper and presentation.
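To make the role of tone mapping concrete, the snippet below applies a simple global Reinhard-style operator that compresses HDR luminance into the displayable range. It is a classical, hand-crafted baseline shown purely for illustration, not one of the learned methods surveyed above.

```python
import numpy as np

def reinhard_tone_map(hdr, gamma=2.2):
    """Global Reinhard-style operator: L_d = L / (1 + L), then gamma encoding for display."""
    # Luminance from linear RGB (Rec. 709 weights).
    lum = 0.2126 * hdr[..., 0] + 0.7152 * hdr[..., 1] + 0.0722 * hdr[..., 2]
    lum_d = lum / (1.0 + lum)                        # compress [0, inf) into [0, 1)
    scale = lum_d / np.maximum(lum, 1e-8)            # preserve chromaticity by scaling all channels
    ldr = np.clip(hdr * scale[..., None], 0.0, 1.0)
    return ldr ** (1.0 / gamma)                      # gamma-encode for an SDR display

# Usage: a synthetic HDR image with values well above 1.0.
hdr_image = np.random.rand(256, 256, 3).astype(np.float32) * 10.0
ldr_image = reinhard_tone_map(hdr_image)             # values now in [0, 1]
```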
Supervision: Hongjie You (hongjie.you@tum.de)
References:
[1] Han, Xueyu, Ishtiaq Rasool Khan, and Susanto Rahardja. "High Dynamic Range Image Tone Mapping: Literature review and performance benchmark." Digital Signal Processing (2023): 104015.
[2] Zhang, Zhaoyang, et al. "Star: A structure-aware lightweight transformer for real-time image enhancement." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
[3] Ma, Tian, et al. "TPE: Lightweight Transformer Photo Enhancement Based on Curve Adjustment." IEEE Access 10 (2022): 74425-74435.
[4] Li, Zinuo, et al. "WavEnhancer: Unifying Wavelet and Transformer for Image Enhancement." arXiv preprint arXiv:2212.08327 (2022).
[5] Zhou, Kun, et al. "Mutual Guidance and Residual Integration for Image Enhancement." arXiv preprint arXiv:2211.13919 (2022).
[6] Yao, Mingde, et al. "Bidirectional Translation Between UHD-HDR and HD-SDR Videos." IEEE Transactions on Multimedia (2023).
Neural rendering is a field of research that aims to generate realistic images and videos using deep learning techniques [1]. Although initially introduced for natural language processing, transformers [2] have also shown promise in modeling the spatial and temporal relationships between pixels or voxels in images and videos. The application of transformers in neural rendering has led to significant advances in tasks such as image synthesis [3], video prediction [4], and 3D shape reconstruction [5]. In this topic, the student will be expected to research the impact of transformers in the neural rendering area.
Supervision: Furkan Mert Algan (fmert.algan@tum.de)
References:
[1] Tewari, Ayush, et al. "Advances in neural rendering." Computer Graphics Forum. Vol. 41. No. 2. 2022.
[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
[3] Lin, Kai-En, et al. "Vision transformer for nerf-based view synthesis from a single input image." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.
[4] Kwon, Youngjoong, et al. "Neural human performer: Learning generalizable radiance fields for human performance rendering." Advances in Neural Information Processing Systems 34 (2021): 24741-24752.
[5] Reizenstein, Jeremy, et al. "Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Due to the success of Transformer models in NLP, recent vision models also leverage Transformers and the underlying self-attention mechanism to extract meaningful representations.
Current work also shows that Transformers are well suited to the action anticipation/prediction problem. You will investigate recent methods that apply transformer models to such prediction tasks.
Supervision: Constantin Patsch (constantin.patsch@tum.de)
Transformers are intensively researched in many computer vision tasks such as image classification, segmentation, and 3D analysis [1]. In this seminar, we aim to explore the concepts and applications of transformers in visual SLAM and visual localization tasks. The purpose of this seminar is threefold. Firstly, we will clarify the fundamental concepts behind the success of transformers [2][3], e.g. self-attention. Secondly, we will investigate state-of-the-art methods that adopt transformers for visual SLAM tasks, including feature detection [4], feature matching [5], re-localization [6], etc. Finally, we will investigate one of these state-of-the-art methods in depth, covering its network design, evaluation metrics, comparison to other methods, and an analysis of its strengths and limitations.
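As a rough illustration of how transformer-based matchers such as LoFTR [4] turn dense local features into correspondences, the sketch below computes a similarity matrix between two feature sets and applies a dual-softmax to obtain soft matching probabilities. It is a simplified stand-in for the actual pipelines in [4][5], with arbitrary feature dimensions.

```python
import torch
import torch.nn.functional as F

def dual_softmax_matching(feat_a, feat_b, temperature=0.1):
    """Soft mutual matching between two sets of L2-normalized local features."""
    feat_a = F.normalize(feat_a, dim=-1)             # (N, D) features from image A
    feat_b = F.normalize(feat_b, dim=-1)             # (M, D) features from image B
    sim = feat_a @ feat_b.t() / temperature          # (N, M) similarity matrix
    # Dual-softmax: a pair scores high only if it is likely in both matching directions.
    prob = sim.softmax(dim=0) * sim.softmax(dim=1)
    matches = prob.argmax(dim=1)                     # best candidate in B for each feature in A
    return prob, matches

# Usage with random 256-D descriptors standing in for transformer-refined features.
prob, matches = dual_softmax_matching(torch.randn(500, 256), torch.randn(480, 256))
```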
Supervision: Xin Su (xin.su@tum.de)
References:
[1] Khan, S., Naseer, M., Hayat, M., Zamir, S. W., et. al. (2022). Transformers in vision: A survey. ACM computing surveys (CSUR), 54(10s), 1-41.
[2] Vaswani, A., Shazeer, N., Parmar, N., et. al.(2017). Attention is all you need. Advances in neural information processing systems, 30.
[3] Dosovitskiy, A., Beyer, L., Kolesnikov, A., et. al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[4] Sun, J., Shen, Z., Wang, Y., et. al. (2021). LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 8922-8931).
[5] Xie, T., Dai, K., Wang, K., et. al. (2023). DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching. arXiv preprint arXiv:2301.02993.
[6] Zhang, C., Liwicki, S., & Cipolla, R. (2022). Beyond the CLS Token: Image Reranking using Pre-trained Vision Transformers.
Recent advancements in learned image compression have surpassed traditional hand-engineered algorithms such as JPEG [1], and have even achieved rate-distortion performance comparable to modern video coding standards like VVC [2]. Among the successful approaches, autoencoders based on [3, 4] have shown promising results, where the entropy of the latent elements is jointly modeled and minimized with an image distortion metric. The entropy modeling is based on two principles: forward and backward adaptation [5]. The former employs a hyperprior estimator that utilizes a signaled information source, while the latter implements a context model that utilizes previously decoded symbols for entropy estimation without additional signaling.
One widely used deep learning technique in computer vision is attention, which allows neural networks to selectively focus on relevant parts of the input and suppress irrelevant ones. Unlike convolutional networks, attention-based models such as transformers offer greater adaptivity to input due to their dynamic receptive field. As a result, researchers are actively exploring transformer-based encoder/decoder [6, 7, 8], hyperprior [9, 10], and entropy/context model architectures [10, 11] to develop state-of-the-art learned image compression algorithms.
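To connect these concepts, the following is a minimal sketch of the rate-distortion objective, R + λ·D, that such autoencoders optimize: the rate term is the estimated entropy (in bits per pixel) of the quantized latents and the signaled hyper-latents, and the distortion term is an image metric such as MSE. The tensor shapes and variable names are generic placeholders rather than the architectures of [3]-[11].

```python
import torch

def rate_distortion_loss(x, x_hat, y_likelihoods, z_likelihoods, lmbda=0.01):
    """R + lambda * D for a hyperprior model: rate of latents y and side info z, plus MSE distortion."""
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    # Rate: estimated bits per pixel, i.e. -log2 of the latents' likelihoods under the
    # entropy model (y conditioned on the hyperprior, z signaled as side information).
    bpp = (-torch.log2(y_likelihoods).sum() - torch.log2(z_likelihoods).sum()) / num_pixels
    distortion = torch.mean((x - x_hat) ** 2)        # MSE; MS-SSIM is a common alternative
    return bpp + lmbda * distortion

# Usage with dummy tensors standing in for an encoder/decoder/hyperprior pipeline.
x     = torch.rand(1, 3, 256, 256)                   # input image
x_hat = torch.rand(1, 3, 256, 256)                   # reconstruction from the decoder
y_lik = torch.rand(1, 192, 16, 16).clamp_min(1e-9)   # likelihoods of the quantized latents
z_lik = torch.rand(1, 128, 4, 4).clamp_min(1e-9)     # likelihoods of the hyper-latents
loss = rate_distortion_loss(x, x_hat, y_lik, z_lik)
```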
The student is tasked with comprehending and summarizing the current state-of-the-art approaches in learned image compression algorithms. This entails explaining various network structures, training techniques, metrics for evaluation, and comparisons with previous works. The student is encouraged to conduct extensive research and incorporate relevant research papers to augment their findings, which will be presented in a seminar paper and presentation.
Supervision: A. Burakhan Koyuncu (burakhan.koyuncu@tum.de)
References:
[1] Wallace, Gregory K. "The JPEG still picture compression standard." IEEE transactions on consumer electronics 38.1 (1992): xviii-xxxiv.
[2] “Versatile Video Coding,” Standard, Rec. ITU-T H.266 and ISO/IEC 23090-3, Aug. 2020.
[3] Ballé, Johannes, et al. “Variational image compression with a scale hyperprior”. ICLR 2018
[4] Minnen, David, Johannes Ballé, and George D. Toderici. "Joint autoregressive and hierarchical priors for learned image compression." NeurIPS 2018
[5] Ballé, Johannes, et al. "Nonlinear transform coding." IEEE Journal of Selected Topics in Signal Processing 15.2 (2020): 339-353.
[6] Zou, Renjie, Chunfeng Song, and Zhaoxiang Zhang. "The devil is in the details: Window-based attention for image compression." CVPR 2022.
[7] Lu, Ming, et al. "Transformer-based Image Compression." IEEE DCC 2022.
[8] Lu, Ming, and Zhan Ma. "High-efficiency lossy image coding through adaptive neighborhood information aggregation." arXiv preprint arXiv:2204.11448 (2022).
[9] Kim, Jun-Hyuk, Byeongho Heo, and Jong-Seok Lee. "Joint global and local hierarchical priors for learned image compression." CVPR 2022.
[10] Qian, Yichen, et al. "Entroformer: A Transformer-based Entropy Model for Learned Image Compression." ICLR 2022
[11] Koyuncu, A. Burakhan, et al. "Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression." ECCV 2022
Transformer models are heavily applied to natural language processing tasks [1] and have been extended to computer vision tasks. At the same time, advanced vision tasks such as detecting human-object interactions have become a hot research direction. They require us not only to process the semantic information in a single image and classify the human-object interactions, but also to capture the temporal relationships between consecutive frames in a video. This requirement makes the Transformer a natural fit [2][3].
In this topic, you will review the provided papers and extend them through your own literature study. Beyond understanding the basic architecture of the transformer, you will explore how the transformer architecture can be optimized to improve human-object interaction detection.
Supervision: Yuankai Wu (yuankai.wu@tum.de)
References:
[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[2] Y. Zhang, Y. Pan, T. Yao, R. Huang, T. Mei and C. Chen, "Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 19526-19535, doi: 10.1109/CVPR52688.2022.01894.
[3] A. S. M. Iftekhar, H. Chen, K. Kundu, X. Li, J. Tighe and D. Modolo, "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5343-5353, doi: 10.1109/CVPR52688.2022.00528.
With the success of transformer models on natural language processing tasks, researchers have looked for ways to integrate them into the computer vision domain. The initial Vision Transformer [1] architecture was an important milestone in this search. However, this model requires memory quadratic in the number of tokens, resulting in a huge memory footprint when processing high-resolution images. To get around this problem, the Swin Transformer [2] processes images in windows of patches and shifts the windows to let information flow between different segments of the image. Another successful approach was to introduce Pyramid Vision Transformers [3] and Multiscale Vision Transformers [4], which reduce the image size between stages using downsampling techniques. The Multi-Scale Vision Longformer [5] tackles the same problem with more efficient attention mechanisms and a global memory. HRFormer [6] integrates convolutional layers into its transformer blocks to create a model that can efficiently attend to distant segments of the input image. In this seminar, we will dive deep into the literature on high-resolution image processing with vision transformers and learn about useful techniques that can be applied to real-world tasks such as object detection, image segmentation, and image classification.
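To see why window-based attention, as popularized by the Swin Transformer [2], reduces the memory footprint, the sketch below restricts self-attention to non-overlapping windows of a token grid. It is a simplified illustration (no shifted windows, relative position bias, or multi-scale stages), with arbitrary dimensions.

```python
import torch
import torch.nn as nn

def window_attention(tokens, grid, window, attn):
    """Restrict self-attention to non-overlapping window x window patches of a grid x grid token map."""
    B, N, D = tokens.shape                     # N == grid * grid
    x = tokens.view(B, grid // window, window, grid // window, window, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, D)   # (B*num_windows, w*w, D)
    out, _ = attn(x, x, x)                     # attention matrix is only (w*w) x (w*w) per window
    out = out.reshape(B, grid // window, grid // window, window, window, D)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, D)

# A 64x64 token map: full attention would need one 4096x4096 attention matrix,
# while 8x8 windows need only 64 matrices of size 64x64.
attn = nn.MultiheadAttention(embed_dim=96, num_heads=4, batch_first=True)
tokens = torch.randn(1, 64 * 64, 96)
out = window_attention(tokens, grid=64, window=8, attn=attn)   # (1, 4096, 96)
```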
Supervision: Hasan Burak Dogaroglu (burak.dogaroglu@tum.de)
References:
[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. DOI:https://doi.org/10.48550/ARXIV.2010.11929
[2] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. DOI:https://doi.org/10.48550/ARXIV.2103.14030
[3] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). DOI:https://doi.org/10.1109/iccv48922.2021.00061
[4] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021. Multiscale Vision Transformers. DOI:https://doi.org/10.48550/ARXIV.2104.11227
[5] Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. 2021. Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding. DOI:https://doi.org/10.48550/ARXIV.2103.15358
[6] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. 2021. HRFormer: High-Resolution Transformer for Dense Prediction. DOI:https://doi.org/10.48550/ARXIV.2110.09408
Vision-language models based on transformer architectures have recently emerged for various applications. In this seminar topic, we investigate recent advances in the field of visual grounding, i.e., localizing objects in the visual domain (image/point cloud) from natural language prompts.
Supervision: Martin Piccolrovazzi (martin.piccolrovazzi@tum.de)
References:
[1] Li et al. Grounded Language-Image Pre-training (CVPR 2022)
[2] Zhang et al. Glipv2: Unifying localization and vision-language understanding (NeurIPS 2022)