Densely Annotated Video Driving (DAVID) Data Set

The Densely Annotated Video Driving (DAVID) data set consists of 28 video sequences of urban driving recorded in the CARLA simulator. In contrast to real-world data sets such as Cityscapes, the DAVID data set provides ground-truth pixel-wise semantic class labels for every single frame. The data set is intended to further facilitate research on semantic video segmentation by providing a diverse and large-scale video corpus. The 28 sequences comprise a total of 10,767 frames, each paired with a pixel-wise semantic label map. The videos were recorded at a frame rate of 10 Hz, and the average sequence duration is 38.4 seconds. Fourteen sequences were recorded in sunny weather, nine in rain and the remaining five in cloudy conditions. The recorded driving scenarios include regular driving, traffic jams as well as stopping and starting at traffic lights. A more detailed description is available when downloading the data set. When using the data set, please cite our paper "Pixel-Wise Failure Prediction for Semantic Video Segmentation", published at IEEE ICIP 2021, where the data set was first introduced.

Link to the data set at mediaTUM
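
As a rough illustration of how the per-frame annotations can be consumed, the following sketch pairs each video frame with its label map. The directory layout and file names are assumptions made for this example, not the actual structure of the DAVID download.

```python
# Minimal sketch: iterate over one DAVID sequence and pair every frame with its
# pixel-wise semantic label map. The directory layout (frames/ and labels/ with
# matching file names) is assumed here and may differ from the released data set.
from pathlib import Path

import numpy as np
from PIL import Image

SEQUENCE_DIR = Path("DAVID/sequence_01")  # hypothetical path

def load_frame_label_pairs(sequence_dir: Path):
    frame_files = sorted((sequence_dir / "frames").glob("*.png"))
    for frame_file in frame_files:
        label_file = sequence_dir / "labels" / frame_file.name
        frame = np.asarray(Image.open(frame_file))    # H x W x 3 RGB image
        labels = np.asarray(Image.open(label_file))   # H x W map of semantic class IDs
        yield frame, labels

if __name__ == "__main__":
    for frame, labels in load_frame_label_pairs(SEQUENCE_DIR):
        print(frame.shape, labels.shape, np.unique(labels))
```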

Multi-View Region of Interest Prediction for Autonomous Driving

Visual environment perception is one of the key elements of both autonomous and manual driving. Modern fully automated vehicles are equipped with a range of different sensors and capture their surroundings with multiple cameras. The ability to predict a human driver's attention is the basis for various autonomous driving functions. State-of-the-art attention prediction approaches use only a single front-facing camera and rely on automatically generated training data. In this paper, we present a manually labeled multi-view region of interest dataset. We use our dataset to finetune a state-of-the-art region of interest prediction model for multiple camera views. Additionally, we show that using two separate models, one for front-view and one for rear-view data, improves region of interest prediction. We further propose a semi-supervised annotation framework which uses the best-performing finetuned models to generate pseudo labels and thereby improve the efficiency of the labeling process. Our results show that existing region of interest prediction models perform well on front-view data, while finetuning improves performance especially on rear-view data. Our current dataset consists of about 16,000 images, and we plan to further increase its size. The dataset and the source code of the proposed semi-supervised annotation framework will be made available on GitHub and can be used to generate custom region of interest data.

Link to the dataset at mediaTUM
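
The semi-supervised annotation idea described above can be sketched as follows: a finetuned model predicts region of interest maps on unlabeled images, and confident predictions are stored as pseudo labels for a human annotator to verify or correct. The model interface and the confidence threshold below are assumptions for illustration, not the actual framework released on GitHub.

```python
# Hedged sketch of a pseudo-labeling step for semi-supervised ROI annotation.
# `predict_roi` stands in for a finetuned region of interest prediction model;
# its interface and the confidence threshold are assumptions for this example.
from pathlib import Path

import numpy as np
from PIL import Image

CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off for accepting a pseudo label

def predict_roi(image: np.ndarray) -> np.ndarray:
    """Placeholder for the finetuned ROI model: returns per-pixel scores in [0, 1]."""
    raise NotImplementedError("plug in the finetuned front- or rear-view model here")

def generate_pseudo_labels(unlabeled_dir: Path, output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    for image_file in sorted(unlabeled_dir.glob("*.png")):
        image = np.asarray(Image.open(image_file))
        scores = predict_roi(image)                      # H x W ROI confidence map
        pseudo_label = scores >= CONFIDENCE_THRESHOLD    # binary ROI mask
        # Store the mask so an annotator only needs to verify or correct it.
        Image.fromarray(pseudo_label.astype(np.uint8) * 255).save(
            output_dir / image_file.name
        )
```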

Room segmentation in point clouds

Emerging applications, such as indoor navigation or facility management, create new requirements for the automatic and robust partitioning of indoor 3D point clouds into rooms. Previous research is either based on the Manhattan-world assumption or relies on the availability of scanner pose information. We address these limitations by following the architectural definition of a room, in which a room is an inner free space separated from other spaces by openings or partitions. To this end, we formulate an anisotropic potential field for 3D environments and illustrate how it can be used for room segmentation in the proposed segmentation pipeline. The experimental results confirm that our method outperforms state-of-the-art methods on a number of datasets, including those that violate the Manhattan-world assumption.
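
The description above omits implementation details. As a rough illustration of the general idea only, a potential field over free space whose maxima seed individual rooms, with narrow openings acting as boundaries, the following sketch uses a simple isotropic distance field on a 2D occupancy grid rather than the anisotropic 3D formulation proposed in the paper.

```python
# Simplified, isotropic illustration of potential-field-based room segmentation
# on a 2D occupancy grid. The paper uses an anisotropic field in 3D; this sketch
# only conveys the general idea of seeding rooms at field maxima.
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def segment_rooms(free_space: np.ndarray) -> np.ndarray:
    """free_space: boolean grid, True where the environment is traversable."""
    # Potential field: distance from every free cell to the nearest wall.
    potential = ndimage.distance_transform_edt(free_space)
    # Room centers appear as strong local maxima of the field.
    maxima = (potential == ndimage.maximum_filter(potential, size=15)) & (potential > 2.0)
    seeds, _ = ndimage.label(maxima)
    # Grow the seeds downhill on the inverted field; narrow openings such as
    # doorways form the natural boundaries between rooms.
    return watershed(-potential, markers=seeds, mask=free_space)
```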

LMT Texture Database

When a rigid tool is stroked over an object surface, the vibrations induced on the tool, which represent the interaction between the tool and the surface texture, can be measured by means of an accelerometer. Such acceleration signals can be used to recognize or classify object surface textures. The temporal and spectral properties of the acquired signals, however, depend heavily on parameters such as the force applied to the surface or the lateral velocity during exploration. Robust features that are invariant to such scan-time parameters are currently lacking, but would enable texture classification and recognition from uncontrolled human exploratory movements. We introduce a haptic texture database which allows for a systematic analysis of feature candidates. The database includes accelerations recorded during controlled and well-defined texture scans, as well as uncontrolled human free-hand texture explorations, for 69 different textures.
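
As an illustration of how such acceleration recordings might be turned into candidate features, the sketch below computes a coarse log-power spectrum per signal. It is only one possible feature candidate, not a feature proposed with the database, and the sampling rate is an assumption.

```python
# Hedged example: compute a simple spectral feature vector from a recorded
# acceleration signal. This is only one possible feature candidate; it is not
# claimed to be invariant to scan force or velocity, and the sampling rate is
# an assumption for this sketch.
import numpy as np
from scipy.signal import welch

SAMPLING_RATE_HZ = 10_000  # assumed accelerometer sampling rate

def spectral_feature(acceleration: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """Log-power in a fixed number of frequency bands of the acceleration signal."""
    freqs, psd = welch(acceleration, fs=SAMPLING_RATE_HZ, nperseg=1024)
    bands = np.array_split(psd, n_bands)  # group PSD bins into coarse bands
    return np.log1p(np.array([band.mean() for band in bands]))

if __name__ == "__main__":
    signal = np.random.randn(50_000)      # stand-in for a recorded texture scan
    print(spectral_feature(signal).shape)  # (32,)
```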

A dataset of thin-walled deformable objects

Datasets of object models with many variants of each object are required for manipulation and grasp planning using machine learning and simulation methods. This work presents a parametric model generator for thin-walled deformable or solid objects found in household scenes, such as bottles, glasses and other containers. Two datasets are provided whose models resemble real objects and contain a large number of variants of realistic bottles.
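
The following sketch illustrates what a parametric container generator of this kind might look like: a small parameter set describing a bottle is sampled and turned into a 2D profile that a mesh tool could revolve into a thin-walled model. The parameter names and ranges are assumptions for this illustration, not the generator released with the datasets.

```python
# Hedged sketch of a parametric bottle generator: random parameters describe a
# bottle profile (radius over height). Parameter names and ranges are assumptions
# for this illustration and do not reproduce the actual generator.
from dataclasses import dataclass
import random

import numpy as np

@dataclass
class BottleParams:
    height: float         # total bottle height in metres
    body_radius: float    # radius of the cylindrical body
    neck_radius: float    # radius of the neck opening
    neck_fraction: float  # fraction of the height taken up by the neck
    wall_thickness: float

def sample_bottle() -> BottleParams:
    return BottleParams(
        height=random.uniform(0.15, 0.35),
        body_radius=random.uniform(0.03, 0.06),
        neck_radius=random.uniform(0.01, 0.02),
        neck_fraction=random.uniform(0.2, 0.35),
        wall_thickness=random.uniform(0.001, 0.003),
    )

def outer_profile(p: BottleParams, samples: int = 100) -> np.ndarray:
    """Radius as a function of height; revolving this curve yields the outer surface."""
    z = np.linspace(0.0, p.height, samples)
    body_top = p.height * (1.0 - p.neck_fraction)
    # Constant body radius, then a linear taper toward the neck radius.
    r = np.where(
        z <= body_top,
        p.body_radius,
        p.body_radius + (p.neck_radius - p.body_radius) * (z - body_top) / (p.height - body_top),
    )
    return np.stack([z, r], axis=1)

if __name__ == "__main__":
    params = sample_bottle()
    print(params)
    print(outer_profile(params)[:5])
```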

Video Synchronization Benchmark

This website provides a collection of user-generated multi-viewpoint video sets (i.e., casual recordings of isolated events from multiple perspectives). Its purpose is to facilitate an objective performance evaluation of different video synchronization algorithms. The video collection covers 43 distinct events, recorded from 2 to 5 viewpoints each. In total, there are 164 video pairs whose relative temporal offsets are to be determined. All videos have been recorded with consumer-grade cameras (camcorders and mobile phones) and under realistic conditions (shaking cameras, unconstrained viewpoints, etc.), rendering fully automatic synchronization a challenging task.
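
To make the benchmark task concrete, the sketch below scores a set of predicted offsets against ground-truth offsets using a simple frame-accuracy criterion. The file format, offset unit and tolerance are assumptions for this example, not the benchmark's official evaluation protocol.

```python
# Hedged sketch: evaluate predicted temporal offsets for the video pairs against
# ground truth. The offset unit (frames), file layout and the 1-frame tolerance
# are assumptions, not the benchmark's official protocol.
import csv

def load_offsets(path: str) -> dict[str, float]:
    """Read 'pair_id,offset' rows, e.g. 'event03_cam1_cam2,37.5' (hypothetical format)."""
    with open(path, newline="") as f:
        return {row[0]: float(row[1]) for row in csv.reader(f) if row}

def evaluate(gt_path: str, pred_path: str, tolerance_frames: float = 1.0) -> float:
    gt = load_offsets(gt_path)
    pred = load_offsets(pred_path)
    correct = sum(
        1 for pair, offset in gt.items()
        if pair in pred and abs(pred[pair] - offset) <= tolerance_frames
    )
    return correct / len(gt)  # fraction of pairs synchronized within the tolerance

if __name__ == "__main__":
    accuracy = evaluate("ground_truth_offsets.csv", "predicted_offsets.csv")
    print(f"Pairs within tolerance: {accuracy:.1%}")
```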