Semantic Segmentation using Deep Upconvolutional Neural Networks
DeepScene is a set of upconvolutional neural network models trained end-to-end for robust pixel-wise semantic segmentation. It is being developed by members of the Autonomous Intelligent Systems group and the Computer Vision group at the University of Freiburg, Germany.
Currently there are two base model architectures: UpNet and our recently proposed AdapNet. The website currently shows RGB, Depth and CMoDE models trained using AdapNet, and others using the UpNet architecture. The UpNet architecture has upconvolutional layers of size C×Ncl, where Ncl is the number of classes and C is a scalar factor of filter augmentations. The contractive segment of the network contains convolution and pooling layers, while the expansive segment of the network contains upsampling and convolution layers.
Our AdapNet architecture follows the general principle of having a contractive and an expansive segment, similar to the FCN architectures. In contrast to previous approaches, we adapt the recently proposed ResNet-50 for the contractive segment. The ResNet architecture includes batch normalization and shortcut connections that can skip convolutions. This allows the design of much deeper networks without facing degradation of the gradient and therefore leads to very large receptive fields with often highly discriminative features. The output of this contractive segment is downsampled by a factor of 32 with respect to the input. We then upsample this output using deconvolutions and refine it by fusing high-resolution feature maps from the contractive segment. In addition, we incorporate several improvements proposed in the ICRA'17 paper, such as Multiscale ResNet blocks, Front Convolutions and Higher Resolution Outputs. A detailed description of the architecture can be found in the ICRA'17 paper.
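To give a feel for what a downsampling factor of 32 means in practice, the following minimal sketch computes the spatial size of the contractive segment's output. The function name is hypothetical, and both the rounding behaviour (ceiling division) and the use of a full 1024x768 frame as network input are assumptions made for illustration; the actual input size and padding scheme used by AdapNet may differ.

```python
def downsampled_size(height, width, factor=32):
    """Return the (height, width) of a feature map after downsampling by `factor`.

    Assumption: sizes that are not a multiple of `factor` are rounded up,
    as happens with "same"-style padding. This is illustrative only.
    """
    if height <= 0 or width <= 0 or factor <= 0:
        raise ValueError("height, width and factor must be positive")
    # -(-a // b) is integer ceiling division
    return (-(-height // factor), -(-width // factor))


# Assumption: a full-resolution 1024x768 frame is fed to the network unchanged.
print(downsampled_size(768, 1024))  # (24, 32)
```

Under these assumptions, the expansive segment has to recover a full-resolution prediction from a grid of only 24 by 32 locations, which is why fusing higher-resolution feature maps from the contractive segment helps with refinement.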
The most intuitive way to fuse data with deep convolutional neural networks is to stack the modalities into multiple channels and learn combined features end-to-end. However, previous efforts have been unsuccessful due to the difficulty in propagating gradients through the entire length of the model.
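Channel stacking itself is simple to express. The sketch below is a plain-Python illustration, not the preprocessing code used for our models: it appends a single-channel depth value to each RGB pixel so that the network would see a four-channel input. The function name and the nested-list image representation are assumptions chosen to keep the example dependency-free.

```python
def stack_channels(rgb, depth):
    """Stack an H x W x 3 RGB image and an H x W depth map into H x W x 4.

    Both inputs are nested lists (rows of pixels); this representation is
    an assumption made for illustration.
    """
    same_height = len(rgb) == len(depth)
    same_width = all(len(r) == len(d) for r, d in zip(rgb, depth))
    if not (same_height and same_width):
        raise ValueError("RGB and depth images must have the same height and width")
    # Append the depth value as a fourth channel of every pixel.
    return [
        [list(pixel) + [d] for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


# A 1 x 2 toy image: two RGB pixels plus their depth values.
stacked = stack_channels([[(1, 2, 3), (4, 5, 6)]], [[0.5, 0.7]])
print(stacked)
```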
In the late-fused-convolution approach, each model is first trained to segment using a specific spectrum/modality. Afterwards, the feature maps are summed up element-wise before a series of convolution, pooling and upconvolution layers.
The advantage of this approach is that the features in each model may be good at classifying a specific class, so combining them may yield better overall results, even though it necessitates heavy parameter tuning.
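The element-wise summation at the heart of late fusion can be sketched as follows. This is a toy illustration on small 2-D lists rather than the actual network code; the function name is hypothetical, and real feature maps would be multi-channel tensors handled by a deep learning framework.

```python
def fuse_feature_maps(*feature_maps):
    """Element-wise sum of same-shaped 2-D feature maps given as lists of rows."""
    if not feature_maps:
        raise ValueError("at least one feature map is required")
    first = feature_maps[0]
    for fmap in feature_maps[1:]:
        same_rows = len(fmap) == len(first)
        if not same_rows or any(len(a) != len(b) for a, b in zip(fmap, first)):
            raise ValueError("all feature maps must have the same shape")
    # zip(*feature_maps) pairs up rows; zip(*rows) pairs up values per position.
    return [[sum(values) for values in zip(*rows)] for rows in zip(*feature_maps)]


# Toy feature maps from two modality-specific models (values are made up).
rgb_features = [[1, 2], [3, 4]]
depth_features = [[10, 20], [30, 40]]
print(fuse_feature_maps(rgb_features, depth_features))  # [[11, 22], [33, 44]]
```

Because the sum treats every model equally at every location, any weighting between modalities has to be learned by the layers that follow the fusion.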
This model fuses multiple modalities or spectra for pixel-wise semantic segmentation and has two components: the experts that map particular modalities to the segmentation output and the adaptive gating network that learns “how much” and “when” to rely on each expert.
We train the network to learn the convex combination of the experts by back-propagating into the weights, similar to any other synapse weight or convolutional kernel.
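A convex combination means the expert outputs are mixed with non-negative weights that sum to one. The sketch below illustrates this for a single pixel. It assumes that the gating weights are obtained by applying a softmax to raw gating scores, which is a common way to obtain such weights but is an assumption here, not a description of our implementation. All function names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Map raw scores to non-negative weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_probs, gate_scores):
    """Convex combination of per-expert class probabilities for one pixel.

    expert_probs: one list of class probabilities per expert.
    gate_scores:  one raw gating score per expert (assumed softmax-normalized).
    """
    if len(expert_probs) != len(gate_scores):
        raise ValueError("need exactly one gate score per expert")
    weights = softmax(gate_scores)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]


# Two experts, two classes; the gate favours the first expert (made-up values).
fused = combine_experts([[0.8, 0.2], [0.4, 0.6]], gate_scores=[2.0, 0.0])
print(fused)
```

Since the weights are a differentiable function of the gating scores, gradients of the segmentation loss can flow back into the gating network in the same way as into any other layer.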
The Freiburg Forest
dataset was collected using our Viona autonomous mobile robot
platform equipped with cameras for capturing multi-spectral and
multi-modal images. The dataset may be used for evaluation of
different perception algorithms for segmentation, detection,
classification, etc. All scenes were recorded at 20 Hz with a camera
resolution of 1024x768 pixels. The data was collected on three
different days to have enough variability in lighting conditions as
shadows and sun angles play a crucial role in the quality of
acquired images. The robot traversed about 4.7 km each day. We
provide manually annotated pixel-wise ground truth segmentation masks for 6 classes:
Obstacle, Trail, Sky, Grass, Vegetation, and Void.
For each spectrum/modality, we provide one zip file containing all the sequences. Each sequence is a continuous stream of camera frames. All the multi-spectral images are in the PNG format and the depth images are in the 16-bit TIFF format. For the evaluations mentioned in the paper, we provide two text files containing the train and test splits. If you would like to contribute to the annotations, please contact us. More details and evaluations can be found in our papers listed under publications.
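A split file could be read along the following lines. The layout of the split files is not described here, so this sketch assumes one frame identifier per line with blank lines ignored; the function names and the file name `train.txt` are hypothetical and should be adapted to the actual files in the download.

```python
from pathlib import Path


def parse_split(text):
    """Return the frame identifiers listed in the text of a split file.

    Assumption: one identifier per line; blank lines are skipped.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_split(path):
    """Read a split file from disk and return its frame identifiers."""
    return parse_split(Path(path).read_text(encoding="utf-8"))


# Usage with an in-memory example (identifiers are made up):
print(parse_split("frame_0001\nframe_0002\n\nframe_0003\n"))

# With a real file (hypothetical name):
# train_ids = load_split("train.txt")
```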
Please cite our work if you use the Freiburg Forest Dataset or report results based on it.
@InProceedings{Valada_2016_ISER,
author = {Abhinav Valada and Gabriel Oliveira and Thomas Brox and Wolfram Burgard},
title = {Deep Multispectral Semantic Scene Understanding of Forested Environments using Multimodal Fusion},
booktitle = {The 2016 International Symposium on Experimental Robotics (ISER 2016)},
year = 2016,
month = oct,
url = {http://ais.informatik.uni-freiburg.de/publications/papers/valada16iser.pdf},
address = {Tokyo, Japan}
}
The data is provided for non-commercial use only. By downloading the data, you accept the license agreement, which can be downloaded here. If you report results based on the Freiburg Forest dataset, please consider citing the first paper mentioned under publications.
The raw dataset contains over 15,000 images of unstructured forest environments, captured at 20 Hz using our Viona autonomous robot platform equipped with a Bumblebee2 stereo vision camera.