Semantic Segmentation using Deep Upconvolutional Neural Networks

DeepScene contains models trained on various datasets, using both unimodal and multimodal data. The demo currently supports urban street scenes and forested environments. Select a dataset and a corresponding model from the drop-down box below, then click Random Example or upload your own image to see the live segmentation results. To learn more about the model acronyms used, please see the Technical Approach section.

Input Image

Segmentation Output

  • Building

  • Rider/Bike

  • Car/Truck/Bus

  • Fence

  • Void

  • Sky

  • Traffic Sign

  • Person

  • Pole

  • Vegetation

  • Sidewalk

  • Road



Technical Approach

DeepScene is a set of upconvolutional neural network models trained end-to-end for robust pixel-wise semantic segmentation. It is being developed by members of the Autonomous Intelligent Systems group and the Computer Vision group at the University of Freiburg, Germany.

UpNet Architecture

Currently there are two base model architectures: UpNet and our recently proposed AdapNet. The RGB, Depth and CMoDE models shown on this website are trained using AdapNet; all others use the UpNet architecture. The UpNet architecture has upconvolutional layers of size C×Ncl, where Ncl is the number of classes and C is a scalar factor of filter augmentations. The contractive segment of the network consists of convolution and pooling layers, while the expansive segment consists of upsampling and convolution layers.
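
The contract-then-expand idea can be illustrated with a minimal sketch in plain Python. It is not the UpNet implementation: fixed 2x2 max pooling and nearest-neighbour upsampling stand in for the learned layers, the helper names are hypothetical, and C×Ncl is read here as a filter count, with C = 4 chosen purely for illustration.

```python
def max_pool_2x2(fmap):
    """Contractive step: keep the maximum of every 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]


def upsample_2x(fmap):
    """Expansive step: nearest-neighbour upsampling (not a learned upconvolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat the row vertically
    return out


def upconv_filters(scale_c, num_classes):
    """Filter count of an upconvolutional layer of size C x Ncl (assumed reading)."""
    return scale_c * num_classes


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(feature_map)   # 4x4 -> 2x2
restored = upsample_2x(pooled)       # 2x2 -> 4x4
filters = upconv_filters(4, 6)       # hypothetical C = 4, six classes
```

The round trip restores the spatial size but not the detail lost in pooling, which is why the expansive segment pairs upsampling with further convolutions.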

AdapNet Architecture

Our AdapNet architecture follows the general principle of having a contractive and an expansive segment, similar to the FCN architectures. In contrast to previous approaches, we adapt the recently proposed ResNet-50 for the contractive segment. The ResNet architecture includes batch normalization and shortcut connections that allow the signal to skip convolutions. This allows the design of much deeper networks without facing degradation of the gradient, and therefore leads to very large receptive fields with often highly discriminative features. The output of this contractive segment is downsampled by a factor of 32 with respect to the input. We then upsample this output using deconvolutions and refine it by fusing high-resolution feature maps from the contractive segment. In addition, we incorporate several improvements proposed in the ICRA'17 paper, such as Multiscale ResNet blocks, Front Convolutions and Higher Resolution Outputs. A detailed description of the architecture can be found in the ICRA'17 paper.
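
To make the factor of 32 concrete, the following sketch computes the approximate feature-map size at the end of the contractive segment. It assumes five stride-2 stages, as in a standard ResNet-50, and rounds up; exact sizes depend on padding choices that are not described here. The function name is hypothetical.

```python
def contractive_output_size(height, width, num_halvings=5):
    """Approximate output size after `num_halvings` stride-2 stages.

    A standard ResNet-50 halves the resolution five times, giving 2**5 = 32.
    Ceiling division is an assumption about how odd sizes are padded.
    """
    factor = 2 ** num_halvings
    out_h = -(-height // factor)   # ceiling division
    out_w = -(-width // factor)
    return out_h, out_w, factor


# A 1024x768 input (width x height) shrinks to roughly 32x24 feature cells.
out_h, out_w, factor = contractive_output_size(768, 1024)
```

Each output cell therefore summarises a 32×32 input region, which is the reason the expansive segment fuses higher-resolution feature maps during refinement.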

The lists below show the currently available models for demo on this website:

Unimodal Models

  • RGB
  • Depth
  • NIR (Near-Infrared)
  • NRG (Near-Infrared, Red, Green)
  • EVI (Enhanced Vegetation Index)
  • NDVI (Normalized Difference Vegetation Index)
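
For reference, the two vegetation indices above can be computed per pixel from band reflectances. The sketch below uses the standard NDVI definition and the commonly used EVI coefficients (gain 2.5, 6, 7.5 and 1); whether the DeepScene models were trained with exactly these coefficients is an assumption, and the function names are illustrative.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0


def evi(nir, red, blue, gain=2.5, c1=6.0, c2=7.5, soil=1.0):
    """Enhanced Vegetation Index with the commonly used default coefficients."""
    denom = nir + c1 * red - c2 * blue + soil
    return gain * (nir - red) / denom if denom else 0.0


def nrg_pixel(nir, red, green):
    """NRG composite: near-infrared takes the place of the first channel."""
    return (nir, red, green)


# Hypothetical reflectances in [0, 1] for a single vegetation pixel.
pixel = {"nir": 0.5, "red": 0.1, "green": 0.12, "blue": 0.05}
pixel_ndvi = ndvi(pixel["nir"], pixel["red"])
pixel_evi = evi(pixel["nir"], pixel["red"], pixel["blue"])
pixel_nrg = nrg_pixel(pixel["nir"], pixel["red"], pixel["green"])
```

Healthy vegetation reflects strongly in the near-infrared and absorbs red light, so both indices are high for foliage and low for bare ground.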

Multimodal Models

  1. Channel-Stacking (CS)
    • Three channel - RGB, NIR, Depth
    • Four channel - RGB, NIR
    • Five channel - RGB, NIR, Depth
  2. Late-Fused Convolution (LFC)
    • RGB-Depth
    • NRG-Depth
    • RGB-EVI
    • RGB-NIR
  3. Convoluted Mixture of Deep Experts (CMoDE)
    • RGB-Depth
    • RGB-EVI

Channel-Stacking (CS) Model

CS Architecture

The most intuitive way to fuse data with deep convolutional neural networks is to stack the modalities as multiple input channels and learn combined features end-to-end. However, previous efforts have been unsuccessful due to the difficulty of propagating gradients through the entire length of the model.
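
As a minimal sketch of channel stacking, the snippet below concatenates the per-pixel channels of several aligned images into one multi-channel input. Images are represented as nested lists of channel tuples purely for illustration; the function name and sample values are hypothetical.

```python
def stack_channels(*modalities):
    """Concatenate the channels of aligned images, pixel by pixel."""
    height, width = len(modalities[0]), len(modalities[0][0])
    for image in modalities:
        if len(image) != height or any(len(row) != width for row in image):
            raise ValueError("all modalities must share the same resolution")
    return [
        [tuple(ch for image in modalities for ch in image[r][c])
         for c in range(width)]
        for r in range(height)
    ]


# Tiny 1x2 images: RGB has three channels, NIR and Depth have one each.
rgb = [[(10, 20, 30), (40, 50, 60)]]
nir = [[(200,), (210,)]]
depth = [[(5,), (7,)]]

four_channel = stack_channels(rgb, nir)          # RGB + NIR
five_channel = stack_channels(rgb, nir, depth)   # RGB + NIR + Depth
```

The stacked image is then fed to a single network whose first convolution accepts the larger channel count.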

Late-Fused Convolution (LFC) Model

In the late-fused-convolution approach, each model is first trained to segment using a specific spectrum or modality. The resulting feature maps are then summed element-wise and passed through a series of convolution, pooling and upconvolution layers.

This approach has the advantage that the features in each model may be good at classifying a specific class, so combining them may yield better overall performance, although it requires heavy parameter tuning.
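
The fusion step itself is a plain element-wise sum. The sketch below shows it on two small single-channel feature maps; the values and the function name are hypothetical, and the convolution, pooling and upconvolution layers that follow the sum are omitted.

```python
def fuse_by_sum(*feature_maps):
    """Element-wise sum of equally sized 2D feature maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    for fmap in feature_maps:
        if len(fmap) != height or any(len(row) != width for row in fmap):
            raise ValueError("feature maps must have identical shapes")
    return [
        [sum(fmap[r][c] for fmap in feature_maps) for c in range(width)]
        for r in range(height)
    ]


# Hypothetical activations from an RGB stream and a Depth stream.
rgb_features = [[0.25, 0.75], [0.5, 0.0]]
depth_features = [[0.5, 0.25], [0.25, 1.0]]
fused = fuse_by_sum(rgb_features, depth_features)
```

Because the sum weights every stream equally at every location, any preference for one modality has to be learned by the layers that come after it.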

LFC Architecture

Convoluted Mixture of Deep Experts (CMoDE) Model

This model fuses multiple modalities or spectra for pixel-wise semantic segmentation and has two components: the experts that map particular modalities to the segmentation output and the adaptive gating network that learns “how much” and “when” to rely on each expert.

We train the network to learn the convex combination of the experts by back-propagating into the weights, similar to any other synapse weight or convolutional kernel.
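
A convex combination means the expert weights are non-negative and sum to one. The sketch below shows one common way to obtain such weights, a softmax over gating scores, applied to the class scores of two experts at a single pixel. In the real model the gating scores come from a learned network; here they are hypothetical inputs, as are the function names and the sample scores.

```python
import math


def softmax(logits):
    """Turn arbitrary scores into non-negative weights that sum to one."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_scores, gate_logits):
    """Convex combination of per-class scores from several experts."""
    weights = softmax(gate_logits)
    num_classes = len(expert_scores[0])
    fused = [
        sum(w * scores[k] for w, scores in zip(weights, expert_scores))
        for k in range(num_classes)
    ]
    return weights, fused


# Hypothetical class scores at one pixel from an RGB and a Depth expert.
rgb_scores = [0.7, 0.2, 0.1]
depth_scores = [0.1, 0.8, 0.1]

# Gating scores that favour the RGB expert for this input.
weights, fused = combine_experts([rgb_scores, depth_scores], [2.0, 0.0])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Since softmax is differentiable, gradients flow into the gating scores in the same way as into any other weight during training.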

CMoDE Architecture



The Freiburg Forest dataset was collected using our Viona autonomous mobile robot platform equipped with cameras for capturing multi-spectral and multi-modal images. The dataset may be used to evaluate different perception algorithms for segmentation, detection, classification, etc. All scenes were recorded at 20 Hz with a camera resolution of 1024×768 pixels. The data was collected on three different days to capture enough variability in lighting conditions, since shadows and sun angles play a crucial role in the quality of the acquired images. The robot traversed about 4.7 km each day. We provide manually annotated pixel-wise ground truth segmentation masks for 6 classes: Obstacle, Trail, Sky, Grass, Vegetation, and Void.

For each spectrum/modality, we provide one zip file containing all the sequences. Each sequence is a continuous stream of camera frames. All the multi-spectral images are in PNG format and the depth images are in 16-bit TIFF format. For the evaluations mentioned in the paper, we provide two text files containing the train and test splits. If you would like to contribute to the annotations, please contact us. More details and evaluations can be found in our papers listed under publications.


Please cite our work if you use the Freiburg Forest Dataset or report results based on it.

author = {Abhinav Valada and Gabriel Oliveira and Thomas Brox and Wolfram Burgard},
title = {Deep Multispectral Semantic Scene Understanding of Forested Environments using Multimodal Fusion},
booktitle = {The 2016 International Symposium on Experimental Robotics (ISER 2016)},
year = 2016,
month = oct,
url = {},
address = {Tokyo, Japan}

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the Freiburg Forest datasets, please consider citing the first paper mentioned under publications.


Freiburg Forest Raw

The raw dataset contains over 15,000 images of unstructured forest environments, captured at 20 Hz using our Viona autonomous robot platform equipped with a Bumblebee2 stereo vision camera.

Freiburg Forest Multi-Modal/Spectral Annotated

The dataset contains the following multi-modal/spectral images with groundtruth annotations: RGB, Depth, NIR, NRG, NDVI, EVI and their variants. Pixel-level annotations are provided for 6 semantic classes: Trail, Grass, Vegetation, Obstacle, Sky, Void.

Video Demos


Panels: Modality 1, Modality 2, Segmentation Output

Clips: Motion Blur, Snow & Shadows


  • Abhinav Valada, Johan Vertens, Ankit Dhall, Wolfram Burgard
    AdapNet: Adaptive Semantic Segmentation in Adverse Environmental Conditions
    Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017.

  • Abhinav Valada, Ankit Dhall, Wolfram Burgard
    Convoluted Mixture of Deep Experts for Robust Semantic Segmentation
    IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Workshop, State Estimation and Terrain Perception for All Terrain Mobile Robots, Daejeon, Korea, 2016.

  • Abhinav Valada, Gabriel L. Oliveira, Thomas Brox, Wolfram Burgard
    Deep Multispectral Semantic Scene Understanding of Forested Environments using Multimodal Fusion
    The International Symposium on Experimental Robotics (ISER), Tokyo, Japan, 2016.

  • Abhinav Valada, Gabriel L. Oliveira, Thomas Brox, Wolfram Burgard
    Robust Semantic Segmentation using Deep Fusion
    Robotics: Science and Systems (RSS) Workshop, Limits and Potentials of Deep Learning in Robotics, Ann Arbor, USA, 2016.

  • Gabriel L. Oliveira, Abhinav Valada, Wolfram Burgard, Thomas Brox
    Deep Learning for Human Part Discovery in Images
    Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 2016.