DFormer: Rethinking RGBD Representation Learning
for Semantic Segmentation

ICLR 2024


Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, Qibin Hou

VCIP, Nankai University

TL;DR: DFormer provides an RGB-D pretraining framework that learns transferable RGB-D representations and can be applied to various RGB-D downstream tasks.
DFormer is a baseline RGB-D encoder; you can pretrain stronger RGB-D encoders with the RGB-D pretraining code to boost RGB-D research.

Abstract


We present DFormer, a novel RGB-D pretraining framework to learn expressive representations for RGB-D segmentation tasks. DFormer has two new innovations: 1) Unlike previous works that aim to encode RGB features, DFormer comprises a sequence of RGB-D blocks, which are tailored for encoding both RGB and depth information through a novel attention module. 2) We pretrain the backbone using image-depth pairs from ImageNet-1K, so DFormer is endowed with the capacity to encode RGB-D representations. This avoids the mismatched encoding of the 3D geometry relationships in depth maps by RGB-pretrained backbones, which is widespread in existing methods but has not been resolved. We fine-tune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with lower computational cost than previous SOTA methods, e.g., 57.2% mIoU on the NYUv2 dataset with 39.0M parameters and 65.7G FLOPs.
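
To make the idea of an RGB-D block concrete, here is a minimal PyTorch sketch of a block that updates RGB and depth features jointly. It is not the official DFormer implementation: the class name `RGBDBlockSketch`, the use of standard multi-head cross-attention, and the specific residual layout are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class RGBDBlockSketch(nn.Module):
    """Hypothetical RGB-D block: the depth stream attends to the RGB stream."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_d = nn.LayerNorm(dim)
        # Cross-attention: depth tokens query RGB tokens (illustrative choice).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        # rgb, depth: (batch, tokens, dim)
        q = self.norm_d(depth)
        kv = self.norm_rgb(rgb)
        fused, _ = self.cross_attn(q, kv, kv)
        depth = depth + fused                       # depth stream enriched by RGB cues
        rgb = rgb + self.mlp(self.norm_rgb(rgb))    # RGB stream refined independently
        return rgb, depth


if __name__ == "__main__":
    blk = RGBDBlockSketch(dim=64)
    rgb, depth = torch.randn(2, 196, 64), torch.randn(2, 196, 64)
    rgb, depth = blk(rgb, depth)
    print(rgb.shape, depth.shape)  # both torch.Size([2, 196, 64])
```

Stacking blocks like this keeps both modalities live throughout the encoder, which is the property the paper contrasts with RGB-only backbones; the exact attention design in DFormer differs and should be taken from the paper and code.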

Overview

The RGBD pretraining framework in DFormer.
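
The pretraining stage can be summarized as supervised classification on ImageNet-1K image-depth pairs. Below is a minimal sketch of one such training step, assuming a standard cross-entropy objective; the function `pretrain_step` and the `encoder`/`head` interfaces are hypothetical and do not mirror the repo's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pretrain_step(encoder: nn.Module, head: nn.Module,
                  rgb: torch.Tensor, depth: torch.Tensor,
                  labels: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    """One classification step on an (RGB, depth) pair from ImageNet-1K."""
    optimizer.zero_grad()
    features = encoder(rgb, depth)      # encoder consumes both modalities jointly
    logits = head(features)             # classifier over the 1000 ImageNet classes
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pretraining, the classification head is discarded and the RGB-D encoder is fine-tuned on downstream tasks (e.g., RGB-D semantic segmentation) with a lightweight decoder head, as described in the abstract.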


Citation


@article{yin2023dformer,
  title={DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation},
  author={Yin, Bowen and Zhang, Xuying and Li, Zhongyu and Liu, Li and Cheng, Ming-Ming and Hou, Qibin},
  journal={arXiv preprint arXiv:2309.09668},
  year={2023}
}