- Multiview images: multiple 2D images of the same object from different angles.
- Depth map
- Voxel: volumetric occupancy
- Point cloud
- Polygon mesh
- Implicit functions: e.g., a signed distance function whose zero level set is the surface
- Scene: Indoor
ScanNet is a richly-annotated RGB-D video dataset. It includes camera pose information, surface reconstructions, and semantic segmentation labels.
The mesh is stored in binary PLY format, with the +Z axis in the upright orientation.
The raw sensor stream is a binary format with per-frame color, depth, camera pose, and other data.
To load a mesh file:

```python
import numpy as np
from plyfile import PlyData

def load_ply(self, index):
    filepath = self.data_root / self.data_paths[index]
    plydata = PlyData.read(filepath)
    data = plydata['vertex'].data  # structured array of vertex properties
    coords = np.array([data['x'], data['y'], data['z']], dtype=np.float32).T
    feats = np.array([data['red'], data['green'], data['blue']], dtype=np.float32).T
    labels = np.array(data['label'], dtype=np.int32)
    return coords, feats, labels, None
```
The Minkowski Engine is an auto-differentiation library for sparse tensors. It supports all standard network layers such as convolution, pooling, unpooling, and broadcasting operations for sparse tensors.
3D voxel feature maps are often sparse. Minkowski Engine generalizes the definition of 3D sparse convolution and provides an implementation of it. The sparse convolution is defined as:

$$\mathbf{x}^{\text{out}}_{\mathbf{u}} = \sum_{\mathbf{i} \in \mathcal{N}^D(\mathbf{u}, \mathcal{C}^{\text{in}})} W_{\mathbf{i}}\, \mathbf{x}^{\text{in}}_{\mathbf{u}+\mathbf{i}} \quad \text{for } \mathbf{u} \in \mathcal{C}^{\text{out}}$$
This definition does not limit the shape of the kernel.
$\mathbf{u}$ is the current convolution center. $\mathcal{N}^D(\mathbf{u}, \mathcal{C}^{\text{in}})$ is a set of offsets that define the shape of the kernel, centered at $\mathbf{u}$. $\mathcal{C}^{\text{in}}$, $\mathcal{C}^{\text{out}}$ are the predefined input and output coordinates of the sparse tensors.
Minkowski Engine uses the COO (coordinate list) format to store sparse tensors, meaning that it stores a list of non-zero coordinates and their corresponding features:
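As a toy illustration of the generalized sparse convolution over a COO-stored tensor, here is a minimal NumPy sketch (not the Minkowski Engine API; the function and variable names are hypothetical). It sums $W_{\mathbf{i}}\,\mathbf{x}^{\text{in}}_{\mathbf{u}+\mathbf{i}}$ only over offsets whose neighbor coordinate actually exists in the input:

```python
import numpy as np

def sparse_conv(coords, feats, W, out_coords):
    """Generalized sparse convolution sketch in COO format.
    coords: (N, D) int array of non-zero input coordinates.
    feats: (N, C_in) features, one row per coordinate.
    W: dict mapping a kernel offset tuple -> (C_in, C_out) weight matrix.
    out_coords: (M, D) output coordinates. Returns (M, C_out) features."""
    index = {tuple(c): i for i, c in enumerate(coords)}  # COO lookup table
    C_out = next(iter(W.values())).shape[1]
    out = np.zeros((len(out_coords), C_out))
    for m, u in enumerate(out_coords):
        for off, Wi in W.items():
            # only accumulate where the neighbor u + i is a non-zero input
            n = index.get(tuple(np.asarray(u) + np.asarray(off)))
            if n is not None:
                out[m] += feats[n] @ Wi
    return out

# 1-D example: two non-zero points, a 3-tap kernel with offsets {-1, 0, +1}
coords = np.array([[0], [2]])
feats = np.array([[1.0], [2.0]])  # C_in = 1
W = {(-1,): np.array([[1.0]]), (0,): np.array([[10.0]]), (1,): np.array([[100.0]])}
out = sparse_conv(coords, feats, W, coords)
print(out)  # [[10.] [20.]]: only the center tap finds an existing neighbor
```

Note that nothing constrains the offsets in `W` to form a dense cube, matching the claim above that the kernel shape is not limited.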
Because the feature map is sparse, we need to dynamically find neighbors between non-zero elements. Minkowski Engine has a mechanism called the coordinate manager that caches and reuses calculated neighbor coordinates. When a
MinkowskiEngine.SparseTensor is initialized, a coordinate manager is also created. The coordinate manager can be accessed via
MinkowskiEngine.SparseTensor.coords_man, and it can be shared with a new sparse tensor by passing it as an argument during initialization.
In the 3D context, the channel number is the length of each point's feature vector.
- Authors: Saining Xie, Jiatao Gu, Demi Guo, Charles R. Qi, Leonidas Guibas, Or Litany
- Venue: ECCV 2020
- Institution: Facebook AI & Stanford
Goal: design a full pipeline for 3D unsupervised pretraining, covering view generation, a downstream-task-compatible backbone, and contrastive loss design. Previous work (pretraining on ShapeNet) doesn't work for two reasons: the domain gap between source and target data, and the lack of point-level representations.
PointContrast Pipeline: contrasting at point-level
- Given a point cloud $\mathbf{x}$, generate two views $\mathbf{x}^1$ and $\mathbf{x}^2$ that are aligned in the same world coordinates.
- Compute the correspondence mapping $M$ between these two views. If $(i, j) \in M$, then points $\mathbf{x}^1_i$ and $\mathbf{x}^2_j$ are a matched pair.
- Randomly sample two geometric transformations $T_1$, $T_2$ (translation, rotation, scaling), and apply one to each view.
- Compute the output point features $\mathbf{f}^1$, $\mathbf{f}^2$ with a shared backbone.
- Backprop to update the network with a point-wise contrastive loss.
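The steps above can be sketched end-to-end in NumPy. This is a toy stand-in: the real backbone is a sparse 3D U-Net, and `random_rigid`, `idx1`, `idx2`, and the linear "backbone" `W` are all hypothetical names for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rigid(rng):
    # a random rotation about z plus a translation (a subset of the
    # paper's transformations; no scaling here for brevity)
    a = rng.uniform(0, 2 * np.pi)
    R = np.array([[np.cos(a), -np.sin(a), 0],
                  [np.sin(a),  np.cos(a), 0],
                  [0, 0, 1]])
    t = rng.uniform(-1, 1, size=3)
    return lambda x: x @ R.T + t

x = rng.normal(size=(100, 3))              # full point cloud
idx1 = rng.choice(100, 70, replace=False)  # view 1: a random subset
idx2 = rng.choice(100, 70, replace=False)  # view 2: another subset

# correspondence mapping M: pairs (i, j) where view-1 point i and
# view-2 point j originate from the same point of the full cloud
matches = [(i, j) for i, a in enumerate(idx1) for j, b in enumerate(idx2) if a == b]

T1, T2 = random_rigid(rng), random_rigid(rng)  # independent transforms
W = rng.normal(size=(3, 8))                    # stand-in linear "backbone"
f1, f2 = T1(x[idx1]) @ W, T2(x[idx2]) @ W      # per-point output features
print(len(matches), f1.shape, f2.shape)
```

A contrastive loss would then be computed over the feature rows indexed by `matches`.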
Contrastive Loss Design
The intuition of a contrastive loss is to separate negative pairs and pull together positive pairs as much as possible:

$$\mathcal{L} = \sum_{(i,j) \in M} \left\{ \left[ d(\mathbf{f}_i, \mathbf{f}_j) - m_p \right]_+^2 + \left[ m_n - \min_{k \in N} d(\mathbf{f}_i, \mathbf{f}_k) \right]_+^2 \right\}$$

$\mathbf{f}_i$, $\mathbf{f}_j$ are the features of matched points. $d(\cdot, \cdot)$: Euclidean distance. $m_p$ and $m_n$ are the positive/negative margins; they are hyperparameters. $N$ is a randomly sampled set of non-matched (negative) points.
The "hardest" sample means the closest non-matched point in the
normalized feature space, because the closest negative point in the feature space is the hardest one to separate.
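A minimal NumPy sketch of this hardest-negative mining with margins (simplified: here every other matched point serves as the negative pool, and `hardest_contrastive`, `m_pos`, `m_neg` are illustrative names, not the paper's code):

```python
import numpy as np

def hardest_contrastive(f1, f2, m_pos=0.1, m_neg=1.4):
    """f1[i] and f2[i] are features of matched points (rows get L2-normalized)."""
    f1 = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    f2 = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    # pairwise Euclidean distances in the normalized feature space
    d = np.linalg.norm(f1[:, None, :] - f2[None, :, :], axis=-1)
    pos = np.diag(d)                      # matched-pair distances
    d_neg = d + np.eye(len(d)) * 1e9      # mask out the positives
    hardest = d_neg.min(axis=1)           # closest non-matched point per row
    loss = (np.maximum(pos - m_pos, 0) ** 2
            + np.maximum(m_neg - hardest, 0) ** 2)
    return loss.mean()

rng = np.random.default_rng(0)
f = rng.normal(size=(32, 8))
# identical views: positive distances are 0, so only the hardest
# negatives inside the margin m_neg contribute to the loss
print(hardest_contrastive(f, f))
```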
PointInfoNCE is derived from InfoNCE. InfoNCE treats contrastive learning as a classification problem and is thus implemented with a cross-entropy loss:

$$\mathcal{L} = -\sum_{(i, j) \in M} \log \frac{\exp(\mathbf{f}^1_i \cdot \mathbf{f}^2_j / \tau)}{\sum_{(\cdot, k) \in M} \exp(\mathbf{f}^1_i \cdot \mathbf{f}^2_k / \tau)}$$

As the temperature $\tau$ increases, the output distribution goes "flat", i.e., the positive pair becomes less distinguishable; as $\tau$ decreases, the output distribution goes sharp, i.e., the positive pair becomes more distinguishable.
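This cross-entropy view, and the effect of the temperature, can be checked with a small NumPy sketch (a simplified stand-in, not the paper's implementation; `point_info_nce` is a hypothetical name and the "correct class" for point $i$ is its matched feature row $i$):

```python
import numpy as np

def point_info_nce(f1, f2, tau=0.07):
    """f1[i] and f2[i] are matched; every other row of f2 is a negative."""
    f1 = f1 / np.linalg.norm(f1, axis=1, keepdims=True)
    f2 = f2 / np.linalg.norm(f2, axis=1, keepdims=True)
    logits = (f1 @ f2.T) / tau                    # similarity of i to all k
    logits -= logits.max(axis=1, keepdims=True)   # stabilize the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # cross-entropy, target k = i

rng = np.random.default_rng(0)
f = rng.normal(size=(16, 8))
sharp = point_info_nce(f, f, tau=0.07)   # low tau: near one-hot softmax
flat = point_info_nce(f, f, tau=10.0)    # high tau: near-uniform softmax
print(sharp < flat)  # True: sharper distribution -> positives dominate
```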
- Domain gap between source and target dataset: alignment, scene context …
- Point-level representation matters: local geometric feature is critical for 3D tasks
- Don’t have to crop objects out from scene context as done in previous works.
- Enables “point-level metric learning”.
What is a "view" and how is it generated?
A view is a partial point cloud seen from a particular viewpoint. We can generate two views by cropping them from a complete point cloud. "Aligning two views in the same world coordinates" means registering them, i.e., finding the common matched points.
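A toy NumPy sketch of this cropping idea (hypothetical setup: two overlapping axis-aligned crops of one cloud; since both crops are already in the same world frame, matched points can be found by exact coordinate identity):

```python
import numpy as np

rng = np.random.default_rng(1)
cloud = rng.uniform(0, 10, size=(500, 3))  # a synthetic "scene"

view1 = cloud[cloud[:, 0] < 6]  # crop: left part of the scene
view2 = cloud[cloud[:, 0] > 4]  # crop: right part, overlapping in 4 < x < 6

# registration degenerates to exact matching here, because both views
# keep their original world coordinates
set2 = {tuple(p) for p in view2}
matches = [i for i, p in enumerate(view1) if tuple(p) in set2]
print(len(matches) > 0)  # True: points in the overlap region are matched pairs
```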
How do we know the mapping between output features?
The SR-UNet's output point cloud has exactly the same coordinates as its input (just like segmentation), so we can use the input points' correspondence mapping to map features in the output sparse tensors.