Delving into Crispness: Guided Label Refinement for Crisp Edge Detection
Learning-based edge detection usually suffers from predicting thick edges. Through an extensive quantitative study with a new edge crispness measure, we find that noisy human-labeled edges are the main cause of thick predictions. Based on this observation, we advocate that more attention should be paid to label quality than to model design to achieve crisp edge detection. To this end, we propose an effective Canny-guided refinement of human-labeled edges, whose result can be used to train crisp edge detectors. Essentially, it seeks a subset of over-detected Canny edges that best aligns with the human labels. We show that several existing edge detectors can be turned into crisp edge detectors through training on our refined edge maps. Experiments demonstrate that deep models trained with refined edges achieve a significant boost in crispness, from 17.4% to 30.6%. With the PiDiNet backbone, our method improves ODS and OIS by 12.2% and 12.6% on the Multicue dataset, respectively, without relying on non-maximum suppression. We further conduct experiments showing the superiority of our crisp edge detection for optical flow estimation and image segmentation.
Comment: Accepted by TI
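The abstract gives no implementation details, but the core refinement idea lends itself to a minimal sketch: over-detect thin Canny edges, then keep only those that fall near a human-labeled edge. The low Canny thresholds and the tolerance parameter tol below are assumptions for illustration, not values from the paper.

    # Minimal sketch of Canny-guided label refinement (not the authors' code).
    import cv2
    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def refine_labels(image, human_edges, tol=2.0):
        """image: HxW uint8 grayscale; human_edges: HxW binary {0,1} edge map."""
        # Low thresholds deliberately over-detect thin, well-localized edges.
        canny = cv2.Canny(image, 30, 90) > 0
        # Distance from every pixel to the nearest human-labeled edge pixel.
        dist_to_label = distance_transform_edt(1 - human_edges)
        # Keep the subset of Canny edges that aligns with the human labels.
        refined = canny & (dist_to_label <= tol)
        return refined.astype(np.uint8)

Because the kept pixels come from Canny, the refined map is one pixel thin by construction, which is what allows detectors trained on it to predict crisp edges.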
Tensorformer: Normalized Matrix Attention Transformer for High-quality Point Cloud Reconstruction
Surface reconstruction from raw point clouds has been studied for decades in the computer graphics community and is in high demand in today's modeling and rendering applications. Classic solutions, such as Poisson surface reconstruction, require point normals as extra input to produce reasonable results. Modern transformer-based methods can work without normals, but their results are less fine-grained due to limited encoding performance in local fusion from discrete points. We introduce a novel normalized matrix attention transformer (Tensorformer) to perform high-quality reconstruction. The proposed matrix attention allows for simultaneous point-wise and channel-wise message passing, whereas the previous vector attention loses neighbor point information across different channels. It brings more degrees of freedom in feature learning and thus facilitates better modeling of local geometries. Our method achieves state-of-the-art results on two commonly used datasets, ShapeNetCore and ABC, attaining a 4% improvement in IoU on ShapeNet. Our implementation will be released upon acceptance.
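One possible reading of "simultaneous point-wise and channel-wise message passing" is that each output channel attends jointly over neighbors and input channels, rather than receiving one scalar weight per neighbor as in vector attention. The sketch below is an interpretation of that idea, not the released Tensorformer code; shapes and the normalization axis are assumptions.

    # Illustrative matrix attention over a point's K neighbors (PyTorch).
    import torch
    import torch.nn.functional as F

    def matrix_attention(q, k, v):
        """q: (N, C) query features; k, v: (N, K, C) neighbor features."""
        N, K, C = k.shape
        # One score per (neighbor, input-channel, output-channel) triple,
        # so messages pass both point-wise and channel-wise.
        scores = torch.einsum('nc,nkd->nkdc', q, k) / C ** 0.5   # (N, K, C, C)
        # Normalize jointly over neighbors and input channels per output channel.
        attn = F.softmax(scores.reshape(N, K * C, C), dim=1).reshape(N, K, C, C)
        # Output channel c mixes all channels d of all neighbors k.
        return torch.einsum('nkdc,nkd->nc', attn, v)

In contrast, vector attention would apply softmax over K alone and weight each channel of a neighbor by the same pattern, which is the cross-channel information loss the abstract refers to.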
2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds
The commonly adopted detect-then-match approach to registration struggles in cross-modality cases due to incompatible keypoint detection and inconsistent feature description. We propose 2D3D-MATR, a detection-free method for accurate and robust registration between images and point clouds. Our method adopts a coarse-to-fine pipeline: it first computes coarse correspondences between downsampled patches of the input image and the point cloud, and then extends them to form dense correspondences between pixels and points within the patch region. The coarse-level patch matching is based on a transformer that jointly learns global contextual constraints with self-attention and cross-modality correlations with cross-attention. To resolve the scale ambiguity in patch matching, we construct a multi-scale pyramid for each image patch and learn to find, for each point patch, the best matching image patch at a proper resolution level. Extensive experiments on two public benchmarks demonstrate that 2D3D-MATR outperforms the previous state-of-the-art P2-Net by around percentage points on inlier ratio and over points on registration recall. Our code and models are available at https://github.com/minhaolee/2D3DMATR.
Comment: Accepted by ICCV 202
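The multi-scale step can be pictured as follows: given learned patch features at several image pyramid levels, each point-cloud patch selects the image patch and level with the highest feature similarity. This is a hedged sketch of that selection, with assumed names and shapes; it is not the released 2D3D-MATR code.

    # Pick, for each point patch, the best image patch across pyramid levels.
    import torch
    import torch.nn.functional as F

    def coarse_match(img_feats_pyramid, pt_feats):
        """img_feats_pyramid: list of (Mi, C) image-patch features per scale;
        pt_feats: (N, C) point-patch features."""
        pt = F.normalize(pt_feats, dim=-1)
        best_sim = torch.full((pt.shape[0],), -1.0)
        best_level = torch.zeros(pt.shape[0], dtype=torch.long)
        best_img = torch.zeros(pt.shape[0], dtype=torch.long)
        for lvl, img_feats in enumerate(img_feats_pyramid):
            sim = pt @ F.normalize(img_feats, dim=-1).T   # (N, Mi) cosine sim
            val, idx = sim.max(dim=1)
            upd = val > best_sim
            best_sim[upd] = val[upd]
            best_level[upd] = lvl
            best_img[upd] = idx[upd]
        return best_level, best_img

Fine-level pixel-to-point matching would then run only inside each selected patch pair, which is what keeps the dense stage tractable.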
6DOF Pose Estimation of a 3D Rigid Object based on Edge-enhanced Point Pair Features
The point pair feature (PPF) is widely used for 6D pose estimation. In this paper, we propose an efficient 6D pose estimation method based on the PPF framework. We introduce a well-targeted down-sampling strategy that focuses on edge areas for efficient feature extraction from complex geometry. A pose hypothesis validation approach is proposed to resolve symmetric ambiguity by calculating an edge matching degree. We perform evaluations on two challenging datasets and one real-world collected dataset, demonstrating the superiority of our method for pose estimation of geometrically complex, occluded, symmetric objects. We further validate our method by applying it to simulated punctures.
Comment: 16 pages, 20 figures
Image layer separation and application
Image layer separation is an important step for image understanding and facilitates many image processing applications. It aims to separate a single image into multiple image layers, decomposing different components of the image. Image layers are either physics-based, such as the reflectance layer in intrinsic image decomposition, or semantic, such as the occlusion layer in image de-hazing and raindrop removal. Since the number of unknowns is at least twice that of the inputs, image layer separation problems are ill-posed and challenging. To solve such ill-posed problems, traditional methods acquire additional constraints based on prior knowledge, while recent deep learning methods rely on training data. In this thesis, we propose an optimization-based method built on handcrafted priors for video de-fencing (separating fence-like occlusion layers from dynamic videos), and an unsupervised deep learning training scheme that utilizes unlabeled real images from the Internet, applied to highlight separation and intrinsic image decomposition.

Traditional methods make assumptions based on observations and priors to acquire additional constraints, and solve the task as an optimization problem. In this thesis, we solve video de-fencing with a novel bottom-up pipeline based on such a traditional optimization approach. We present a fully automatic approach to detect and segment fence-like occluders from a video clip. Unlike previous approaches, which usually assume either static scenes or static cameras, our method is capable of handling both dynamic scenes and moving cameras.

We then introduce the main challenge of recent deep learning methods for image layer separation: the lack of real-world training data with ground truth. We propose an unsupervised training scheme for training the network on unlabeled real images. This scheme is then applied to two image layer separation problems: highlight separation for facial images, trained on celebrity photos, and non-Lambertian intrinsic image decomposition, trained on customer product photos.

Finally, we demonstrate one application of separated image layers, where we use faces as light probes to estimate environment illumination. This is important for mixed reality applications, such as inserting virtual objects into real photos. Our technique estimates illumination at high precision in the form of a non-parametric environment map, and it works well for both indoor and outdoor scenes.
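The abstract does not spell out the unsupervised objective, but the common pattern for layer separation without ground truth is to require the predicted layers to recompose the input, plus a prior on one layer. The loss below is a sketch under that assumption, illustrated for highlight separation; the sparsity prior and its weight w_sparse are hypothetical choices, not the thesis's actual formulation.

    # Hedged sketch of an unsupervised layer-separation loss (PyTorch).
    import torch

    def unsupervised_layer_loss(image, diffuse, specular, w_sparse=0.1):
        """All tensors: (B, 3, H, W). Predicted layers must sum to the input."""
        # Self-supervision: the layers together must reconstruct the image.
        recon = torch.mean((diffuse + specular - image) ** 2)
        # Assumed prior: highlights are sparse, so penalize specular magnitude.
        sparsity = torch.mean(specular.abs())
        return recon + w_sparse * sparsity

The reconstruction term is what makes unlabeled Internet photos usable as training data: no ground-truth layers are needed, only the input image itself.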
STEdge: Self-training Edge Detection with Multi-layer Teaching and Regularization
Learning-based edge detection has hitherto been strongly supervised with pixel-wise annotations, which are tedious to obtain manually. We study the problem of self-training edge detection, leveraging the untapped wealth of large-scale unlabeled image datasets. We design a self-supervised framework with multi-layer regularization and self-teaching. In particular, we impose a consistency regularization which enforces the outputs from each of the multiple layers to be consistent between the input image and its perturbed counterpart. We adopt L0-smoothing as the 'perturbation' to encourage edge predictions to lie on salient boundaries, following the cluster assumption of self-supervised learning. Meanwhile, the network is trained with multi-layer supervision from pseudo labels, which are initialized with Canny edges and then iteratively refined by the network as training proceeds. The regularization and self-teaching together attain a good balance of precision and recall, leading to a significant performance boost over supervised methods, with lightweight refinement on the target dataset. Furthermore, our method demonstrates strong cross-dataset generalization; for example, it attains a 4.8% improvement in ODS and 5.8% in OIS when tested on the unseen BIPED dataset, compared to state-of-the-art methods.
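The multi-layer consistency term can be sketched concretely: the network's side outputs for an image and for its L0-smoothed counterpart are pushed to agree at every layer. The model interface and the use of MSE as the agreement measure below are assumptions for illustration, not the authors' code.

    # Minimal sketch of the multi-layer consistency regularization (PyTorch).
    import torch.nn.functional as F

    def consistency_loss(model, image, smoothed_image):
        """model returns a list of per-layer edge maps (B, 1, H, W) in [0, 1];
        smoothed_image is the L0-smoothed 'perturbation' of the input."""
        outs = model(image)
        outs_pert = model(smoothed_image)
        # Enforce agreement at every side-output layer, averaged over layers.
        return sum(F.mse_loss(o, p) for o, p in zip(outs, outs_pert)) / len(outs)

Since L0-smoothing flattens texture while keeping salient boundaries, agreement between the two predictions discourages edges inside flat regions, which is how the cluster assumption enters the objective.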
Multi-Resolution Monocular Depth Map Fusion by Self-Supervised Gradient-Based Composition
Monocular depth estimation is a challenging problem on which deep neural networks have demonstrated great potential. However, depth maps predicted by existing deep models usually lack fine-grained details due to convolution operations and down-sampling in networks. We find that increasing the input resolution helps preserve more local details, while estimation at low resolution is more accurate globally. Therefore, we propose a novel depth map fusion module to combine the advantages of estimations from multi-resolution inputs. Instead of merging the low- and high-resolution estimations equally, we adopt the core idea of Poisson fusion, trying to implant the gradient domain of the high-resolution depth into the low-resolution depth. While classic Poisson fusion requires a fusion mask as supervision, we propose a self-supervised framework based on guided image filtering. We demonstrate that this gradient-based composition has much better noise immunity than the state-of-the-art depth map fusion method. Our lightweight depth fusion is one-shot and runs in real time, making it 80x faster than a state-of-the-art depth fusion method. Quantitative evaluations demonstrate that the proposed method can be integrated into many fully convolutional monocular depth estimation backbones with a significant performance boost, leading to state-of-the-art results of detail enhancement on depth maps. Code is released at https://github.com/yuinsky/gradient-based-depth-map-fusion.
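The gradient-domain composition can be illustrated with a much simpler stand-in: keep the low frequencies (global structure) of the low-resolution depth and add the high frequencies (detail) of the high-resolution depth. The box-filter low-pass split below is an assumption for illustration; the actual method uses guided image filtering and Poisson-style fusion, which this sketch does not reproduce.

    # Hedged sketch of base/detail depth fusion (PyTorch), not the paper's method.
    import torch
    import torch.nn.functional as F

    def gradient_fuse(depth_low, depth_high, ksize=9):
        """depth_low, depth_high: (B, 1, H, W), already at the same resolution."""
        pad = ksize // 2
        kernel = torch.ones(1, 1, ksize, ksize) / (ksize * ksize)
        lowpass = lambda d: F.conv2d(F.pad(d, [pad] * 4, mode='replicate'), kernel)
        # Base (globally accurate) from the low-res estimate,
        # detail (fine-grained gradients) from the high-res estimate.
        return lowpass(depth_low) + (depth_high - lowpass(depth_high))

The one-shot, filter-only nature of this kind of composition is what makes the fusion lightweight compared with optimization-based Poisson solvers.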
Monocular depth estimation is a challenging problem on which deep neural networks have demonstrated great potential. However, depth maps predicted by existing deep models usually lack fine-grained details due to convolution operations and down-samplings in networks. We find that increasing input resolution is helpful to preserve more local details while the estimation at low resolution is more accurate globally. Therefore, we propose a novel depth map fusion module to combine the advantages of estimations with multi-resolution inputs. Instead of merging the low- and high-resolution estimations equally, we adopt the core idea of Poisson fusion, trying to implant the gradient domain of high-resolution depth into the low-resolution depth. While classic Poisson fusion requires a fusion mask as supervision, we propose a self-supervised framework based on guided image filtering. We demonstrate that this gradient-based composition performs much better at noisy immunity, compared with the state-of-the-art depth map fusion method. Our lightweight depth fusion is one-shot and runs in real-time, making it 80X faster than a state-of-the-art depth fusion method. Quantitative evaluations demonstrate that the proposed method can be integrated into many fully convolutional monocular depth estimation backbones with a significant performance boost, leading to state-of-the-art results of detail enhancement on depth maps. Codes are released at https://github.com/yuinsky/gradient-based-depth-map-fusion