93 research outputs found
Real-time Quantitative Visual Inspection using Extended Reality
In this study, we propose a technique for quantitative visual inspection that can quantify structural damage using extended reality (XR). The XR headset can display and overlay graphical information on the physical space and process data from its built-in camera and depth sensor. The device also permits accessing and analyzing image and video streams in real time and utilizing 3D meshes of the environment together with camera pose information. By leveraging these features of the XR headset, we build a workflow and graphical interface to capture images, segment damage regions, and evaluate the physical size of damage. A deep learning-based interactive segmentation algorithm called f-BRS is deployed to precisely segment damage regions through the XR headset. A ray-casting algorithm is implemented to obtain the 3D locations corresponding to the pixel locations of the damage region in the image. The size of the damage region is then computed from the 3D locations of its boundary. The performance of the proposed method is demonstrated through a field experiment at an in-service bridge with spalling damage at its abutment. The experiment shows that the proposed method provides sub-centimeter accuracy for size estimation.
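To make the size-estimation step concrete, here is a minimal sketch of how a damage area could be computed from the 3D boundary locations the abstract describes: back-project boundary pixels using depth and camera pose, fit a plane to the boundary, and apply the shoelace formula in the plane's coordinates. The function names and the depth/pose inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def backproject(pixels, depths, K, cam_to_world):
    """Back-project pixel coordinates with depth into 3D world points.

    pixels: (N, 2) array of (u, v) image coordinates
    depths: (N,) metric depth per pixel (e.g., from the headset's depth sensor)
    K: 3x3 camera intrinsics; cam_to_world: 4x4 camera pose
    """
    uv1 = np.hstack([pixels, np.ones((len(pixels), 1))])
    rays = (np.linalg.inv(K) @ uv1.T).T            # camera-frame rays with z = 1
    pts_cam = rays * depths[:, None]               # scale each ray by its depth
    pts_h = np.hstack([pts_cam, np.ones((len(pts_cam), 1))])
    return (cam_to_world @ pts_h.T).T[:, :3]

def damage_area(boundary_3d):
    """Estimate the area enclosed by an ordered 3D boundary polygon.

    Fits a plane via SVD, projects the boundary onto it, and applies
    the shoelace formula in the plane's 2D coordinates.
    """
    centered = boundary_3d - boundary_3d.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    xy = centered @ vt[:2].T                       # project to in-plane coords
    x, y = xy[:, 0], xy[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
```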
CFR-ICL: Cascade-Forward Refinement with Iterative Click Loss for Interactive Image Segmentation
Click-based interactive segmentation aims to extract the object of interest from an image under the guidance of user clicks. Recent work has achieved strong overall performance by reusing the segmentation from the previous output. However, in most state-of-the-art approaches, 1) the inference stage involves inflexible heuristic rules and a separate refinement model, and 2) training cannot balance the number of user clicks against model performance. To address these challenges, we propose a click-based and mask-guided interactive image segmentation framework containing three novel components: Cascade-Forward Refinement (CFR), Iterative Click Loss (ICL), and SUEM image augmentation. The proposed ICL allows model training to improve segmentation and reduce user interactions simultaneously. The CFR offers a unified inference framework that generates segmentation results in a coarse-to-fine manner. The proposed SUEM augmentation is a comprehensive way to create large and diverse training sets for interactive image segmentation. Extensive experiments demonstrate the state-of-the-art performance of the proposed approach on five public datasets. Remarkably, our model achieves an average NoC@95 of 2.9 and 7.5 clicks on the Berkeley and DAVIS sets, respectively, improving on the previous state-of-the-art results by 33.2% and 15.5%. The code and trained model are available at https://github.com/TitorX/CFR-ICL-Interactive-Segmentation
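The NoC@95 numbers above come from the standard click-simulation protocol for interactive segmentation: clicks are placed automatically at the point deepest inside the largest remaining error region until the IoU reaches the threshold. Below is a hedged sketch of that evaluation loop; `predict` is a hypothetical stand-in for any click-based model such as CFR-ICL.

```python
import numpy as np
from scipy import ndimage

def simulate_clicks(predict, gt_mask, iou_thresh=0.95, max_clicks=20):
    """Standard NoC evaluation loop for click-based segmentation.

    `predict(clicks, prev_mask)` is a stand-in for the model: it takes the
    accumulated (y, x, is_positive) clicks plus the previous binary mask
    and returns a new binary mask. `gt_mask` is a boolean array.
    """
    clicks, pred = [], np.zeros_like(gt_mask, dtype=bool)
    for n in range(1, max_clicks + 1):
        error = gt_mask != pred                      # false positives + negatives
        labels, num = ndimage.label(error)
        if num == 0:
            return n - 1
        sizes = ndimage.sum(error, labels, range(1, num + 1))
        region = labels == (np.argmax(sizes) + 1)    # largest error region
        dist = ndimage.distance_transform_edt(region)
        y, x = np.unravel_index(np.argmax(dist), dist.shape)
        clicks.append((y, x, bool(gt_mask[y, x])))   # positive if under the GT mask
        pred = predict(clicks, pred).astype(bool)
        iou = (gt_mask & pred).sum() / max((gt_mask | pred).sum(), 1)
        if iou >= iou_thresh:
            return n
    return max_clicks
```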
Interactive Class-Agnostic Object Counting
We propose a novel framework for interactive class-agnostic object counting, where a human user can interactively provide feedback to improve the accuracy of a counter. Our framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, we produce a density map showing the current prediction and segment it into non-overlapping regions, each with an easily verifiable number of objects. The user can provide feedback by selecting a region with obvious counting errors and specifying a range for the number of objects within it. To improve the counting result, we develop a novel adaptation loss that forces the visual counter's predicted count to fall within the user-specified range. For effective and efficient adaptation, we propose a refinement module that can be used with any density-based visual counter; only the parameters of the refinement module are updated during adaptation. Our experiments on two challenging class-agnostic object counting benchmarks, FSCD-LVIS and FSC-147, show that our method can reduce the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input. Our project can be found at https://yifehuang97.github.io/ICACountProjectPage/
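The adaptation loss described above can be sketched as a simple hinge on the region count: zero when the predicted count (the density map summed over the user-selected region) lies inside the user-specified range, and linear outside it. This is an illustrative reconstruction, not the paper's exact formulation; `refiner` in the commented usage is hypothetical.

```python
import torch

def range_adaptation_loss(density_map, region_mask, lo, hi):
    """Penalize a density-based counter when its count over a user-selected
    region falls outside the user-specified range [lo, hi].

    A minimal stand-in for the paper's adaptation loss: zero inside the
    range, linear outside it, so gradients push the count into range.
    """
    count = (density_map * region_mask).sum()
    return torch.relu(count - hi) + torch.relu(lo - count)

# Adaptation step: only the refinement module's parameters are updated,
# keeping the base counter frozen. `refiner` is a hypothetical module.
# optimizer = torch.optim.Adam(refiner.parameters(), lr=1e-4)
# loss = range_adaptation_loss(density_map, region_mask, lo=8, hi=12)
# loss.backward(); optimizer.step()
```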
Neural Interactive Keypoint Detection
This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which reduces the labeling cost of 2D keypoint annotation by more than 10 times compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct predicted keypoints interactively, yielding a faster and more effective annotation process. Specifically, we design a pose error modeling strategy that feeds the ground-truth pose, combined with four typical pose errors, into the decoder and trains the model to reconstruct the correct poses, which enhances the model's self-correction ability. We then attach an interactive human-feedback loop that receives users' clicks to correct one or several predicted keypoints and iteratively uses the decoder to update all other keypoints, minimizing the number of clicks (NoC) needed for efficient annotation. We validate Click-Pose on in-domain and out-of-domain scenes and on a new task of keypoint adaptation. For annotation, Click-Pose needs only 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, reducing annotation effort by 31.4% and 36.3%, respectively, compared with the SOTA model (ViTPose) with manual correction. Moreover, without user clicks, Click-Pose surpasses the previous end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. The code is available at https://github.com/IDEA-Research/Click-Pose. (Accepted to ICCV 2023.)
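The pose error modeling strategy can be illustrated as follows: corrupt ground-truth keypoints with a few typical annotation errors (small jitter, large misses, displacements, and left/right swaps) and train the decoder to reconstruct the clean pose. The error types, probabilities, and `swap_pairs` indices below are illustrative assumptions rather than Click-Pose's exact recipe.

```python
import numpy as np

def corrupt_pose(keypoints, sigma=0.05, p=0.25, swap_pairs=((5, 6), (7, 8))):
    """Corrupt a ground-truth pose with typical annotation errors so a
    decoder can be trained to reconstruct the correct pose.

    keypoints: (K, 2) normalized coordinates. The error taxonomy and the
    swap indices (e.g., COCO shoulders 5/6, elbows 7/8) are illustrative.
    """
    kps = keypoints.copy()
    for i in range(len(kps)):
        err = np.random.choice(["none", "jitter", "miss", "shift"],
                               p=[1 - 3 * p, p, p, p])
        if err == "jitter":
            kps[i] += np.random.normal(0, sigma, 2)      # small localization noise
        elif err == "miss":
            kps[i] += np.random.normal(0, 5 * sigma, 2)  # large off-target error
        elif err == "shift":
            kps[i] += np.random.uniform(-0.2, 0.2, 2)    # displaced annotation
    # Swap error: exchange symmetric (left/right) keypoints at random
    for a, b in swap_pairs:
        if np.random.rand() < p:
            kps[[a, b]] = kps[[b, a]]
    return kps
```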
DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer
Most state-of-the-art instance segmentation methods rely on large amounts of pixel-precise ground-truth annotations for training, which are expensive to create. Interactive segmentation networks help generate such annotations from an image and the corresponding user interactions, such as clicks. Existing methods for this task can process only a single instance at a time, and each user interaction requires a full forward pass through the entire deep network. We introduce a more efficient approach, called DynaMITe, in which we represent user interactions as spatio-temporal queries to a Transformer decoder with the potential to segment multiple object instances in a single iteration. Our architecture also removes the need to re-compute image features during refinement and requires fewer interactions for segmenting multiple instances in a single image compared with other methods. DynaMITe achieves state-of-the-art results on multiple existing interactive segmentation benchmarks, as well as on the new multi-instance benchmark that we propose in this paper.
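The click-as-query idea can be sketched in a few lines: each click is embedded from its position, polarity, and iteration index, and the resulting queries cross-attend to image features that are computed once and cached, which is what allows refinement without a full forward pass per interaction. The layer sizes and embedding choices below are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ClickQueryDecoder(nn.Module):
    """Minimal sketch of click-as-query decoding: every user click becomes a
    query token built from its position, polarity, and iteration index, and
    all clicks (across instances) attend to *cached* image features, so
    refinement never re-runs the backbone. Dimensions are illustrative.
    """
    def __init__(self, dim=256, heads=8, layers=3):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.type_emb = nn.Embedding(2, dim)    # positive / negative click
        self.time_emb = nn.Embedding(64, dim)   # refinement iteration index
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)

    def forward(self, click_xy, click_type, click_t, image_feats):
        # click_xy: (B, N, 2) floats; click_type, click_t: (B, N) long tensors
        # image_feats: (B, HW, C) precomputed once per image and cached
        q = self.pos_mlp(click_xy) + self.type_emb(click_type) + self.time_emb(click_t)
        # Query tokens cross-attend to the cached image features
        return self.decoder(q, image_feats)     # (B, N, C) refined query features
```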