Fluid Annotation: A Human-Machine Collaboration Interface for Full Image Annotation
We introduce Fluid Annotation, an intuitive human-machine collaboration
interface for annotating the class label and outline of every object and
background region in an image. Fluid annotation is based on three principles:
(I) Strong Machine-Learning aid. We start from the output of a strong neural
network model, which the annotator can edit by correcting the labels of
existing regions, adding new regions to cover missing objects, and removing
incorrect regions. The edit operations are also assisted by the model. (II)
Full image annotation in a single pass. As opposed to performing a series of
small annotation tasks in isolation, we propose a unified interface for full
image annotation in a single pass. (III) Empower the annotator. We empower the
annotator to choose what to annotate and in which order. This lets the
annotator concentrate on what the machine does not already know, i.e. spend
human effort only on the errors the machine made, which helps use the
annotation budget effectively. Through extensive experiments on the COCO+Stuff
dataset, we
demonstrate that Fluid Annotation leads to accurate annotations very
efficiently, taking three times less annotation time than the popular LabelMe
interface.
Comment: ACM Multimedia 2018. A live demo is available at fluidann.appspot.com
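The edit workflow described above can be pictured as a small set of operations applied, in whatever order the annotator chooses, to the machine's initial full-image proposal. Below is a minimal Python sketch of that interface logic; the class and method names (Region, Annotation, correct_label, add_region, remove_region) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three Fluid Annotation edit operations described above.
# All names are illustrative assumptions, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class Region:
    mask_id: int   # index into the pool of machine-proposed segments
    label: str     # class label, e.g. "person" or "grass"


@dataclass
class Annotation:
    regions: list[Region] = field(default_factory=list)

    def correct_label(self, mask_id: int, new_label: str) -> None:
        # Edit 1: fix the class label of an existing machine-proposed region.
        for r in self.regions:
            if r.mask_id == mask_id:
                r.label = new_label

    def add_region(self, mask_id: int, label: str) -> None:
        # Edit 2: add a machine-proposed segment covering a missing object.
        self.regions.append(Region(mask_id, label))

    def remove_region(self, mask_id: int) -> None:
        # Edit 3: drop an incorrect region entirely.
        self.regions = [r for r in self.regions if r.mask_id != mask_id]
```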
Weakly Supervised Video Salient Object Detection via Point Supervision
Video salient object detection models trained on pixel-wise dense annotation
have achieved excellent performance, yet obtaining pixel-by-pixel annotated
datasets is laborious. Several works attempt to use scribble annotations to
mitigate this problem, but point supervision, a more labor-saving annotation
method (indeed the most labor-saving among manual annotation methods for dense
prediction), has not been explored. In this paper, we propose a strong
baseline model based on point supervision. To infer saliency maps with temporal
information, we mine inter-frame complementary information from short-term and
long-term perspectives. Specifically, we propose a hybrid token
attention module, which mixes optical flow and image information from
orthogonal directions, adaptively highlighting critical optical flow
information (channel dimension) and critical token information (spatial
dimension). To exploit long-term cues, we develop the Long-term Cross-Frame
Attention module (LCFA), which assists the current frame in inferring salient
objects based on multi-frame tokens. Furthermore, we label two point-supervised
datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and DAVSOD datasets.
Experiments on six benchmark datasets show that our method outperforms the
previous state-of-the-art weakly supervised methods and is even comparable with
some fully supervised approaches. Source code and datasets are available.
Comment: accepted by ACM MM 2022
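The abstract describes mixing optical-flow and image information along orthogonal directions: a channel-wise gate for flow and a spatial (token-wise) gate for image features. The sketch below is a hedged, schematic PyTorch interpretation of that idea only; the layer choices, gating design, and tensor shapes are assumptions, not the paper's hybrid token attention module.

```python
# Schematic channel-/spatial-attention mix over flow and image features,
# in the spirit of the hybrid token attention described above (assumed design).
import torch
import torch.nn as nn


class HybridTokenAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Channel gate: which optical-flow channels carry critical motion cues.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial gate: which image locations (tokens) are likely salient.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(dim, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, img_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
        # img_feat, flow_feat: (B, C, H, W) features from the two modalities.
        flow_weighted = flow_feat * self.channel_gate(flow_feat)  # channel dimension
        img_weighted = img_feat * self.spatial_gate(img_feat)     # spatial dimension
        return self.fuse(torch.cat([img_weighted, flow_weighted], dim=1))


# Usage: HybridTokenAttention(64)(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```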
Feature Decoupling-Recycling Network for Fast Interactive Segmentation
Recent interactive segmentation methods iteratively take the source image, user
guidance, and previously predicted mask as input without considering the
invariant nature of the source image. As a result, feature extraction from the
source image is repeated in every interaction, leading to substantial
computational redundancy. In this work, we propose the Feature
Decoupling-Recycling Network (FDRN), which decouples the modeling components
based on their intrinsic discrepancies and then recycles components for each
user interaction. Thus, the efficiency of the whole interactive process can be
significantly improved. To be specific, we apply the Decoupling-Recycling
strategy from three perspectives to address three types of discrepancies,
respectively. First, our model decouples the learning of source image semantics
from the encoding of user guidance to process two types of input domains
separately. Second, FDRN decouples high-level and low-level features from
stratified semantic representations to enhance feature learning. Third, during
the encoding of user guidance, current user guidance is decoupled from
historical guidance to highlight the effect of current user guidance. We
conduct extensive experiments on 6 datasets from different domains and
modalities, which demonstrate the following merits of our model: 1) higher
efficiency than other methods, particularly advantageous in challenging
scenarios requiring long-term interactions (up to 4.25x faster), while
achieving favorable segmentation performance; 2) strong applicability to
various methods, serving as a universal enhancement technique; 3) good
cross-task generalizability, e.g., to medical image segmentation, and
robustness against misleading user guidance.
Comment: Accepted to ACM MM 2023
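The core efficiency argument is that the source image is invariant across interactions, so its (expensive) features can be computed once and recycled, while only the lightweight guidance path runs per click. The sketch below is a minimal illustration of that caching pattern, not FDRN itself; the encoder names and the simple additive fusion are assumptions.

```python
# Minimal sketch of the decouple-and-recycle idea described above: encode the
# source image once, cache it, and re-run only the guidance encoder per click.
# Module names and the fusion scheme are assumptions, not FDRN's actual design.
import torch
import torch.nn as nn


class RecyclingSegmenter(nn.Module):
    def __init__(self, image_encoder: nn.Module, guidance_encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder        # heavy, image-only (invariant across clicks)
        self.guidance_encoder = guidance_encoder  # light, sees clicks + previous mask
        self.head = head
        self._cached_img_feat = None

    def forward(self, image: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        if self._cached_img_feat is None:
            # First interaction: pay the full image-encoding cost once.
            self._cached_img_feat = self.image_encoder(image)
        # Later interactions recycle the cached image features.
        guide_feat = self.guidance_encoder(guidance)
        return self.head(self._cached_img_feat + guide_feat)

    def reset(self) -> None:
        # Call when a new image is loaded, so its features are re-encoded.
        self._cached_img_feat = None
```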