Learning the What and How of Annotation in Video Object Segmentation
Video Object Segmentation (VOS) is crucial for several applications, from
video editing to video data generation. Training a VOS model requires an
abundance of manually labeled training videos. The de-facto traditional way of
annotating objects requires humans to draw detailed segmentation masks on the
target objects at each video frame. This annotation process, however, is
tedious and time-consuming. To reduce this annotation cost, in this paper, we
propose EVA-VOS, a human-in-the-loop annotation framework for video object
segmentation. Unlike the traditional approach, we introduce an agent that
predicts iteratively both which frame ("What") to annotate and which annotation
type ("How") to use. Then, the annotator annotates only the selected frame that
is used to update a VOS module, leading to significant gains in annotation
time. We conduct experiments on the MOSE and the DAVIS datasets and we show
that: (a) EVA-VOS leads to masks with accuracy close to the human agreement
3.5x faster than the standard way of annotating videos; (b) our frame selection
achieves state-of-the-art performance; (c) EVA-VOS yields significant
performance gains in terms of annotation time compared to all other methods and
baselines. Comment: Accepted to WACV 202
CrashCar101: Procedural Generation for Damage Assessment
In this paper, we are interested in addressing the problem of damage
assessment for vehicles, such as cars. This task requires not only detecting
the location and the extent of the damage but also identifying the damaged
part. To train a computer vision system for the semantic part and damage
segmentation in images, we need to manually annotate images with costly pixel
annotations for both part categories and damage types. To overcome this need,
we propose to use synthetic data to train these models. Synthetic data can
provide samples with high variability, pixel-accurate annotations, and
arbitrarily large training sets without any human intervention. We propose a
procedural generation pipeline that damages 3D car models and we obtain
synthetic 2D images of damaged cars paired with pixel-accurate annotations for
part and damage categories. To validate our idea, we execute our pipeline and
render our CrashCar101 dataset. We run experiments on three real datasets for
the tasks of part and damage segmentation. For part segmentation, we show that
the segmentation models trained on a combination of real data and our synthetic
data outperform all models trained only on real data. For damage segmentation,
we show the sim2real transfer ability of CrashCar101. Comment: Accepted at WACV 202
How to make a pizza: Learning a compositional layer-based GAN model
A food recipe is an ordered set of instructions for preparing a particular
dish. From a visual perspective, every instruction step can be seen as a way to
change the visual appearance of the dish by adding extra objects (e.g., adding
an ingredient) or changing the appearance of the existing ones (e.g., cooking
the dish). In this paper, we aim to teach a machine how to make a pizza by
building a generative model that mirrors this step-by-step procedure. To do so,
we learn composable module operations which are able to either add or remove a
particular ingredient. Each operator is designed as a Generative Adversarial
Network (GAN). Given only weak image-level supervision, the operators are
trained to generate a visual layer that needs to be added to or removed from
the existing image. The proposed model is able to decompose an image into an
ordered sequence of layers by sequentially applying the corresponding removal
modules in the right order. Experimental results on synthetic and real
pizza images demonstrate that our proposed model is able to: (1) segment pizza
toppings in a weakly supervised fashion, (2) remove them by revealing what is
occluded underneath them (i.e., inpainting), and (3) infer the ordering of the
toppings without any depth ordering supervision. Code, data, and models are
available online. Comment: CVPR 201
Training Object Class Detectors from Eye Tracking Data
Training an object class detector typically requires a large set of images annotated with bounding-boxes, which is expensive and time consuming to create. We propose a novel approach to annotate object locations which can substantially reduce annotation time. We first track the eye movements of annotators instructed to find the object and then propose a technique for deriving object bounding-boxes from these fixations. To validate our idea, we collected eye tracking data for the trainval part of 10 object classes of Pascal VOC 2012 (6,270 images, 5 observers). Our technique correctly produces bounding-boxes in 50% of the images, while reducing the total annotation time by a factor of 6.8× compared to drawing bounding-boxes. Any standard object class detector can be trained on the bounding-boxes predicted by our model. Our large scale eye tracking dataset is available at groups.inf.ed.ac.uk/calvin/eyetrackdataset/.
Coarse-to-Fine Annotation Enrichment for Semantic Segmentation Learning
Rich high-quality annotated data is critical for semantic segmentation
learning, yet acquiring dense and pixel-wise ground-truth is both labor- and
time-consuming. Coarse annotations (e.g., scribbles, coarse polygons) offer an
economical alternative, but training on them alone rarely yields
satisfactory performance. In order to generate high-quality
annotated data with a low time cost for accurate segmentation, in this paper,
we propose a novel annotation enrichment strategy, which expands existing
coarse annotations of training data to a finer scale. Extensive experiments on
the Cityscapes and PASCAL VOC 2012 benchmarks have shown that the neural
networks trained with the enriched annotations from our framework yield a
significant improvement over those trained with the original coarse labels. Their
performance is highly competitive with that obtained using human-annotated
dense annotations. The proposed method also outperforms other
state-of-the-art weakly-supervised segmentation methods. Comment: CIKM 2018 International Conference on Information and Knowledge
Managemen
We don't need no bounding-boxes: Training object class detectors using only human verification
Training object class detectors typically requires a large set of images in
which objects are annotated by bounding-boxes. However, manually drawing
bounding-boxes is very time consuming. We propose a new scheme for training
object detectors which only requires annotators to verify bounding-boxes
produced automatically by the learning algorithm. Our scheme iterates between
re-training the detector, re-localizing objects in the training images, and
human verification. We use the verification signal both to improve re-training
and to reduce the search space for re-localisation, which makes these steps
different to what is normally done in a weakly supervised setting. Extensive
experiments on PASCAL VOC 2007 show that (1) using human verification to update
detectors and reduce the search space leads to the rapid production of
high-quality bounding-box annotations; (2) our scheme delivers detectors
performing almost as well as those trained in a fully supervised setting,
without ever drawing any bounding-box; (3) as the verification task is very
quick, our scheme substantially reduces total annotation time by a factor of
6x-9x. Comment: CVPR 2016, pp. 854-863. Las Vegas, N
Training object class detectors with click supervision
Training object class detectors typically requires a large set of images with
objects annotated by bounding boxes. However, manually drawing bounding boxes
is very time consuming. In this paper we greatly reduce annotation time by
proposing center-click annotations: we ask annotators to click on the center of
an imaginary bounding box which tightly encloses the object instance. We then
incorporate these clicks into existing Multiple Instance Learning techniques
for weakly supervised object localization, to jointly localize object bounding
boxes over all training images. Extensive experiments on PASCAL VOC 2007 and MS
COCO show that: (1) our scheme delivers high-quality detectors, performing
substantially better than those produced by weakly supervised techniques, with
a modest extra annotation effort; (2) these detectors in fact perform in a
range close to those trained from manually drawn bounding boxes; (3) as the
center-click task is very fast, our scheme reduces total annotation time by 9x
to 18x. Comment: CVPR 201