LabelFormer: Object Trajectory Refinement for Offboard Perception from LiDAR Point Clouds
A major bottleneck to scaling up training of self-driving perception systems
is the human annotation required for supervision. A promising alternative is
to leverage "auto-labelling" offboard perception models that are trained to
automatically generate annotations from raw LiDAR point clouds at a fraction of
the cost. Auto-labels are most commonly generated via a two-stage approach --
first objects are detected and tracked over time, and then each object
trajectory is passed to a learned refinement model to improve accuracy. Since
existing refinement models are overly complex and lack advanced temporal
reasoning capabilities, in this work we propose LabelFormer, a simple,
efficient, and effective trajectory-level refinement approach. Our approach
first encodes each frame's observations separately, then exploits
self-attention to reason about the trajectory with full temporal context, and
finally decodes the refined object size and per-frame poses. Evaluation on both
urban and highway datasets demonstrates that LabelFormer outperforms existing
works by a large margin. Finally, we show that training on a dataset augmented
with auto-labels generated by our method leads to improved downstream detection
performance compared to existing methods. Please visit the project website for
details: https://waabi.ai/labelformer
Comment: 20 pages, 8 figures, 7 table
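The middle step described in the abstract above, reasoning over a whole trajectory with self-attention, can be illustrated with a minimal sketch. This is not the LabelFormer architecture: it is plain single-head scaled dot-product attention with no learned projections, and the function name `self_attention` and the toy per-frame feature vectors are assumptions made for illustration only.

```python
import math


def self_attention(frames):
    """Single-head scaled dot-product self-attention over per-frame features.

    Illustrative assumption: queries, keys and values are the input vectors
    themselves (no learned projection matrices).
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    outputs = []
    for query in frames:
        # Similarity of this frame to every frame in the trajectory.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a weighted mix of all frames' features.
        mixed = [sum(w * value[i] for w, value in zip(weights, frames))
                 for i in range(dim)]
        outputs.append(mixed)
    return outputs


if __name__ == "__main__":
    trajectory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy per-frame features
    for row in self_attention(trajectory):
        print(row)
```

Every output row depends on all frames, which is the sense in which attention gives each frame access to the full temporal context.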
Activity-conditioned continuous human pose estimation for performance analysis of athletes using the example of swimming
In this paper we consider the problem of human pose estimation in real-world
videos of swimmers. Swimming channels allow filming swimmers simultaneously
above and below the water surface with a single stationary camera. These
recordings can be used to quantitatively assess the athletes' performance. The
quantitative evaluation, so far, requires manual annotations of body parts in
each video frame. We therefore apply the concept of CNNs in order to
automatically infer the required pose information. Starting with an
off-the-shelf architecture, we develop extensions to leverage activity
information - in our case the swimming style of an athlete - and the continuous
nature of the video recordings. Our main contributions are threefold: (a) We
apply and evaluate a fine-tuned Convolutional Pose Machine architecture as a
baseline in our very challenging aquatic environment and discuss its error
modes, (b) we propose an extension to input swimming style information into the
fully convolutional architecture and (c) modify the architecture for continuous
pose estimation in videos. With these additions we achieve reliable pose
estimates with up to +16% more correct body joint detections compared to the
baseline architecture.
Comment: 10 pages, 9 figures, accepted at WACV 201
A Data-Driven Approach for Tag Refinement and Localization in Web Videos
Tagging of visual content is becoming more and more widespread as web-based
services and social networks have popularized tagging functionalities among
their users. These user-generated tags are used to ease browsing and
exploration of media collections, e.g. using tag clouds, or to retrieve
multimedia content. However, not all media are equally tagged by users. With
current systems it is easy to tag a single photo, and even tagging a part of a
photo, like a face, has become common in sites like Flickr and Facebook. On the
other hand, tagging a video sequence is more complicated and time-consuming, so
users tend to tag only the overall content of a video. In this paper we present a
method for automatic video annotation that increases the number of tags
originally provided by users, and localizes them temporally, associating tags
to keyframes. Our approach exploits collective knowledge embedded in
user-generated tags and web sources, and visual similarity of keyframes and
images uploaded to social sites like YouTube and Flickr, as well as web sources
like Google and Bing. Given a keyframe, our method is able to select on the fly
from these visual sources the training exemplars that should be the most
relevant for this test sample, and proceeds to transfer labels across similar
images. Compared to existing video tagging approaches that require training
classifiers for each tag, our system has few parameters, is easy to implement
and can deal with an open vocabulary scenario. We demonstrate the approach on
tag refinement and localization on DUT-WEBV, a large dataset of web videos, and
show state-of-the-art results.
Comment: Preprint submitted to Computer Vision and Image Understanding (CVIU
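The idea of selecting the most relevant exemplars for a keyframe and transferring their labels can be sketched as a nearest-neighbour vote. This is only an illustration under stated assumptions, not the authors' method: the function `transfer_tags`, the Euclidean distance on toy feature vectors, and the parameters `k` and `min_votes` are all hypothetical.

```python
import math
from collections import Counter


def transfer_tags(keyframe, exemplars, k=3, min_votes=2):
    """Transfer tags to a keyframe from its k visually nearest exemplars.

    keyframe:  feature vector (list of floats) of the frame to annotate.
    exemplars: list of (feature_vector, tags) pairs, e.g. tagged web images.
    A tag is kept if at least min_votes of the k neighbours carry it.
    """
    # Rank exemplars by Euclidean distance to the keyframe.
    ranked = sorted(exemplars, key=lambda ex: math.dist(keyframe, ex[0]))
    # Count each tag once per neighbouring exemplar.
    votes = Counter(tag for _, tags in ranked[:k] for tag in set(tags))
    return sorted(tag for tag, count in votes.items() if count >= min_votes)


if __name__ == "__main__":
    exemplars = [
        ([0.0, 0.0], ["cat"]),
        ([0.1, 0.0], ["cat", "sofa"]),
        ([0.0, 0.1], ["sofa"]),
        ([5.0, 5.0], ["car"]),
    ]
    print(transfer_tags([0.0, 0.0], exemplars))
```

Because no per-tag classifier is trained, any tag that appears on a neighbouring exemplar can be proposed, which is one way to read the abstract's open-vocabulary claim.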
Component Composition in Business and System Modelling
Bespoke development of large business systems can be couched in terms of the composition of components, which are, put simply, chunks of development work. Design, the mapping of a specification to an implementation, can also be expressed in terms of components: a refinement comprising an abstract component, a concrete component and a mapping between them. Similarly, system extension is the composition of an existing component, the legacy system, with a new component, the extension. This paper gives an overview of work being done on a UK EPSRC-funded research project formulating and formalizing techniques for describing, composing and performing integrity checks on components. Although the paper focuses on the specification and development of information systems, the techniques are equally applicable to the modelling and re-engineering of businesses, where no computer system may be involved.