PMI Sampler: Patch similarity guided frame selection for Aerial Action Recognition
We present a new algorithm for selecting informative frames in video
action recognition. Our approach is designed for aerial videos captured by a
moving camera, where human actors occupy only a small spatial region of the
video frames. Our algorithm exploits the motion bias within aerial videos,
which enables the selection of motion-salient frames. We introduce the patch
mutual information (PMI) score to quantify the motion bias between adjacent
frames by measuring the similarity of their patches, and we use this score to
assess how much discriminative motion information one frame contains relative
to another. We present an adaptive frame selection strategy using a shifted
leaky ReLU and a cumulative distribution function, which ensures that the
sampled frames comprehensively cover all the essential segments with high
motion salience. Our approach can be integrated with any action recognition
model to enhance its accuracy. In practice, our method achieves a relative
improvement of 2.2–13.8% in top-1 accuracy on UAV-Human, 6.8% on NEC Drone,
and 9.0% on the Diving48 dataset.
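The abstract does not give exact formulas, but the pipeline it describes (a patch-wise mutual-information similarity score between adjacent frames, reshaped by a shifted leaky ReLU and sampled through a cumulative distribution function) can be sketched roughly as follows. All function names, the patch size, and the mapping from PMI to motion salience are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def patch_mi_score(f1, f2, patch=8, bins=16):
    """Hypothetical PMI score: mean mutual information between
    co-located patches of two adjacent grayscale frames.
    Lower MI (less similar patches) suggests more motion."""
    h, w = f1.shape
    scores = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            a = f1[y:y+patch, x:x+patch].ravel()
            b = f2[y:y+patch, x:x+patch].ravel()
            joint, _, _ = np.histogram2d(a, b, bins=bins,
                                         range=[[0, 256], [0, 256]])
            pxy = joint / joint.sum()
            px = pxy.sum(axis=1, keepdims=True)   # marginal of patch a
            py = pxy.sum(axis=0, keepdims=True)   # marginal of patch b
            nz = pxy > 0
            mi = (pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum()
            scores.append(mi)
    return float(np.mean(scores))

def select_frames(frames, k, alpha=0.1, shift=0.0):
    """Adaptive selection sketch: motion salience is a shifted leaky
    ReLU of the negated PMI score; k frames are drawn by inverting the
    CDF of the normalized salience, so high-motion segments get more
    samples. 'alpha' and 'shift' are illustrative hyperparameters."""
    pmi = np.array([patch_mi_score(frames[i], frames[i + 1])
                    for i in range(len(frames) - 1)])
    sal = -pmi - shift                             # low similarity -> high motion
    sal = np.where(sal > 0, sal, alpha * sal)      # leaky ReLU
    sal = sal - sal.min() + 1e-8                   # strictly positive
    cdf = np.cumsum(sal) / sal.sum()
    targets = (np.arange(k) + 0.5) / k             # evenly spaced quantiles
    return sorted(set(int(np.searchsorted(cdf, t)) for t in targets))
```

Because selection is model-agnostic, the returned indices can simply replace uniform sampling in any action recognition data loader.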
fpgaHART: A toolflow for throughput-oriented acceleration of 3D CNNs for HAR onto FPGAs
Surveillance systems, autonomous vehicles, human monitoring systems, and
video retrieval are just a few of the many applications in which 3D
Convolutional Neural Networks are exploited. However, their widespread use is
restricted by their high computational and memory requirements, especially
when they are integrated into systems with limited resources. This study
proposes a toolflow that optimises the mapping of 3D CNN models for Human
Action Recognition onto FPGA devices, taking into account FPGA resources and
off-chip memory characteristics. The proposed system employs Synchronous
Dataflow (SDF) graphs to model the designs and introduces transformations to
expand and explore the design space, resulting in high-throughput designs. A
variety of 3D CNN models were evaluated with the proposed toolflow on multiple
FPGA devices, demonstrating its potential to deliver competitive performance
compared to earlier hand-tuned, model-specific designs.

Comment: 7 pages, 3 figures, 4 tables. arXiv admin note: substantial text
overlap with arXiv:2305.1847
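A core step in any SDF-based toolflow is solving the balance equations of the dataflow graph to obtain each node's firing count (the repetition vector), from which throughput can then be estimated. The helper below is a minimal sketch of that step for a chain of layers, not the paper's actual toolflow; the node names and token rates are invented for illustration.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(edges):
    """Solve the SDF balance equations r[u] * p == r[v] * c for a chain
    of edges (u, v, p, c), where p tokens are produced per firing of u
    and c tokens are consumed per firing of v. Returns the smallest
    positive integer firing counts. Illustrative helper only."""
    r = {edges[0][0]: Fraction(1)}
    for u, v, p, c in edges:
        r[v] = r[u] * p / c          # balance: r[u]*p == r[v]*c
    scale = lcm(*(f.denominator for f in r.values()))
    return {node: int(f * scale) for node, f in r.items()}
```

With per-firing latencies attached to each node, one iteration of the graph takes at least max(r[n] * latency[n]) cycles on a fully pipelined design, which is the quantity a throughput-oriented design-space exploration would minimize.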
Progressive search space reduction for human pose estimation
The objective of this paper is to estimate 2D human pose as a spatial configuration of body parts in TV and movie video shots. Such video material is uncontrolled and extremely challenging. We propose an approach that progressively reduces the search space for body parts, to greatly improve the chances that pose estimation will succeed. This involves two contributions: (i) a generic detector using a weak model of pose to substantially reduce the full pose search space; and (ii) employing 'grabcut', initialized on detected regions proposed by the weak model, to further prune the search space. Moreover, we also propose (iii) an integrated spatio-temporal model covering multiple frames to refine pose estimates from individual frames, with inference using belief propagation. The method is fully automatic and self-initializing, and explains the spatio-temporal volume covered by a person moving in a shot by soft-labeling every pixel as belonging to a particular body part or to the background. We demonstrate upper-body pose estimation through an extensive evaluation over 70,000 frames from four episodes of the TV series Buffy the Vampire Slayer, and present an application to full-body action recognition on the Weizmann dataset.
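The spatio-temporal model above performs inference with belief propagation. A minimal sum-product pass over a chain of parts (a common pictorial-structures simplification, not the paper's exact graph) illustrates the mechanics; all shapes, names, and the chain topology are assumptions for illustration.

```python
import numpy as np

def chain_bp(unary, pairwise):
    """Sum-product belief propagation on a chain of body parts.
    unary: list of (S,) arrays of per-part evidence over S candidate
    states (e.g. locations surviving the pruned search space);
    pairwise: list of (S, S) compatibility matrices between
    consecutive parts. Returns normalized per-part marginal beliefs."""
    n = len(unary)
    fwd = [None] * n                     # messages passed left -> right
    bwd = [None] * n                     # messages passed right -> left
    fwd[0] = np.ones_like(unary[0])
    for i in range(1, n):
        m = pairwise[i - 1].T @ (unary[i - 1] * fwd[i - 1])
        fwd[i] = m / m.sum()
    bwd[-1] = np.ones_like(unary[-1])
    for i in range(n - 2, -1, -1):
        m = pairwise[i] @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    beliefs = []
    for i in range(n):
        b = unary[i] * fwd[i] * bwd[i]   # combine evidence and messages
        beliefs.append(b / b.sum())
    return beliefs
```

On a chain (or any tree), two message passes suffice for exact marginals, which is why pictorial-structures models favor tree-shaped part graphs.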
Multi-Modal Human-Machine Communication for Instructing Robot Grasping Tasks
A major challenge for the realization of intelligent robots is to supply them
with cognitive abilities that allow ordinary users to program them easily and
intuitively. One way of such programming is teaching work tasks by interactive
demonstration. To make this effective and convenient for the user, the machine
must be capable of establishing a common focus of attention and be able to use
and integrate spoken instructions, visual perceptions, and non-verbal cues
such as gestural commands. We report progress in building a hybrid
architecture that combines statistical methods, neural networks, and finite
state machines into an integrated system for instructing grasping tasks by
man-machine interaction. The system combines the GRAVIS-robot for visual
attention and gestural instruction with an intelligent interface for speech
recognition and linguistic interpretation, and a modality fusion module to
allow multi-modal, task-oriented man-machine communication with respect to
dextrous robot manipulation of objects.

Comment: 7 pages, 8 figures
Design and implementation of a user-oriented speech recognition interface: the synergy of technology and human factors
The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems, which can be accessed via a telephone connection. In the design of the interface, a synergy of technology and human factors is achieved. This synergy is very important for making speech interfaces a natural and acceptable form of human-machine interaction. Important concepts such as interfaces, human factors, and speech recognition are discussed. Additionally, an indication is given of how the synergy of human factors and technology can be realised, through a sketch of the interface's implementation. An explanation is also provided of how the interface might be fruitfully integrated into different applications.