4,082 research outputs found
Seeing the Intangible: Surveying Automatic High-Level Visual Understanding from Still Images
The field of Computer Vision (CV) was born with the single grand goal of
complete image understanding: providing a complete semantic interpretation of
an input image. What exactly this goal entails is not immediately
straightforward, but theoretical hierarchies of visual understanding point
towards a top level of full semantics, within which sits the most complex and
subjective information humans can detect from visual data. In particular,
non-concrete concepts including emotions, social values and ideologies seem to
be protagonists of this "high-level" visual semantic understanding. While such
"abstract concepts" are critical tools for image management and retrieval,
their automatic recognition is still a challenge, exactly because they rest at
the top of the "semantic pyramid": the well-known semantic gap problem is
worsened given their lack of unique perceptual referents, and their reliance on
more unspecific features than concrete concepts. Given that there seems to be
very scarce explicit work within CV on the task of abstract social concept
(ASC) detection, and that many recent works seem to discuss similar
non-concrete entities by using different terminology, in this survey we provide
a systematic review of CV work that explicitly or implicitly approaches the
problem of abstract (specifically social) concept detection from still images.
Specifically, this survey performs and provides: (1) A study and clustering of
high level visual understanding semantic elements from a multidisciplinary
perspective (computer science, visual studies, and cognitive perspectives); (2)
A study and clustering of high level visual understanding computer vision tasks
dealing with the identified semantic elements, so as to identify current CV
work that implicitly deals with AC detection
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI
Cortex, countercurrent context, and dimensional integration of lifetime memory
The correlation between relative neocortex size and longevity in mammals encourages a search for a cortical function specifically related to the life-span. A candidate in the domain of permanent and cumulative memory storage is proposed and explored in relation to basic aspects of cortical organization. The pattern of cortico-cortical connectivity between functionally specialized areas and the laminar organization of that connectivity converges on a globally coherent representational space in which contextual embedding of information emerges as an obligatory feature of cortical function. This brings a powerful mode of inductive knowledge within reach of mammalian adaptations, a mode which combines item specificity with classificatory generality. Its neural implementation is proposed to depend on an obligatory interaction between the oppositely directed feedforward and feedback currents of cortical activity, in countercurrent fashion. Direct interaction of the two streams along their cortex-wide local interface supports a scheme of "contextual capture" for information storage responsible for the lifelong cumulative growth of a uniquely cortical form of memory termed "personal history." This approach to cortical function helps elucidate key features of cortical organization as well as cognitive aspects of mammalian life history strategies
Advanced Sensing and Image Processing Techniques for Healthcare Applications
This Special Issue aims to attract the latest research and findings in the design, development and experimentation of healthcare-related technologies. This includes, but is not limited to, using novel sensing, imaging, data processing, machine learning, and artificially intelligent devices and algorithms to assist/monitor the elderly, patients, and the disabled population
ModDrop: adaptive multi-modal gesture recognition
We present a method for gesture detection and localisation based on
multi-scale and multi-modal deep learning. Each visual modality captures
spatial information at a particular spatial scale (such as motion of the upper
body or a hand), and the whole system operates at three temporal scales. Key to
our technique is a training strategy which exploits: i) careful initialization
of individual modalities; and ii) gradual fusion involving random dropping of
separate channels (dubbed ModDrop) for learning cross-modality correlations
while preserving uniqueness of each modality-specific representation. We
present experiments on the ChaLearn 2014 Looking at People Challenge gesture
recognition track, in which we placed first out of 17 teams. Fusing multiple
modalities at several spatial and temporal scales leads to a significant
increase in recognition rates, allowing the model to compensate for errors of
the individual classifiers as well as noise in the separate channels.
Futhermore, the proposed ModDrop training technique ensures robustness of the
classifier to missing signals in one or several channels to produce meaningful
predictions from any number of available modalities. In addition, we
demonstrate the applicability of the proposed fusion scheme to modalities of
arbitrary nature by experiments on the same dataset augmented with audio.Comment: 14 pages, 7 figure
- …