Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises the cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. The comprehensive problem-oriented review
of the advances in transfer learning with respect to each problem has not only
revealed the challenges in transfer learning for visual recognition, but also
the problems (e.g. eight of the seventeen problems) that have been scarcely
studied. This survey not only presents an up-to-date technical review for
researchers, but also offers a systematic approach and a reference for machine
learning practitioners to categorise a real problem and look up a
possible solution accordingly.
TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity Recognition
Recent achievements in language models have showcased their extraordinary
capabilities in bridging visual information with semantic language
understanding. This leads us to a novel question: can language models connect
textual semantics with IoT sensory signals to perform recognition tasks, e.g.,
Human Activity Recognition (HAR)? If so, an intelligent HAR system with
human-like cognition can be built, capable of adapting to new environments and
unseen categories. This paper explores its feasibility with an innovative
approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which jointly
aligns textual embeddings with IoT sensor signals, including camera video,
LiDAR, and mmWave. Through the IoT-language contrastive learning, we derive a
unified semantic feature space that aligns multi-modal features with language
embeddings, so that the IoT data corresponds to specific words that describe
the IoT data. To enhance the connection between textual categories and their
IoT data, we propose supplementary descriptions and learnable prompts that
bring more semantic information into the joint feature space. TENT can not only
recognize actions that have been seen but also "guess" an unseen action from
the closest textual words in the feature space. We demonstrate that TENT achieves
state-of-the-art performance on zero-shot HAR tasks using different modalities,
improving on the best vision-language models by over 12%.
Comment: Preprint manuscript in submission
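The zero-shot "guess" step described in this abstract can be illustrated with a minimal sketch: a sensor embedding is assigned the label whose text embedding is closest in the shared feature space. The toy vectors, the label names, and the function names below are hypothetical and are not taken from the TENT paper; a real system would obtain the embeddings from trained sensor and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(sensor_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the sensor embedding."""
    return max(
        text_embeddings,
        key=lambda label: cosine_similarity(sensor_embedding, text_embeddings[label]),
    )

# Hypothetical text embeddings for three activity labels (toy values).
text_embeddings = {
    "walking": [0.9, 0.1, 0.0],
    "sitting": [0.0, 1.0, 0.1],
    "waving": [0.1, 0.0, 1.0],
}

# Hypothetical embedding of a sensor recording whose class was unseen in training.
sensor_embedding = [0.2, 0.1, 0.8]
prediction = zero_shot_predict(sensor_embedding, text_embeddings)
print(prediction)
```

Because the label set is just a dictionary of text embeddings, a new category can be added at inference time without retraining, which is the property that makes this kind of alignment suitable for zero-shot recognition.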
Recognition, Analysis, and Assessments of Human Skills using Wearable Sensors
One of the biggest social issues in mature societies such as Europe and Japan
is the aging population and declining birth rate. These societies face a serious
problem with the retirement of expert workers, doctors, engineers, etc.
The retirement and injuries of experts are an especially serious problem in sectors such as medicine and industry, where training an expert takes a long time. Technology to support the training and assessment of skilled workers (such as doctors and manufacturing
workers) is therefore strongly needed by society. Although some solutions to
this problem exist, most of them are video-based, which violates the privacy of the subjects.
Furthermore, they are not easy to deploy because they need large amounts of training data.
This thesis provides a novel framework to recognize, analyze, and assess human
skills with minimum customization cost. The presented framework tackles this problem
in two different domains, industrial setup and medical operations of catheter-based
cardiovascular interventions (CBCVI).
In particular, the contributions of this thesis are four-fold. First, it proposes an
easy-to-deploy framework for human activity recognition based on a zero-shot learning
approach that builds on learning basic actions and objects. The model recognizes
unseen activities as combinations of previously learned basic actions and the objects involved. Therefore, it is completely configurable by the user and can be used to detect completely new activities.
Second, a novel gaze-estimation model for the attention-driven object detection task is
presented. The key features of the model are: (i) the use of deformable convolutional
layers to better incorporate spatial dependencies of different shapes of objects and
backgrounds, (ii) the formulation of the gaze-estimation problem in two different ways, as a
classification problem as well as a regression problem. We combine both formulations using a
joint loss that incorporates both the cross-entropy and the mean-squared error in
order to train our model. This improved the accuracy of the model from 6.8 when using only
the cross-entropy loss to 6.4 with the joint loss.
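A joint loss of the kind described above can be sketched as a sum of a cross-entropy term (classification view) and a mean-squared-error term (regression view). This is only an illustration: the weighting factor `mse_weight`, the function names, and the toy inputs are assumptions, and the abstract does not state how the thesis balances the two terms.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def mean_squared_error(predicted, target):
    """Mean of squared differences between predicted and target coordinates."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def joint_loss(logits, target_index, predicted, target, mse_weight=1.0):
    """Cross-entropy plus weighted MSE; mse_weight is a hypothetical hyperparameter."""
    return cross_entropy(logits, target_index) + mse_weight * mean_squared_error(predicted, target)

# Toy example: 4 gaze bins (classification) and a 2-D gaze point (regression).
loss = joint_loss([0.0, 0.0, 0.0, 0.0], 1, [1.0, 2.0], [1.0, 4.0])
print(loss)
```

With uniform logits over four bins the cross-entropy term equals log 4, and the regression term equals 2.0 for the toy gaze point, so the two parts of the loss can be inspected separately.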
The third contribution of this thesis targets the quantification of the quality of
actions using wearable sensors. To address the variety of scenarios, we have targeted two
possibilities: a) both expert and novice data are available, b) only expert data is available,
a quite common case in safety-critical scenarios.
The methods developed for both scenarios are based on deep learning. In the
first one, we use autoencoders with a One-Class SVM, and in the second one we use
Siamese networks. These methods allow us to encode the expert's expertise and to learn
the differences between novice and expert workers. This enables quantification of the
performance of the novice in comparison to the expert worker.
The fourth contribution explicitly targets medical practitioners and provides a
methodology for novel gaze-based temporal-spatial analysis of CBCVI data. The developed
methodology allows continuous registration and analysis of gaze data in order to study
the visual X-ray image processing (XRIP) strategies of expert operators in live-case scenarios, and may assist in transferring experts' reading skills to novices.
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis achieved outstanding
performance and demonstrated the effectiveness of 3D representation for action
recognition. The existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI)
Cross-modal learning from visual information for activity recognition on inertial sensors
The lack of large-scale, labeled datasets impedes progress in developing robust and generalized predictive models for human activity recognition (HAR) from wearable inertial sensor data. Labeled data is scarce because sensor data collection is expensive, and its annotation is time-consuming and error-prone. As a result, public inertial HAR datasets are small in terms of number of subjects, activity classes, hours of recorded data, and variation in recorded environments. Machine learning models developed using these small datasets are effectively blind to the diverse expressions of activities performed by wide-ranging populations in the real world, and progress in wearable inertial sensing is held back by this bottleneck for activity understanding.
But just as Internet-scale text, image and audio data have pushed their respective pattern recognition fields to systems reliable enough for everyday use, easy access to large quantities of data can push forward the field of inertial HAR, and by extension wearable sensing. To this end, this thesis pioneers the idea of exploiting the visual modality as a source domain for cross-modal learning, such that data and knowledge can be transferred across to benefit the target domain of inertial HAR.
This thesis makes three contributions to inertial HAR through cross-modal approaches. First, to overcome the barrier of expensive inertial data collection and annotation, we contribute a novel pipeline that automatically extracts virtual accelerometer data from videos of human activities, which are readily annotated and accessible in large quantities. Second, we propose acquiring transferable representations about activities from HAR models trained using large quantities of visual data, to enrich the development of inertial HAR models. Finally, the third contribution exposes HAR models to the challenging setting of zero-shot learning; we propose mechanisms that leverage cross-modal correspondence to enable inference on previously unseen classes.
Unlike prior approaches, this body of work pushes forward the state of the art in HAR not by exhausting resources concentrated in the inertial domain, but by exploiting an existing, resourceful, intuitive, and informative source, the visual domain. These contributions represent a new line of cross-modal thinking in inertial HAR, and suggest important future directions for inertial-based wearable sensing research.
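One plausible ingredient of extracting virtual accelerometer data from video is differentiating a tracked body-point position twice with respect to time. The sketch below shows only that step, as a central second difference over a 1-D track. It is an assumption made for illustration, not the pipeline described in the thesis, and the function name and the toy track are hypothetical.

```python
def virtual_acceleration(positions, dt):
    """Central second difference of a 1-D position track sampled every dt seconds.

    Returns one acceleration value per interior sample, so the output is
    two elements shorter than the input.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    return [
        (positions[i + 1] - 2.0 * positions[i] + positions[i - 1]) / (dt * dt)
        for i in range(1, len(positions) - 1)
    ]

# Toy track following p(t) = t^2 at t = 0, 1, 2, 3, 4, so the true acceleration is 2.
track = [0.0, 1.0, 4.0, 9.0, 16.0]
accel = virtual_acceleration(track, dt=1.0)
print(accel)
```

A practical pipeline would additionally need pose tracking, conversion from image to world coordinates, and smoothing, since differentiating noisy positions twice amplifies noise considerably.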
Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition
To properly assist humans in their needs, human activity recognition (HAR)
systems need the ability to fuse information from multiple modalities. Our
hypothesis is that multimodal sensors, visual and non-visual, tend to provide
complementary information, addressing the limitations of other modalities. In
this work, we propose a multi-modal framework that learns to effectively
combine features from RGB video and IMU sensors, and show its robustness on the
MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the
first stage, each input encoder learns to effectively extract features, and in
the second stage, the model learns to combine these individual features. We show
significant improvements of 22% and 11% over the video-only and IMU-only
setups on the UTD-MHAD dataset, and 20% and 12% on the MMAct dataset. Through extensive
experimentation, we show the robustness of our model in the zero-shot setting and the
limited annotated data setting. We further compare with state-of-the-art
methods that use more input modalities and show that our method outperforms them
significantly on the more difficult MMAct dataset, and performs comparably on the
UTD-MHAD dataset.
AmicroN: A Framework for Generating Annotations for Human Activity Recognition with Granular Micro-Activities
Efficient human activity recognition (HAR) using sensor data needs a
significant volume of annotated data. The growing volume of unlabelled sensor
data has challenged conventional practices for gathering HAR annotations with
human-in-the-loop approaches, often leading to the collection of shallower
annotations. These shallower annotations ignore the fine-grained
micro-activities that constitute any complex activities of daily living (ADL).
With this in mind, in this paper we first analyze the lack of granular
annotations in available pre-annotated datasets to understand the practical
inconsistencies, and also perform a detailed survey of human
perception surrounding annotations. Drawing motivation from these, we next
develop the framework AmicroN, which can automatically generate micro-activity
annotations using locomotive signatures and the available coarse-grained
macro-activity labels. In the backend, AmicroN applies change-point detection
followed by zero-shot learning with activity embeddings to identify the unseen
micro-activities in an unsupervised manner. Rigorous evaluation on publicly
available datasets shows that AmicroN can accurately generate micro-activity
annotations with a median F1-score of >0.75. Additionally, we also show that
AmicroN can be used in a plug-and-play manner with Large Language Models (LLMs)
to obtain the micro-activity labels, thus making it more practical for
realistic applications.
Comment: 27 pages, 5 tables, 9 figures
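The change-point detection stage mentioned in this abstract can be illustrated with a simple sliding-window mean-shift detector. This is a generic sketch, not the algorithm used by AmicroN; the window size, the threshold, and the function name are assumptions chosen for the toy signal.

```python
def detect_change_points(signal, window=3, threshold=1.0):
    """Return indices where the mean of the next `window` samples differs most
    from the mean of the previous `window` samples.

    An index is reported when its score exceeds `threshold` and is a local
    maximum among neighbouring scores.
    """
    n = len(signal)
    scores = {}
    for i in range(window, n - window + 1):
        before = sum(signal[i - window:i]) / window
        after = sum(signal[i:i + window]) / window
        scores[i] = abs(after - before)

    change_points = []
    for i, score in scores.items():
        left = scores.get(i - 1, float("-inf"))
        right = scores.get(i + 1, float("-inf"))
        if score > threshold and score > left and score >= right:
            change_points.append(i)
    return change_points

# Toy signal with a single level shift starting at index 5.
signal = [0, 0, 0, 0, 0, 5, 5, 5, 5, 5]
points = detect_change_points(signal, window=3, threshold=2.0)
print(points)
```

Each detected index splits the stream into segments, and every segment could then be passed to a zero-shot labelling step to assign it a micro-activity label.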