12,224 research outputs found

    Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

    This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. Such a problem-oriented taxonomy has allowed us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. The comprehensive problem-oriented review of the advances in transfer learning has revealed not only the challenges in transfer learning for visual recognition, but also the problems (e.g. eight of the seventeen problems) that have been scarcely studied. This survey presents not only an up-to-date technical review for researchers, but also a systematic approach and a reference for machine learning practitioners to categorise a real problem and look up a possible solution accordingly.

    TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity Recognition

    Recent achievements in language models have showcased their extraordinary capabilities in bridging visual information with semantic language understanding. This leads us to a novel question: can language models connect textual semantics with IoT sensory signals to perform recognition tasks, e.g., Human Activity Recognition (HAR)? If so, an intelligent HAR system with human-like cognition can be built, capable of adapting to new environments and unseen categories. This paper explores its feasibility with an innovative approach, IoT-sEnsors-language alignmEnt pre-Training (TENT), which jointly aligns textual embeddings with IoT sensor signals, including camera video, LiDAR, and mmWave. Through IoT-language contrastive learning, we derive a unified semantic feature space that aligns multi-modal features with language embeddings, so that the IoT data corresponds to the specific words that describe it. To enhance the connection between textual categories and their IoT data, we propose supplementary descriptions and learnable prompts that bring more semantic information into the joint feature space. TENT can not only recognize actions that have been seen but also "guess" an unseen action from the closest textual words in the feature space. We demonstrate that TENT achieves state-of-the-art performance on zero-shot HAR tasks using different modalities, improving on the best vision-language models by over 12%. Comment: Preprint manuscript in submission
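
The "guess by the closest textual words" step described in this abstract can be illustrated with a small sketch. It is not TENT's implementation: the three-dimensional embeddings, the label set, and the function names `cosine_similarity` and `zero_shot_predict` are all hypothetical, and a real system would obtain both embeddings from trained encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(sensor_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the sensor embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(sensor_embedding, label_embeddings[label]),
    )


# Toy, hand-written embeddings (assumption: both modalities already share one space).
label_embeddings = {
    "walking": [1.0, 0.0, 0.0],
    "sitting": [0.0, 1.0, 0.0],
    "waving": [0.0, 0.0, 1.0],
}
sensor_embedding = [0.1, 0.2, 0.9]
predicted = zero_shot_predict(sensor_embedding, label_embeddings)
```

Because the label set is just a dictionary of text embeddings, an unseen activity can be added at inference time by embedding its name, without retraining the sensor encoder.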

    Recognition, Analysis, and Assessments of Human Skills using Wearable Sensors

    One of the biggest social issues in mature societies such as Europe and Japan is the aging population and declining birth rate. These societies face a serious problem with the retirement of expert workers, doctors, engineers, etc. The retirement and injuries of experts are especially serious in sectors such as medicine and industry, where training an expert takes a long time. Technology to support the training and assessment of skilled workers (such as doctors and manufacturing workers) is therefore strongly needed. Although there are some solutions to this problem, most of them are video-based, which violates the privacy of the subjects. Furthermore, they are not easy to deploy because they need large amounts of training data. This thesis provides a novel framework to recognize, analyze, and assess human skills with minimum customization cost. The presented framework tackles this problem in two different domains: industrial setups and medical operations of catheter-based cardiovascular interventions (CBCVI). In particular, the contributions of this thesis are four-fold. First, it proposes an easy-to-deploy framework for human activity recognition based on a zero-shot learning approach, which in turn is based on learning basic actions and objects. The model recognizes unseen activities as combinations of previously learned basic actions and the objects involved. Therefore, it is completely configurable by the user and can be used to detect completely new activities. Second, a novel gaze-estimation model for an attention-driven object detection task is presented. The key features of the model are: (i) the use of deformable convolutional layers to better incorporate spatial dependencies of different shapes of objects and backgrounds, and (ii) the formulation of the gaze-estimation problem in two different ways, as a classification problem and as a regression problem. We combine both formulations using a joint loss that incorporates both the cross-entropy and the mean-squared error in order to train our model. This improved the model's result from 6.8 (cross-entropy loss only) to 6.4 (joint loss). The third contribution of this thesis targets the quantification of the quality of actions using wearable sensors. To address the variety of scenarios, we have targeted two possibilities: a) both expert and novice data are available, and b) only expert data is available, a quite common case in safety-critical scenarios. The methods developed for both scenarios are deep learning based. In the first one, we use autoencoders with a One-Class SVM, and in the second one we use Siamese networks. These methods allow us to encode the expert's expertise and to learn the differences between novice and expert workers. This enables quantification of the performance of the novice in comparison to the expert worker. The fourth contribution explicitly targets medical practitioners and provides a methodology for novel gaze-based temporal-spatial analysis of CBCVI data. The developed methodology allows continuous registration and analysis of gaze data for analysis of the visual X-ray image processing (XRIP) strategies of expert operators in live-case scenarios and may assist in transferring experts' reading skills to novices.
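
A joint loss of the kind mentioned in this abstract, combining a cross-entropy term with a mean-squared-error term, can be sketched for a single sample as follows. The abstract gives no details, so everything specific here is an assumption: the gaze value is discretised into bins, the regression output is taken as the probability-weighted average of the bin centres, and the `mse_weight` parameter and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def joint_loss(logits, bin_centers, true_bin, true_value, mse_weight=1.0):
    """Cross-entropy over bins plus weighted squared error on the expected value."""
    probs = softmax(logits)
    # Classification view: penalise low probability on the correct bin.
    cross_entropy = -math.log(probs[true_bin])
    # Regression view: expected value under the predicted bin distribution.
    predicted_value = sum(p * c for p, c in zip(probs, bin_centers))
    squared_error = (predicted_value - true_value) ** 2
    return cross_entropy + mse_weight * squared_error


# Toy example: two bins centred at -10 and +10, uniform prediction.
example_loss = joint_loss([0.0, 0.0], [-10.0, 10.0], true_bin=1, true_value=10.0,
                          mse_weight=0.01)
```

With uniform logits the cross-entropy term is ln 2 and the expected value is 0, so the squared error is 100; the small weight keeps the two terms on a comparable scale in this toy case.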

    NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding

    Research on depth-based human activity analysis has achieved outstanding performance and demonstrated the effectiveness of 3D representation for action recognition. The existing depth-based and RGB+D-based action recognition benchmarks have a number of limitations, including the lack of large-scale training samples, realistic number of distinct class categories, diversity in camera views, varied environmental conditions, and variety of human subjects. In this work, we introduce a large-scale dataset for RGB+D human action recognition, which is collected from 106 distinct subjects and contains more than 114 thousand video samples and 8 million frames. This dataset contains 120 different action classes including daily, mutual, and health-related activities. We evaluate the performance of a series of existing 3D activity analysis methods on this dataset, and show the advantage of applying deep learning methods for 3D-based human action recognition. Furthermore, we investigate a novel one-shot 3D activity recognition problem on our dataset, and a simple yet effective Action-Part Semantic Relevance-aware (APSR) framework is proposed for this task, which yields promising results for recognition of the novel action classes. We believe the introduction of this large-scale dataset will enable the community to apply, adapt, and develop various data-hungry learning techniques for depth-based and RGB+D-based human activity understanding. [The dataset is available at: http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp] Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

    Cross-modal learning from visual information for activity recognition on inertial sensors

    The lack of large-scale, labeled datasets impedes progress in developing robust and generalized predictive models for human activity recognition (HAR) from wearable inertial sensor data. Labeled data is scarce because sensor data collection is expensive, and its annotation is time-consuming and error-prone. As a result, public inertial HAR datasets are small in terms of number of subjects, activity classes, hours of recorded data, and variation in recorded environments. Machine learning models developed using these small datasets are effectively blind to the diverse expressions of activities performed by wide-ranging populations in the real world, and progress in wearable inertial sensing is held back by this bottleneck for activity understanding. But just as Internet-scale text, image and audio data have pushed their respective pattern recognition fields to systems reliable enough for everyday use, easy access to large quantities of data can push forward the field of inertial HAR, and by extension wearable sensing. To this end, this thesis pioneers the idea of exploiting the visual modality as a source domain for cross-modal learning, such that data and knowledge can be transferred across to benefit the target domain of inertial HAR. This thesis makes three contributions to inertial HAR through cross-modal approaches. First, to overcome the barrier of expensive inertial data collection and annotation, we contribute a novel pipeline that automatically extracts virtual accelerometer data from videos of human activities, which are readily annotated and accessible in large quantities. Second, we propose acquiring transferable representations about activities from HAR models trained using large quantities of visual data, to enrich the development of inertial HAR models. Finally, the third contribution exposes HAR models to the challenging setting of zero-shot learning; we propose mechanisms that leverage cross-modal correspondence to enable inference on previously unseen classes. Unlike prior approaches, this body of work pushes forward the state of the art in HAR not by exhausting resources concentrated in the inertial domain, but by exploiting an existing, resourceful, intuitive, and informative source, the visual domain. These contributions represent a new line of cross-modal thinking in inertial HAR, and suggest important future directions for inertial-based wearable sensing research.
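
One way to picture the "virtual accelerometer data from videos" idea is to differentiate a tracked body-part position twice over time. The sketch below does this with a central finite difference. It is an assumption-laden illustration rather than the thesis's pipeline, which the abstract does not describe: it presumes that 3D positions of a joint (in metres) have already been estimated per frame, and it ignores gravity, sensor orientation, and noise. The function name `virtual_acceleration` is hypothetical.

```python
def virtual_acceleration(positions, frame_rate):
    """Approximate acceleration from per-frame 3D positions.

    positions: list of (x, y, z) tuples, one per video frame.
    frame_rate: frames per second of the video.
    Returns one (ax, ay, az) tuple per interior frame, using the
    central second difference (p[i+1] - 2*p[i] + p[i-1]) / dt**2.
    """
    dt = 1.0 / frame_rate
    accelerations = []
    for i in range(1, len(positions) - 1):
        prev_p, cur_p, next_p = positions[i - 1], positions[i], positions[i + 1]
        accelerations.append(
            tuple((n - 2.0 * c + p) / (dt * dt) for p, c, n in zip(prev_p, cur_p, next_p))
        )
    return accelerations


# Toy trajectory: x(t) = t**2 at 10 frames per second, i.e. constant 2 m/s^2 along x.
frames = [((i / 10.0) ** 2, 0.0, 0.0) for i in range(4)]
acc = virtual_acceleration(frames, frame_rate=10)
```

The first and last frames have no neighbour on one side, so the output is two samples shorter than the input.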

    Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity Recognition

    To properly assist humans in their needs, human activity recognition (HAR) systems need the ability to fuse information from multiple modalities. Our hypothesis is that multimodal sensors, visual and non-visual, tend to provide complementary information, addressing the limitations of other modalities. In this work, we propose a multi-modal framework that learns to effectively combine features from RGB video and IMU sensors, and show its robustness on the MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the first stage, each input encoder learns to effectively extract features, and in the second stage, the model learns to combine these individual features. We show significant improvements of 22% and 11% compared to the video-only and IMU-only setups on the UTD-MHAD dataset, and 20% and 12% on the MMAct dataset. Through extensive experimentation, we show the robustness of our model in the zero-shot setting and the limited annotated data setting. We further compare with state-of-the-art methods that use more input modalities and show that our method significantly outperforms them on the more difficult MMAct dataset, and performs comparably on the UTD-MHAD dataset.

    AmicroN: A Framework for Generating Annotations for Human Activity Recognition with Granular Micro-Activities

    Efficient human activity recognition (HAR) using sensor data needs a significant volume of annotated data. The growing volume of unlabelled sensor data has challenged conventional practices for gathering HAR annotations with human-in-the-loop approaches, often leading to the collection of shallower annotations. These shallower annotations ignore the fine-grained micro-activities that constitute any complex activity of daily living (ADL). With this in mind, we first analyze the lack of granular annotations in available pre-annotated datasets to understand the practical inconsistencies, and also perform a detailed survey of human perception surrounding annotations. Motivated by these findings, we next develop the framework AmicroN, which can automatically generate micro-activity annotations using locomotive signatures and the available coarse-grained macro-activity labels. In the backend, AmicroN applies change-point detection followed by zero-shot learning with activity embeddings to identify the unseen micro-activities in an unsupervised manner. Rigorous evaluation on publicly available datasets shows that AmicroN can accurately generate micro-activity annotations with a median F1-score of >0.75. Additionally, we show that AmicroN can be used in a plug-and-play manner with Large Language Models (LLMs) to obtain the micro-activity labels, thus making it more practical for realistic applications. Comment: 27 pages, 5 tables, 9 figures
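
The change-point detection step mentioned in this abstract can be illustrated with a deliberately simple detector. The abstract does not say which algorithm AmicroN uses, so this is a stand-in technique: it compares the mean of a window before each sample with the mean of a window after it and keeps local peaks of that difference. The `window` and `threshold` parameters and the function name `detect_change_points` are hypothetical.

```python
def detect_change_points(signal, window=5, threshold=1.0):
    """Return indices where the mean of the signal shifts abruptly.

    For each index i, the score is |mean(signal[i:i+window]) -
    mean(signal[i-window:i])|. An index is reported when its score
    exceeds the threshold and is the largest score within +/- window.
    """
    scores = {}
    for i in range(window, len(signal) - window + 1):
        left_mean = sum(signal[i - window:i]) / window
        right_mean = sum(signal[i:i + window]) / window
        scores[i] = abs(right_mean - left_mean)

    change_points = []
    for i, score in scores.items():
        neighbours = [scores[j] for j in range(i - window, i + window + 1) if j in scores]
        if score > threshold and score == max(neighbours):
            change_points.append(i)
    return change_points


# Toy 1-D signal: a step from 0 to 3 at index 10.
step_signal = [0.0] * 10 + [3.0] * 10
boundaries = detect_change_points(step_signal)
```

Each segment between consecutive boundaries could then be passed to a zero-shot labelling step to assign a micro-activity name.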