CORe50: a New Dataset and Benchmark for Continuous Object Recognition
Continuous/Lifelong learning of high-dimensional data streams is a
challenging research problem. In fact, fully retraining models each time new
data become available is infeasible, due to computational and storage issues,
while naïve incremental strategies have been shown to suffer from
catastrophic forgetting. In the context of real-world object recognition
applications (e.g., robotic vision), where continuous learning is crucial, very
few datasets and benchmarks are available to evaluate and compare emerging
techniques. In this work, we propose a new dataset and benchmark, CORe50,
specifically designed for continuous object recognition, and introduce baseline
approaches for different continuous learning scenarios.
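One common baseline family in continual-learning benchmarks of this kind is rehearsal: keeping a small memory of past samples and replaying them alongside new data to counter catastrophic forgetting. A minimal sketch of a reservoir-style rehearsal buffer follows; the class name, capacity, and stream are illustrative, not the CORe50 baselines themselves:

```python
import random

class RehearsalBuffer:
    """Fixed-size memory of past (x, y) pairs, filled by reservoir sampling.

    Reservoir sampling keeps each item seen so far in the buffer with
    equal probability, without knowing the stream length in advance.
    """

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0                      # total items observed so far
        self.rng = random.Random(seed)

    def add(self, x, y):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append((x, y))
        else:
            # Replace a random slot with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = (x, y)

    def sample(self, k):
        # Mini-batch of stored examples to mix with the incoming batch.
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

# Simulate a stream of 1000 labeled items with a 50-item memory.
buf = RehearsalBuffer(capacity=50)
for i in range(1000):
    buf.add(x=i, y=i % 10)
print(len(buf.buffer))  # never exceeds capacity: 50
```

During training, each incoming batch would be concatenated with `buf.sample(k)` so the model keeps seeing old classes while learning new ones.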
A Developmental Neuro-Robotics Approach for Boosting the Recognition of Handwritten Digits
Developmental psychology and neuroimaging
research identified a close link between numbers and fingers,
which can boost the initial number knowledge in children. Recent
evidence shows that simulating children's embodied
strategies can also improve machine intelligence. This article
explores the application of embodied strategies to convolutional
neural network models in the context of developmental neuro-robotics, where training information is likely to be acquired
gradually during operation rather than being abundant and fully
available, as in classical machine learning scenarios. The
experimental analyses show that the proprioceptive information
from the robot fingers can improve network accuracy in the
recognition of handwritten Arabic digits when training examples
and epochs are few. This result is comparable to brain imaging
and longitudinal studies with young children. In conclusion, these
findings support the relevance of embodiment in training
artificial agents and point to a possible humanization of the
learning process, in which the robotic body can express the
internal processes of artificial intelligence, making them
more understandable to humans.
Neuro-Symbolic Approaches for Context-Aware Human Activity Recognition
Deep Learning models are a standard solution for sensor-based Human Activity
Recognition (HAR), but their deployment is often limited by labeled data
scarcity and models' opacity. Neuro-Symbolic AI (NeSy) provides an interesting
research direction to mitigate these issues by infusing knowledge about context
information into HAR deep learning classifiers. However, existing NeSy methods
for context-aware HAR require computationally expensive symbolic reasoners
during classification, making them less suitable for deployment on
resource-constrained devices (e.g., mobile devices). Additionally, NeSy
approaches for context-aware HAR have never been evaluated on in-the-wild
datasets, and their generalization capabilities in real-world scenarios are
questionable. In this work, we propose a novel approach based on a semantic
loss function that infuses knowledge constraints in the HAR model during the
training phase, avoiding symbolic reasoning during classification. Our results
on scripted and in-the-wild datasets show the impact of different semantic loss
functions in outperforming a purely data-driven model. We also compare our
solution with existing NeSy methods and analyze each approach's strengths and
weaknesses. Our semantic loss remains the only NeSy solution that can be
deployed as a single DNN without the need for symbolic reasoning modules,
reaching recognition rates close to (and in some cases better than) existing
approaches.
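The core idea of a semantic loss can be sketched independently of the paper's exact formulation: penalize the probability mass the classifier assigns to activities that are inconsistent with the current context. A hypothetical numpy version (the activity set and constraint encoding here are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def semantic_loss(logits, valid_mask):
    """-log of the total probability assigned to context-valid classes.

    logits:     (n_classes,) raw scores from the HAR model
    valid_mask: (n_classes,) 1 for activities consistent with the context
                (e.g. 'cycling' might be invalid while the user is indoors)
    """
    probs = softmax(logits)
    p_valid = (probs * valid_mask).sum()
    return -np.log(p_valid + 1e-12)

logits = np.array([2.0, 1.0, -1.0])                 # hypothetical 3-activity model
print(semantic_loss(logits, np.ones(3)))            # ~0: no constraint violated
print(semantic_loss(logits, np.array([0., 1., 1.])))  # >0: mass on class 0 penalized
```

Because this penalty is differentiable, it can be added to the cross-entropy term during training and dropped entirely at inference time, which is what allows deployment as a single DNN.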
Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts
Video self-supervised learning (VSSL) has made significant progress in recent
years. However, the exact behavior and dynamics of these models under different
forms of distribution shift are not yet known. In this paper, we
comprehensively study the behavior of six popular self-supervised methods
(v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various
forms of natural distribution shift, i.e., (i) context shift, (ii) viewpoint
shift, (iii) actor shift, (iv) source shift, (v) generalizability to unknown
classes (zero-shot), and (vi) open-set recognition. To perform this extensive
study, we carefully craft a test bed consisting of 17 in-distribution and
out-of-distribution benchmark pairs using available public datasets and a
series of evaluation protocols to stress-test the different methods under the
intended shifts. Our study uncovers a series of intriguing findings and
interesting behaviors of VSSL methods. For instance, we observe that while
video models generally struggle with context shifts, v-MAE and supervised
learning exhibit more robustness. Moreover, our study shows that v-MAE is a
strong temporal learner, whereas contrastive methods, v-SimCLR and v-MoCo,
exhibit strong performances against viewpoint shifts. When studying the notion
of open-set recognition, we notice a trade-off between closed-set and open-set
recognition performance if the pretrained VSSL encoders are used without
finetuning. We hope that our work will contribute to the development of robust
video representation learning frameworks for various real-world scenarios. The
project page and code are available at: https://pritamqu.github.io/OOD-VSSL (NeurIPS 2023 Spotlight)
On-device modeling of user's social context and familiar places from smartphone-embedded sensor data
Context modeling and recognition are crucial for adaptive mobile and
ubiquitous computing. Context-awareness in mobile environments relies on prompt
reactions to context changes. However, current solutions focus on limited
context information processed on centralized architectures, risking privacy
leakage and lacking personalization. On-device context modeling and recognition
are emerging research trends, addressing these concerns. Social interactions
and visited locations play significant roles in characterizing daily life
scenarios. This paper proposes an unsupervised and lightweight approach to
model the user's social context and locations directly on the mobile device.
Leveraging the ego-network model, the system extracts high-level, semantic-rich
context features from smartphone-embedded sensor data. For the social context,
the approach utilizes data on physical and cyber social interactions among
users and their devices. Regarding location, it prioritizes modeling the
familiarity degree of specific locations over raw location data, such as GPS
coordinates and proximity devices. The effectiveness of the proposed approach
is demonstrated through three sets of experiments, employing five real-world
datasets. These experiments evaluate the structure of social and location ego
networks, provide a semantic evaluation of the proposed models, and assess
mobile computing performance. Finally, the relevance of the extracted features
is showcased by the improved performance of three machine learning models in
recognizing daily-life situations. Compared to using only features related to
physical context, the proposed approach achieves a 3% improvement in AUROC, 9%
in Precision, and 5% in Recall.
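The familiarity degree of a place can be sketched as a function of how often and how long the user visits it, rather than of the raw coordinates. A toy version follows; the count-times-log-dwell weighting is an assumption for illustration, not the paper's exact model:

```python
import numpy as np

def familiarity(visits):
    """Map each place to a familiarity score in [0, 1].

    visits: dict place -> list of visit durations (minutes).
    The score combines visit count and total dwell time, then
    normalizes by the most familiar place.
    """
    raw = {p: len(d) * np.log1p(sum(d)) for p, d in visits.items()}
    top = max(raw.values())
    return {p: v / top for p, v in raw.items()}

visits = {                       # hypothetical week of location data
    "home":    [600, 540, 610, 580, 590, 620, 605],
    "office":  [480, 470, 465, 490, 475],
    "gym":     [60, 55],
    "airport": [120],
}
scores = familiarity(visits)
print(max(scores, key=scores.get))  # -> 'home'
```

Scores like these can then be fed to a downstream classifier as semantic features ("at a familiar place" vs. "at a rarely visited place") without ever exposing raw GPS coordinates.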
Provable Robustness for Streaming Models with a Sliding Window
The literature on provable robustness in machine learning has primarily
focused on static prediction problems, such as image classification, in which
input samples are assumed to be independent and model performance is measured
as an expectation over the input distribution. Robustness certificates are
derived for individual input instances with the assumption that the model is
evaluated on each instance separately. However, in many deep learning
applications such as online content recommendation and stock market analysis,
models use historical data to make predictions. Robustness certificates based
on the assumption of independent input samples are not directly applicable in
such scenarios. In this work, we focus on the provable robustness of machine
learning models in the context of data streams, where inputs are presented as a
sequence of potentially correlated items. We derive robustness certificates for
models that use a fixed-size sliding window over the input stream. Our
guarantees hold for the average model performance across the entire stream and
are independent of stream size, making them suitable for large data streams. We
perform experiments on speech detection and human activity recognition tasks
and show that our certificates can produce meaningful performance guarantees
against adversarial perturbations.
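The setting can be made concrete: the model classifies each fixed-size window of the stream, and the guarantee applies to performance averaged over all windows rather than to any single prediction. A minimal sketch of windowed evaluation, with a stand-in majority-vote classifier rather than the paper's certified models:

```python
import numpy as np

def sliding_windows(stream, w):
    """All contiguous windows of length w over the stream."""
    return [stream[i:i + w] for i in range(len(stream) - w + 1)]

def average_accuracy(stream, labels, w, classify):
    """Mean correctness over every window. The stream-level certificate
    bounds this average, not any individual window's prediction."""
    wins = sliding_windows(stream, w)
    correct = [classify(win) == y for win, y in zip(wins, labels[w - 1:])]
    return float(np.mean(correct))

# Hypothetical binary activity stream with per-timestep ground truth.
stream = [0, 0, 1, 1, 1, 0, 0, 0, 1]
labels = [0, 0, 1, 1, 1, 0, 0, 0, 1]
majority = lambda win: int(np.mean(win) >= 0.5)
print(len(sliding_windows(stream, 3)))  # 7 windows
print(average_accuracy(stream, labels, 3, majority))
```

Note that consecutive windows share w-1 items, so their predictions are correlated; this is exactly why per-instance certificates do not transfer directly to the streaming setting.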
Design and implementation of a convolutional neural network on an edge computing smartphone for human activity recognition
Edge computing aims to integrate computing into everyday settings, enabling the system to be context-aware and private to the user. With the increasing success and popularity of deep learning methods, there is an increased demand to leverage these techniques in mobile and wearable computing scenarios. In this paper, we present an assessment of a deep human activity recognition system's memory and execution time requirements when implemented on mid-range smartphone-class hardware, and the memory implications for embedded hardware. This paper presents the design of a convolutional neural network (CNN) in the context of a human activity recognition scenario. Here, the CNN layers automate feature learning, and we examine the influence of hyper-parameters such as the number of filters and the filter size on CNN performance. The proposed CNN showed increased robustness, with a better capability of detecting activities with temporal dependence, compared to models using statistical machine learning techniques. The model obtained an accuracy of 96.4% in a five-class static and dynamic activity recognition scenario. We calculated the proposed model's memory consumption and execution time requirements for use on a mid-range smartphone. Post-training per-channel quantization of weights and per-layer quantization of activations to 8 bits of precision produces classification accuracy within 2% of the floating-point network for this dense convolutional neural network architecture. Almost all of the size and execution time reduction in the optimized model was achieved through weight quantization. We achieved more than a four-fold reduction in model size when optimized to 8-bit, which ensured a feasible model capable of fast on-device inference.
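Per-channel post-training weight quantization as described above can be sketched in a few lines: each output channel gets its own scale, weights are rounded to 8-bit integers, and storage drops four-fold relative to float32. The layer shape below is illustrative, not taken from the paper:

```python
import numpy as np

def quantize_per_channel(w):
    """Symmetric per-channel int8 quantization of a weight matrix.

    w: (out_channels, in_features) float32 weights.
    Returns int8 weights plus one float scale per output channel.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)               # guard all-zero channels
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)   # hypothetical dense layer
q, scale = quantize_per_channel(w)
w_hat = q.astype(np.float32) * scale               # dequantized weights

print(q.dtype)                                     # int8
print(w.nbytes / q.nbytes)                         # 4.0: 32-bit -> 8-bit
print(np.abs(w - w_hat).max() <= scale.max() / 2)  # rounding error bound holds
```

At inference time only `q` and the per-channel `scale` vector are stored; the reconstruction `w_hat` stays within half a quantization step of the original weights, which is why accuracy degrades so little.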
Attention Mechanism for Recognition in Computer Vision
It has been proven that humans do not focus their attention on an entire scene at once when they perform a recognition task. Instead, they pay attention to the most important parts of the scene to extract the most discriminative information. Inspired by this observation, in this dissertation the importance of the attention mechanism in computer vision recognition tasks is studied by designing novel attention-based models. Specifically, four scenarios are investigated that represent the most important aspects of the attention mechanism. First, an attention-based model is designed to reduce the visual features' dimensionality by selectively processing only a small subset of the data. We study this aspect of the attention mechanism in a framework based on object recognition in distributed camera networks. Second, an attention-based image retrieval system (i.e., person re-identification) is proposed which learns to focus on the most discriminative regions of the person's image and to process those regions with higher computation power using a deep convolutional neural network. Furthermore, we show how visualizing the attention maps can make deep neural networks more interpretable: by visualizing the attention maps, we can observe the regions of the input image on which the neural network relies in order to make a decision. Third, a model for estimating the importance of the objects in a scene based on a given task is proposed. More specifically, the proposed model estimates the importance of the road users that a driver (or an autonomous vehicle) should pay attention to in a driving scenario in order to navigate safely. In this scenario, the attention estimate is the final output of the model.
Fourth, an attention-based module and a new loss function in a meta-learning-based few-shot learning system are proposed in order to incorporate the context of the task into the feature representations of the samples and increase few-shot recognition accuracy. In this dissertation, we showed that attention can be multi-faceted, and we studied the attention mechanism from the perspectives of feature selection, computational cost reduction, interpretable deep learning models, task-driven importance estimation, and context incorporation. Through the study of these four scenarios, we further advanced the field where ''attention is all you need''.
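The common thread across these four scenarios, the attention map itself, can be sketched with standard scaled dot-product attention. This is the generic formulation, not any specific model from the dissertation, and the dimensions are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    The softmax rows are the attention map: how much each query
    position attends to each key position (each row sums to 1).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, feature dimension 8
K = rng.normal(size=(6, 8))   # 6 key positions
V = rng.normal(size=(6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 6)
```

Visualizing `attn` over image regions is what yields the interpretability discussed above: each row shows which parts of the input a given query position relied on.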