Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition
The primate visual system achieves remarkable visual object recognition
performance even in brief presentations and under changes to object exemplar,
geometric transformations, and background variation (a.k.a. core visual object
recognition). This performance is mediated by the representation
formed in inferior temporal (IT) cortex. In parallel, recent advances in
machine learning have led to ever higher performing models of object
recognition using artificial deep neural networks (DNNs). It remains unclear,
however, whether the representational performance of DNNs rivals that of the
brain. A major difficulty in producing an accurate comparison has been the lack
of a unifying metric that accounts for experimental limitations, such as the
amount of noise, the number of neural recording sites, and the number of
trials, and for computational limitations, such as the complexity of the
decoding classifier and the number of classifier training examples. In this
work we perform a direct
comparison that corrects for these experimental limitations and computational
considerations. As part of our methodology, we propose an extension of "kernel
analysis" that measures the generalization accuracy as a function of
representational complexity. Our evaluations show that, unlike previous
bio-inspired models, the latest DNNs rival the representational performance of
IT cortex on this visual object recognition task. Furthermore, we show that
models that perform well on measures of representational performance also
perform well on measures of representational similarity to IT and on measures
of predicting individual IT multi-unit responses. Whether these DNNs rely on
computational mechanisms similar to the primate visual system is yet to be
determined, but, unlike all previous bio-inspired models, that possibility
cannot be ruled out merely on representational performance grounds.
Comment: 35 pages, 12 figures, extends and expands upon arXiv:1301.353
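As a rough illustration of the kernel-analysis extension described above, the following sketch decodes object labels from a feature matrix while sweeping a complexity parameter, here the number of kernel principal components kept. The synthetic features, labels, RBF kernel width, and component counts are placeholder assumptions, not the paper's actual protocol or data.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))              # stand-in for a DNN or IT feature matrix
y = (X[:, :4].sum(axis=1) > 0).astype(int)   # stand-in binary object labels

acc_vs_complexity = {}
for d in (1, 2, 4, 8, 16, 32, 64):           # complexity = number of kernel PCs kept
    Z = KernelPCA(n_components=d, kernel="rbf", gamma=1e-2).fit_transform(X)
    acc = cross_val_score(RidgeClassifier(), Z, y, cv=5).mean()
    acc_vs_complexity[d] = acc               # generalization accuracy at this complexity
print(acc_vs_complexity)
```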
Active learning with gaussian processes for object categorization
Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) are powerful regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. The uncertainty model provided by GPs offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We derive a novel active category learning method based on our probabilistic regression model, and show that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small.
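The loop below is a simplified stand-in for the active learning scheme this abstract describes: a GP regressor is refit as labels arrive, and the most uncertain pool point is queried next. The paper instead builds the GP covariance from a Pyramid Match Kernel over image feature sets and derives its own selection criterion; the pool, labels, and RBF length scale here are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
X_pool = rng.normal(size=(200, 32))          # unlabeled pool of image feature vectors
y_pool = np.sign(X_pool[:, 0])               # +1/-1 category labels (oracle)

labeled = list(rng.choice(len(X_pool), size=5, replace=False))
for _ in range(20):                          # active learning loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0))
    gp.fit(X_pool[labeled], y_pool[labeled])
    mu, std = gp.predict(X_pool, return_std=True)
    std[labeled] = -np.inf                   # never re-query an already-labeled point
    labeled.append(int(np.argmax(std)))      # query the most uncertain point
print("pool accuracy:", (np.sign(mu) == y_pool).mean())
```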
A Comparative Analysis of Machine Learning Methods for Lane Change Intention Recognition Using Vehicle Trajectory Data
Accurately detecting and predicting lane change (LC)processes can help
autonomous vehicles better understand their surrounding environment, recognize
potential safety hazards, and improve traffic safety. This paper focuses on LC
processes and compares different machine learning methods' performance to
recognize LC intention from high-dimensional time series data. To validate
the performance of the proposed models, a total of 1,023 vehicle trajectories
were extracted from the CitySim dataset. For the LC intention recognition task,
the results indicate that ensemble methods reduce the impact of Type II and
Type III classification errors while achieving 98% classification accuracy.
Without sacrificing recognition accuracy, LightGBM demonstrates a sixfold
improvement in model training efficiency over the XGBoost algorithm.
Comment: arXiv admin note: text overlap with arXiv:2304.1373
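A hedged sketch of the kind of comparison reported above: LightGBM and XGBoost classifiers are fit on flattened trajectory windows and their training times compared. The window shape, feature count, and three-way labels are synthetic placeholders, not the CitySim extraction.

```python
import time
import numpy as np
import lightgbm as lgb
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(1023, 50 * 6)).astype(np.float32)  # 50 timesteps x 6 features per trajectory
y = rng.integers(0, 3, size=1023)            # 0 = lane keep, 1 = left LC, 2 = right LC

for name, model in [("LightGBM", lgb.LGBMClassifier(n_estimators=200)),
                    ("XGBoost", xgb.XGBClassifier(n_estimators=200))]:
    t0 = time.perf_counter()
    model.fit(X, y)
    print(name, "training time: %.2fs" % (time.perf_counter() - t0))
```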
FACE READERS: The Frontier of Computer Vision and Math Learning
The future of AI-assisted individualized learning includes computer vision to inform intelligent tutors and teachers about student affect, motivation and performance. Facial expression recognition is essential in recognizing subtle differences when students ask for hints or fail to solve problems. Facial features and classification labels enable intelligent tutors to predict students’ performance and recommend activities. Videos can capture students’ faces and model their effort and progress; machine learning classifiers can support intelligent tutors to provide interventions. One goal of this research is to support deep dives by teachers to identify students’ individual needs through facial expression and to provide immediate feedback. Another goal is to develop data-directed education to gauge students’ pre-existing knowledge and analyze real-time data that will engage both teachers and students in more individualized and precision teaching and learning. This paper identifies three phases in the process of recognizing and predicting student progress based on analyzing facial features: Phase I: Collecting datasets and identifying salient labels for facial features and student attention/engagement; Phase II: Building and training deep learning models of facial features; and Phase III: Predicting student problem-solving outcome.
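As a sketch of the Phase II step (building and training deep learning models of facial features), the snippet below defines a small PyTorch CNN over face crops. The architecture, 64x64 input size, and five-way label count are illustrative assumptions; the abstract does not specify them.

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self, n_classes=5):          # hypothetical affect/engagement label count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):                     # x: (batch, 3, 64, 64) face crops
        return self.head(self.features(x).flatten(1))

model = ExpressionNet()
logits = model(torch.randn(8, 3, 64, 64))    # dummy batch in place of real video frames
print(logits.shape)                           # torch.Size([8, 5])
```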
Analysing acoustic model changes for active learning in automatic speech recognition
In active learning for Automatic Speech Recognition (ASR), a portion of the
data is automatically selected for manual transcription. The objective is to
improve ASR performance with retrained acoustic models. The standard approaches
are based on the confidence of individual sentences. In this study, we look
into an alternative view on transcript label quality, in which Gaussian
Supervector Distance (GSD) is used as a criterion for data selection. GSD is a
metric which quantifies how much the model changed during its adaptation. Using
an automatic speech recognition transcript derived from an out-of-domain
acoustic model, unsupervised adaptation was conducted and GSD was computed. The
adapted model was then applied to an audiobook transcription task. It is found
that GSD provides hints for predicting data transcription quality. A
preliminary attempt at active learning proves the effectiveness of the GSD
selection criterion over random selection, shedding light on its prospective
use.
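A plausible numpy sketch of a Gaussian Supervector Distance: how far the GMM means moved during adaptation, with each component weighted by its mixture weight and normalized by a diagonal covariance. This follows the common GMM-supervector formulation; the paper's exact definition may differ, and the dimensions below are illustrative.

```python
import numpy as np

def gsd(means_before, means_after, weights, diag_covs):
    """means_*: (K, D) GMM means; weights: (K,); diag_covs: (K, D)."""
    diff = means_after - means_before
    per_comp = np.sum(diff * diff / diag_covs, axis=1)  # Mahalanobis-style shift per Gaussian
    return float(np.sqrt(np.sum(weights * per_comp)))

K, D = 64, 39                                # e.g. 64 Gaussians over 39-dim MFCC features
rng = np.random.default_rng(3)
mu0 = rng.normal(size=(K, D))                # means before adaptation
mu1 = mu0 + 0.1 * rng.normal(size=(K, D))    # means after unsupervised adaptation
print(gsd(mu0, mu1, np.full(K, 1 / K), np.ones((K, D))))
```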
Self-Supervised Learning for Audio-Based Emotion Recognition
Emotion recognition models using audio input data can enable the development
of interactive systems with applications in mental healthcare, marketing,
gaming, and social media analysis. While the field of affective computing using
audio data is rich, a major barrier to achieving consistently high-performing
models is the paucity of available training labels. Self-supervised learning
(SSL) is a family of methods which can learn despite a scarcity of supervised
labels by predicting properties of the data itself. To understand the utility
of self-supervised learning for audio-based emotion recognition, we have
applied self-supervised learning pre-training to the classification of emotions
from CMU-MOSEI's acoustic modality. Unlike prior papers that have
experimented with raw acoustic data, our technique has been applied to encoded
acoustic data. Our model is first pre-trained to uncover the randomly-masked
timestamps of the acoustic data. The pre-trained model is then fine-tuned using
a small sample of annotated data. The performance of the final model is then
evaluated via several evaluation metrics against a baseline deep learning model
with an identical backbone architecture. We find that self-supervised learning
consistently improves the performance of the model across all metrics. This
work shows the utility of self-supervised learning for affective computing,
demonstrating that self-supervised learning is most useful when the number of
training examples is small, and that the effect is most pronounced for emotions
that are easier to classify, such as happiness, sadness, and anger. This work
further demonstrates that self-supervised learning works when applied to
embedded feature representations rather than the traditional approach of
pre-training on the raw input space.
Comment: 8 pages, 9 figures, submitted to IEEE Transactions on Affective
Computing
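A minimal sketch of the masked-timestamp pretraining described above: random timesteps of an encoded acoustic sequence are zeroed out and a small Transformer encoder is trained to reconstruct them. The 74-dimensional features, mask rate, and encoder configuration are assumptions, not the paper's reported setup.

```python
import torch
import torch.nn as nn

dim, steps, mask_rate = 74, 50, 0.15         # assumed encoded-feature size and mask rate
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True),
    num_layers=2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

x = torch.randn(16, steps, dim)              # batch of encoded acoustic sequences
mask = torch.rand(16, steps) < mask_rate     # which timestamps to hide
x_in = x.masked_fill(mask.unsqueeze(-1), 0.0)
pred = encoder(x_in)
loss = (pred - x)[mask].pow(2).mean()        # reconstruct only the masked timestamps
opt.zero_grad(); loss.backward(); opt.step() # one pretraining step
```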
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Recognizing the activities that cause distraction in real-world driving
scenarios is critical for ensuring the safety and reliability of both drivers
and pedestrians on the roadways. Conventional computer vision techniques are
typically data-intensive and require a large volume of annotated training data
to detect and classify various distracted driving behaviors, thereby limiting
their efficiency and scalability. We aim to develop a generalized framework
that showcases robust performance with access to limited or no annotated
training data. Recently, vision-language models have offered large-scale
visual-textual pretraining that can be adapted to task-specific learning like
distracted driving activity recognition. Vision-language pretraining models,
such as CLIP, have shown significant promise in learning natural
language-guided visual representations. This paper proposes a CLIP-based driver
activity recognition approach that identifies driver distraction from
naturalistic driving images and videos. CLIP's vision embedding offers
zero-shot transfer and task-based finetuning, which can classify distracted
activities from driving video data. Our results show that this framework offers
state-of-the-art performance on zero-shot transfer and video-based CLIP for
predicting the driver's state on two public datasets. We propose both
frame-based and video-based frameworks developed on top of CLIP's visual
representation for the distracted driving detection and classification task and
report the results.
Comment: 15 pages, 10 figures
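A hedged sketch of the zero-shot transfer step using OpenAI's CLIP package: a single driving frame is scored against natural-language descriptions of driver states. The prompt list and image path are illustrative placeholders, not the paper's prompt set or data.

```python
import torch
import clip                                   # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a driver paying attention to the road",
           "a photo of a driver texting on a phone",
           "a photo of a driver drinking while driving"]
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # placeholder frame
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```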