Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition
The primate visual system achieves remarkable visual object recognition
performance even in brief presentations and under changes to object exemplar,
geometric transformations, and background variation (a.k.a. core visual object
recognition). This performance is mediated by the representation
formed in inferior temporal (IT) cortex. In parallel, recent advances in
machine learning have led to ever higher performing models of object
recognition using artificial deep neural networks (DNNs). It remains unclear,
however, whether the representational performance of DNNs rivals that of the
brain. A major difficulty in producing an accurate comparison has been the lack
of a unifying metric that accounts for experimental limitations, such as the
amount of noise, the number of neural recording sites, and the number of
trials, and for computational limitations, such as the complexity of the
decoding classifier and the number of classifier training examples. In this
work we perform a direct
comparison that corrects for these experimental limitations and computational
considerations. As part of our methodology, we propose an extension of "kernel
analysis" that measures the generalization accuracy as a function of
representational complexity. Our evaluations show that, unlike previous
bio-inspired models, the latest DNNs rival the representational performance of
IT cortex on this visual object recognition task. Furthermore, we show that
models that perform well on measures of representational performance also
perform well on measures of representational similarity to IT and on measures
of predicting individual IT multi-unit responses. Whether these DNNs rely on
computational mechanisms similar to the primate visual system is yet to be
determined, but, unlike all previous bio-inspired models, that possibility
cannot be ruled out merely on representational performance grounds.
Comment: 35 pages, 12 figures, extends and expands upon arXiv:1301.353
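As a rough illustration of the kernel-analysis extension described above, the following sketch decodes object labels from a feature matrix while sweeping a complexity parameter, here the number of kernel principal components kept. The synthetic features, labels, RBF kernel width, and component counts are placeholder assumptions, not the paper's actual protocol or data.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 128))              # stand-in for a DNN or IT feature matrix
y = (X[:, :4].sum(axis=1) > 0).astype(int)   # stand-in binary object labels

acc_vs_complexity = {}
for d in (1, 2, 4, 8, 16, 32, 64):           # complexity = number of kernel PCs kept
    Z = KernelPCA(n_components=d, kernel="rbf", gamma=1e-2).fit_transform(X)
    acc = cross_val_score(RidgeClassifier(), Z, y, cv=5).mean()
    acc_vs_complexity[d] = acc               # generalization accuracy at this complexity
print(acc_vs_complexity)
```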
Active learning with gaussian processes for object categorization
Discriminative methods for visual object category recognition are typically non-probabilistic, predicting class labels but not directly providing an estimate of uncertainty. Gaussian Processes (GPs) are powerful regression techniques with explicit uncertainty models; we show here how Gaussian Processes with covariance functions based on a Pyramid Match Kernel (PMK) can be used for probabilistic object category recognition. The uncertainty model provided by GPs offers confidence estimates at test points, and naturally allows for an active learning paradigm in which points are optimally selected for interactive labeling. We derive a novel active category learning method based on our probabilistic regression model, and show that a significant boost in classification performance is possible, especially when the amount of training data for a category is ultimately very small.
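The loop below is a simplified stand-in for the active learning scheme this abstract describes: a GP regressor is refit as labels arrive, and the most uncertain pool point is queried next. The paper instead builds the GP covariance from a Pyramid Match Kernel over image feature sets and derives its own selection criterion; the pool, labels, and RBF length scale here are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
X_pool = rng.normal(size=(200, 32))          # unlabeled pool of image feature vectors
y_pool = np.sign(X_pool[:, 0])               # +1/-1 category labels (oracle)

labeled = list(rng.choice(len(X_pool), size=5, replace=False))
for _ in range(20):                          # active learning loop
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0))
    gp.fit(X_pool[labeled], y_pool[labeled])
    mu, std = gp.predict(X_pool, return_std=True)
    std[labeled] = -np.inf                   # never re-query an already-labeled point
    labeled.append(int(np.argmax(std)))      # query the most uncertain point
print("pool accuracy:", (np.sign(mu) == y_pool).mean())
```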
A Comparative Analysis of Machine Learning Methods for Lane Change Intention Recognition Using Vehicle Trajectory Data
Accurately detecting and predicting lane change (LC)processes can help
autonomous vehicles better understand their surrounding environment, recognize
potential safety hazards, and improve traffic safety. This paper focuses on LC
processes and compares different machine learning methods' performance to
recognize LC intention from high-dimensional time series data. To validate
the performance of the proposed models, a total of 1,023 vehicle trajectories
were extracted from the CitySim dataset. For the LC intention recognition task,
the results indicate that ensemble methods reduce the impact of Type II and
Type III classification errors while achieving 98% classification accuracy.
Without sacrificing recognition accuracy, LightGBM demonstrates a sixfold
improvement in model training efficiency over the XGBoost algorithm.
Comment: arXiv admin note: text overlap with arXiv:2304.1373
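A hedged sketch of the kind of comparison reported above: LightGBM and XGBoost classifiers are fit on flattened trajectory windows and their training times compared. The window shape, feature count, and three-way labels are synthetic placeholders, not the CitySim extraction.

```python
import time
import numpy as np
import lightgbm as lgb
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.normal(size=(1023, 50 * 6)).astype(np.float32)  # 50 timesteps x 6 features per trajectory
y = rng.integers(0, 3, size=1023)            # 0 = lane keep, 1 = left LC, 2 = right LC

for name, model in [("LightGBM", lgb.LGBMClassifier(n_estimators=200)),
                    ("XGBoost", xgb.XGBClassifier(n_estimators=200))]:
    t0 = time.perf_counter()
    model.fit(X, y)
    print(name, "training time: %.2fs" % (time.perf_counter() - t0))
```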
FACE READERS: The Frontier of Computer Vision and Math Learning
The future of AI-assisted individualized learning includes computer vision to inform intelligent tutors and teachers about student affect, motivation and performance. Facial expression recognition is essential in recognizing subtle differences when students ask for hints or fail to solve problems. Facial features and classification labels enable intelligent tutors to predict students’ performance and recommend activities. Videos can capture students’ faces and model their effort and progress; machine learning classifiers can support intelligent tutors to provide interventions. One goal of this research is to support deep dives by teachers to identify students’ individual needs through facial expression and to provide immediate feedback. Another goal is to develop data-directed education to gauge students’ pre-existing knowledge and analyze real-time data that will engage both teachers and students in more individualized and precision teaching and learning. This paper identifies three phases in the process of recognizing and predicting student progress based on analyzing facial features: Phase I: Collecting datasets and identifying salient labels for facial features and student attention/engagement; Phase II: Building and training deep learning models of facial features; and Phase III: Predicting student problem-solving outcome.
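As a sketch of the Phase II step (building and training deep learning models of facial features), the snippet below defines a small PyTorch CNN over face crops. The architecture, 64x64 input size, and five-way label count are illustrative assumptions; the abstract does not specify them.

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self, n_classes=5):          # hypothetical affect/engagement label count
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.head = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):                     # x: (batch, 3, 64, 64) face crops
        return self.head(self.features(x).flatten(1))

model = ExpressionNet()
logits = model(torch.randn(8, 3, 64, 64))    # dummy batch in place of real video frames
print(logits.shape)                           # torch.Size([8, 5])
```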
Analysing acoustic model changes for active learning in automatic speech recognition
In active learning for Automatic Speech Recognition (ASR), a portion of the
data is automatically selected for manual transcription. The objective is to
improve ASR performance with retrained acoustic models. The standard approaches
are based on the confidence of individual sentences. In this study, we look
into an alternative view on transcript label quality, in which Gaussian
Supervector Distance (GSD) is used as a criterion for data selection. GSD is a
metric which quantifies how much the model changed during its adaptation. Using
an automatic speech recognition transcript derived from an out-of-domain
acoustic model, unsupervised adaptation was conducted and GSD was computed. The
adapted model was then applied to an audiobook transcription task. It is found
that GSD provides hints for predicting data transcription quality. A
preliminary attempt at active learning proves the effectiveness of the GSD
selection criterion over random selection, shedding light on its prospective
use.
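A plausible numpy sketch of a Gaussian Supervector Distance: how far the GMM means moved during adaptation, with each component weighted by its mixture weight and normalized by a diagonal covariance. This follows the common GMM-supervector formulation; the paper's exact definition may differ, and the dimensions below are illustrative.

```python
import numpy as np

def gsd(means_before, means_after, weights, diag_covs):
    """means_*: (K, D) GMM means; weights: (K,); diag_covs: (K, D)."""
    diff = means_after - means_before
    per_comp = np.sum(diff * diff / diag_covs, axis=1)  # Mahalanobis-style shift per Gaussian
    return float(np.sqrt(np.sum(weights * per_comp)))

K, D = 64, 39                                # e.g. 64 Gaussians over 39-dim MFCC features
rng = np.random.default_rng(3)
mu0 = rng.normal(size=(K, D))                # means before adaptation
mu1 = mu0 + 0.1 * rng.normal(size=(K, D))    # means after unsupervised adaptation
print(gsd(mu0, mu1, np.full(K, 1 / K), np.ones((K, D))))
```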
Self-Supervised Learning for Audio-Based Emotion Recognition
Emotion recognition models using audio input data can enable the development
of interactive systems with applications in mental healthcare, marketing,
gaming, and social media analysis. While the field of affective computing using
audio data is rich, a major barrier to achieving consistently high-performing
models is the paucity of available training labels. Self-supervised learning
(SSL) is a family of methods which can learn despite a scarcity of supervised
labels by predicting properties of the data itself. To understand the utility
of self-supervised learning for audio-based emotion recognition, we have
applied self-supervised learning pre-training to the classification of emotions
from CMU-MOSEI's acoustic modality. Unlike prior papers that have
experimented with raw acoustic data, our technique has been applied to encoded
acoustic data. Our model is first pre-trained to uncover the randomly-masked
timestamps of the acoustic data. The pre-trained model is then fine-tuned using
a small sample of annotated data. The performance of the final model is then
evaluated via several evaluation metrics against a baseline deep learning model
with an identical backbone architecture. We find that self-supervised learning
consistently improves the performance of the model across all metrics. This
work shows the utility of self-supervised learning for affective computing,
demonstrating that self-supervised learning is most useful when the number of
training examples is small, and that the effect is most pronounced for emotions
that are easier to classify, such as happiness, sadness, and anger. This work
further demonstrates that self-supervised learning works when applied to
embedded feature representations rather than the traditional approach of
pre-training on the raw input space.
Comment: 8 pages, 9 figures, submitted to IEEE Transactions on Affective
Computing
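A minimal sketch of the masked-timestamp pretraining described above: random timesteps of an encoded acoustic sequence are zeroed out and a small Transformer encoder is trained to reconstruct them. The 74-dimensional features, mask rate, and encoder configuration are assumptions, not the paper's reported setup.

```python
import torch
import torch.nn as nn

dim, steps, mask_rate = 74, 50, 0.15         # assumed encoded-feature size and mask rate
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True),
    num_layers=2)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

x = torch.randn(16, steps, dim)              # batch of encoded acoustic sequences
mask = torch.rand(16, steps) < mask_rate     # which timestamps to hide
x_in = x.masked_fill(mask.unsqueeze(-1), 0.0)
pred = encoder(x_in)
loss = (pred - x)[mask].pow(2).mean()        # reconstruct only the masked timestamps
opt.zero_grad(); loss.backward(); opt.step() # one pretraining step
```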
Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos
Recognizing the activities that cause distraction in real-world driving
scenarios is critical for ensuring the safety and reliability of both drivers
and pedestrians on the roadways. Conventional computer vision techniques are
typically data-intensive and require a large volume of annotated training data
to detect and classify various distracted driving behaviors, thereby limiting
their efficiency and scalability. We aim to develop a generalized framework
that showcases robust performance with access to limited or no annotated
training data. Recently, vision-language models have offered large-scale
visual-textual pretraining that can be adapted to task-specific learning like
distracted driving activity recognition. Vision-language pretraining models,
such as CLIP, have shown significant promise in learning natural
language-guided visual representations. This paper proposes a CLIP-based driver
activity recognition approach that identifies driver distraction from
naturalistic driving images and videos. CLIP's vision embedding offers
zero-shot transfer and task-based finetuning, which can classify distracted
activities from driving video data. Our results show that this framework offers
state-of-the-art performance on zero-shot transfer and video-based CLIP for
predicting the driver's state on two public datasets. We propose both
frame-based and video-based frameworks developed on top of CLIP's visual
representation for the distracted driving detection and classification task and
report the results.
Comment: 15 pages, 10 figures
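A hedged sketch of the zero-shot transfer step using OpenAI's CLIP package: a single driving frame is scored against natural-language descriptions of driver states. The prompt list and image path are illustrative placeholders, not the paper's prompt set or data.

```python
import torch
import clip                                   # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = ["a photo of a driver paying attention to the road",
           "a photo of a driver texting on a phone",
           "a photo of a driver drinking while driving"]
image = preprocess(Image.open("frame.jpg")).unsqueeze(0).to(device)  # placeholder frame
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```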