
    Vision systems with the human in the loop

    The emerging cognitive vision paradigm deals with vision systems that apply machine learning and automatic reasoning in order to learn from what they perceive. Cognitive vision systems can rate the relevance and consistency of newly acquired knowledge, and they can adapt to their environment and thus exhibit high robustness. This contribution presents vision systems that aim at flexibility and robustness. One is tailored for content-based image retrieval; the others are cognitive vision systems that constitute prototypes of visual active memories, which evaluate, gather, and integrate contextual knowledge for visual analysis. All three systems are designed to interact with human users. After discussing adaptive content-based image retrieval and object and action recognition in an office environment, we raise the issue of assessing cognitive systems. Experiences from psychologically evaluated human-machine interactions are reported, and the promising potential of psychologically grounded usability experiments is stressed.

    MirBot: A collaborative object recognition system for smartphones using convolutional neural networks

    MirBot is a collaborative application for smartphones that allows users to perform object recognition. The app can be used to take a photograph of an object, select the region of interest, and obtain the most likely class (dog, chair, etc.) by means of similarity search using features extracted from a convolutional neural network (CNN). The answers provided by the system can be validated by the user so as to improve the results for future queries. All the images are stored together with a series of metadata, yielding a multimodal incremental dataset labeled with synset identifiers from the WordNet ontology. This dataset grows continuously thanks to the users' feedback and is publicly available for research. This work details the MirBot object recognition system, analyzes the statistics gathered after more than four years of usage, describes the image classification methodology, and performs an exhaustive evaluation using handcrafted features, convolutional neural codes, and different transfer learning techniques. After comparing various models and transformation methods, the results show that the CNN features keep the accuracy of MirBot constant over time, despite the increasing number of new classes. The app is freely available on the Apple and Google Play stores.
    Comment: Accepted in Neurocomputing, 201
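    As a rough illustration of the retrieval scheme the abstract describes, the Python sketch below extracts L2-normalized CNN features ("neural codes") with a pretrained backbone and classifies a query by nearest-neighbor cosine similarity against a labeled gallery. The choice of network, layer, and nearest-neighbor rule are illustrative assumptions, not MirBot's actual pipeline.

    # Minimal sketch: CNN neural codes + similarity search over a labeled gallery.
    # ResNet-18 and 1-NN are assumptions for the example, not the MirBot design.
    import numpy as np
    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image

    # Pretrained backbone; replace the classifier head to expose the feature vector.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def neural_code(path: str) -> np.ndarray:
        """Return an L2-normalized CNN feature vector for one image."""
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            f = backbone(x).squeeze(0).numpy()
        return f / np.linalg.norm(f)

    def most_likely_class(query_path: str, gallery: list) -> str:
        """gallery: list of (feature_vector, synset_label) pairs.
        Cosine similarity reduces to a dot product of unit vectors."""
        q = neural_code(query_path)
        return max((float(q @ f), label) for f, label in gallery)[1]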

    Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

    Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning by integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining the two key principles of modality heterogeneity and interconnections that have driven subsequent innovations, and propose a taxonomy of six core technical challenges covering historical and recent trends: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research identified by our taxonomy.
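    To make two of the taxonomy's challenges concrete, the toy sketch below maps embeddings from two modalities into a shared space (representation) and scores cross-modal correspondence with cosine similarity (alignment). The dimensions, linear projections, and random inputs are arbitrary assumptions for the example, not anything specified by the survey.

    # Toy sketch of representation (shared space) and alignment (pairwise similarity).
    import torch
    import torch.nn.functional as F

    d_text, d_image, d_joint = 300, 512, 128
    text_proj = torch.nn.Linear(d_text, d_joint)
    image_proj = torch.nn.Linear(d_image, d_joint)

    text_emb = torch.randn(4, d_text)    # batch of 4 text embeddings (random stand-ins)
    image_emb = torch.randn(4, d_image)  # batch of 4 image embeddings (random stand-ins)

    # Representation: project both modalities into one joint space and normalize.
    t = F.normalize(text_proj(text_emb), dim=-1)
    v = F.normalize(image_proj(image_emb), dim=-1)

    # Alignment: cosine similarity for every text/image pair; in a trained system
    # the matching pairs on the diagonal should score highest.
    similarity = t @ v.T
    print(similarity.shape)  # torch.Size([4, 4])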

    Multimodal recognition of frustration during game-play with deep neural networks

    Frustration, one aspect of the field of emotion recognition, is of particular interest to the video game industry because it provides information about each individual player's level of engagement. The use of non-invasive strategies to estimate this emotion is therefore a relevant line of research with a direct application to real-world scenarios. While several proposals regarding non-invasive frustration recognition can be found in the literature, they usually rely on hand-crafted features and rarely exploit the potential inherent in the combination of different sources of information. This work therefore presents a new approach that automatically extracts meaningful descriptors from individual audio and video sources using Deep Neural Networks (DNN) and then combines them, with the objective of detecting frustration in game-play scenarios. More precisely, two fusion modalities, namely decision-level and feature-level, are presented and compared with state-of-the-art methods, along with different DNN architectures optimized for each type of data. Experiments performed with a real-world audiovisual benchmarking corpus revealed that the multimodal proposals introduced herein are more suitable than those of a unimodal nature, and that their performance also surpasses that of other state-of-the-art approaches, with error rate improvements of between 40% and 90%.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. The first author acknowledges support from the Spanish "Ministerio de Educación y Formación Profesional" through grant 20CO1/000966. The second and third authors acknowledge support from the "Programa I+D+i de la Generalitat Valenciana" through grants ACIF/2019/042 and APOSTD/2020/256, respectively.
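    The contrast between the two fusion modalities the abstract compares can be sketched as follows: feature-level fusion concatenates per-modality descriptors before a single classifier, while decision-level fusion trains one classifier per modality and combines their output scores. The feature sizes, logistic-regression classifiers, random data, and score-averaging rule are illustrative assumptions, not the paper's DNN architectures.

    # Schematic contrast of feature-level vs. decision-level fusion on toy data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 200
    audio_feat = rng.normal(size=(n, 64))   # stand-ins for audio descriptors
    video_feat = rng.normal(size=(n, 128))  # stand-ins for video descriptors
    y = rng.integers(0, 2, size=n)          # frustrated / not frustrated labels

    # Feature-level fusion: one classifier over the concatenated descriptors.
    fused = np.hstack([audio_feat, video_feat])
    feat_clf = LogisticRegression(max_iter=1000).fit(fused, y)
    feature_pred = feat_clf.predict(fused)

    # Decision-level fusion: per-modality classifiers, probabilities averaged.
    audio_clf = LogisticRegression(max_iter=1000).fit(audio_feat, y)
    video_clf = LogisticRegression(max_iter=1000).fit(video_feat, y)
    p = (audio_clf.predict_proba(audio_feat)[:, 1]
         + video_clf.predict_proba(video_feat)[:, 1]) / 2
    decision_pred = (p >= 0.5).astype(int)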