641 research outputs found

    Data-centric Design and Training of Deep Neural Networks with Multiple Data Modalities for Vision-based Perception Systems

    Get PDF
    224 p.Los avances en visión artificial y aprendizaje automático han revolucionado la capacidad de construir sistemas que procesen e interpreten datos digitales, permitiéndoles imitar la percepción humana y abriendo el camino a un amplio rango de aplicaciones. En los últimos años, ambas disciplinas han logrado avances significativos,impulsadas por los progresos en las técnicas de aprendizaje profundo(deep learning). El aprendizaje profundo es una disciplina que utiliza redes neuronales profundas (DNNs, por sus siglas en inglés) para enseñar a las máquinas a reconocer patrones y hacer predicciones basadas en datos. Los sistemas de percepción basados en el aprendizaje profundo son cada vez más frecuentes en diversos campos, donde humanos y máquinas colaboran para combinar sus fortalezas.Estos campos incluyen la automoción, la industria o la medicina, donde mejorar la seguridad, apoyar el diagnóstico y automatizar tareas repetitivas son algunos de los objetivos perseguidos.Sin embargo, los datos son uno de los factores clave detrás del éxito de los algoritmos de aprendizaje profundo. La dependencia de datos limita fuertemente la creación y el éxito de nuevas DNN. La disponibilidad de datos de calidad para resolver un problema específico es esencial pero difícil de obtener, incluso impracticable,en la mayoría de los desarrollos. La inteligencia artificial centrada en datos enfatiza la importancia de usar datos de alta calidad que transmitan de manera efectiva lo que un modelo debe aprender. Motivada por los desafíos y la necesidad de los datos, esta tesis formula y valida cinco hipótesis sobre la adquisición y el impacto de los datos en el diseño y entrenamiento de las DNNs.Específicamente, investigamos y proponemos diferentes metodologías para obtener datos adecuados para entrenar DNNs en problemas con acceso limitado a fuentes de datos de gran escala. Exploramos dos posibles soluciones para la obtención de datos de entrenamiento, basadas en la generación de datos sintéticos. En primer lugar, investigamos la generación de datos sintéticos utilizando gráficos 3D y el impacto de diferentes opciones de diseño en la precisión de los DNN obtenidos. Además, proponemos una metodología para automatizar el proceso de generación de datos y producir datos anotados variados, mediante la replicación de un entorno 3D personalizado a partir de un archivo de configuración de entrada. En segundo lugar, proponemos una red neuronal generativa(GAN) que genera imágenes anotadas utilizando conjuntos de datos anotados limitados y datos sin anotaciones capturados en entornos no controlados

    An original framework for understanding human actions and body language by using deep neural networks

    Get PDF
    The evolution of both fields of Computer Vision (CV) and Artificial Neural Networks (ANNs) has allowed the development of efficient automatic systems for the analysis of people's behaviour. By studying hand movements it is possible to recognize gestures, often used by people to communicate information in a non-verbal way. These gestures can also be used to control or interact with devices without physically touching them. In particular, sign language and semaphoric hand gestures are the two foremost areas of interest due to their importance in Human-Human Communication (HHC) and Human-Computer Interaction (HCI), respectively. While the processing of body movements play a key role in the action recognition and affective computing fields. The former is essential to understand how people act in an environment, while the latter tries to interpret people's emotions based on their poses and movements; both are essential tasks in many computer vision applications, including event recognition, and video surveillance. In this Ph.D. thesis, an original framework for understanding Actions and body language is presented. The framework is composed of three main modules: in the first one, a Long Short Term Memory Recurrent Neural Networks (LSTM-RNNs) based method for the Recognition of Sign Language and Semaphoric Hand Gestures is proposed; the second module presents a solution based on 2D skeleton and two-branch stacked LSTM-RNNs for action recognition in video sequences; finally, in the last module, a solution for basic non-acted emotion recognition by using 3D skeleton and Deep Neural Networks (DNNs) is provided. The performances of RNN-LSTMs are explored in depth, due to their ability to model the long term contextual information of temporal sequences, making them suitable for analysing body movements. All the modules were tested by using challenging datasets, well known in the state of the art, showing remarkable results compared to the current literature methods

    A Review and Analysis of Eye-Gaze Estimation Systems, Algorithms and Performance Evaluation Methods in Consumer Platforms

    Full text link
    In this paper a review is presented of the research on eye gaze estimation techniques and applications, that has progressed in diverse ways over the past two decades. Several generic eye gaze use-cases are identified: desktop, TV, head-mounted, automotive and handheld devices. Analysis of the literature leads to the identification of several platform specific factors that influence gaze tracking accuracy. A key outcome from this review is the realization of a need to develop standardized methodologies for performance evaluation of gaze tracking systems and achieve consistency in their specification and comparative evaluation. To address this need, the concept of a methodological framework for practical evaluation of different gaze tracking systems is proposed.Comment: 25 pages, 13 figures, Accepted for publication in IEEE Access in July 201

    Predicting user confidence during visual decision making

    Get PDF
    © 2018 ACM People are not infallible consistent “oracles”: their confidence in decision-making may vary significantly between tasks and over time. We have previously reported the benefits of using an interface and algorithms that explicitly captured and exploited users’ confidence: error rates were reduced by up to 50% for an industrial multi-class learning problem; and the number of interactions required in a design-optimisation context was reduced by 33%. Having access to users’ confidence judgements could significantly benefit intelligent interactive systems in industry, in areas such as intelligent tutoring systems and in health care. There are many reasons for wanting to capture information about confidence implicitly. Some are ergonomic, but others are more “social”—such as wishing to understand (and possibly take account of) users’ cognitive state without interrupting them. We investigate the hypothesis that users’ confidence can be accurately predicted from measurements of their behaviour. Eye-tracking systems were used to capture users’ gaze patterns as they undertook a series of visual decision tasks, after each of which they reported their confidence on a 5-point Likert scale. Subsequently, predictive models were built using “conventional” machine learning approaches for numerical summary features derived from users’ behaviour. We also investigate the extent to which the deep learning paradigm can reduce the need to design features specific to each application by creating “gaze maps”—visual representations of the trajectories and durations of users’ gaze fixations—and then training deep convolutional networks on these images. Treating the prediction of user confidence as a two-class problem (confident/not confident), we attained classification accuracy of 88% for the scenario of new users on known tasks, and 87% for known users on new tasks. Considering the confidence as an ordinal variable, we produced regression models with a mean absolute error of ≈0.7 in both cases. Capturing just a simple subset of non-task-specific numerical features gave slightly worse, but still quite high accuracy (e.g., MAE ≈ 1.0). Results obtained with gaze maps and convolutional networks are competitive, despite not having access to longer-term information about users and tasks, which was vital for the “summary” feature sets. This suggests that the gaze-map-based approach forms a viable, transferable alternative to handcrafting features for each different application. These results provide significant evidence to confirm our hypothesis, and offer a way of substantially improving many interactive artificial intelligence applications via the addition of cheap non-intrusive hardware and computationally cheap prediction algorithms
    corecore