10 research outputs found

    Advances in Binary and Multiclass Audio Segmentation with Deep Learning Techniques

    The technological advances of the last decade have completely changed how people interact with multimedia content. This has driven a significant increase in both the generation and the consumption of such content. Manual analysis and annotation of all this information is not feasible given its current volume, which reveals the need for automatic tools that support the transition towards assisted or partially automated workflows. In recent years, most of these tools have been based on neural networks and deep learning. In this context, the work described in this thesis focuses on information extraction from audio signals. In particular, it studies the audio segmentation task, whose main goal is to obtain a sequence of labels that isolate different regions of an input signal according to characteristics described in a predefined set of classes, such as speech, music or noise.
The first part of this dissertation focuses on voice activity detection. Recently, several international evaluation campaigns have proposed this task as one of their challenges. Among them is the Fearless Steps challenge, which works with audio from the recordings of NASA's Apollo missions. For this challenge, a supervised learning solution is proposed that uses a convolutional recurrent network as the classifier. The main contribution is a method that combines information from 1D and 2D filters in the convolutional stage so that it can subsequently be processed by the recurrent stage. Motivated by the introduction of the Fearless Steps data, different domain adaptation techniques are evaluated, with the aim of assessing the performance of a system trained on data from common domains and evaluated on the new domain presented in the challenge. The methods described do not require labels in the target domain, which eases their use in practical applications. In general terms, the methods that seek to minimize the shift in statistical distributions between the source and target domains obtain the most promising results. Recent advances in representations obtained through self-supervised learning have shown large performance improvements in several speech processing tasks. Following this line, these representations are incorporated into the voice activity detection task. The most recent editions of the Fearless Steps challenge changed their purpose and now seek to evaluate the generalization capabilities of systems. The goal of the techniques introduced is therefore to benefit from large amounts of unlabelled data to improve system robustness. The experimental results suggest that self-supervised representation learning yields systems that are much less sensitive to domain shift.
The second part of this document analyses a more generic audio segmentation task that seeks to classify an audio signal simultaneously as speech, music, noise or a combination of these. In the context of the data proposed for the Albayzín 2010 audio segmentation challenge, an approach is presented that uses recurrent neural networks as the main classifier and a postprocessing model built from hidden Markov models. A new block is introduced in the neural architecture with the aim of removing redundant temporal information, improving performance and reducing the number of operations per second at the same time. This proposal outperformed solutions previously presented in the literature as well as similar approaches based on deep neural networks. While the results with self-supervised representation learning were promising in binary segmentation tasks, a number of issues arise when they are applied to multiclass segmentation tasks. The usual data augmentation techniques applied in training force the model to compensate for background noise or music. Under these conditions, the features obtained might not accurately represent classes that are generated in a similar way to the augmented versions seen in training. This limits the overall performance improvement observed when applying these techniques in tasks such as the one proposed in the Albayzín 2010 evaluation.
The last part of this work investigates the application of new loss functions to the audio segmentation task, with the main goal of mitigating the problems that arise from using a limited training dataset. New optimization techniques based on the AUC and partial AUC metrics have been shown to improve on traditional training objectives such as cross-entropy in several detection tasks. With this idea in mind, this thesis introduces these techniques in the music detection task. Considering that the amount of labelled data for this task is limited compared with other tasks, AUC-based loss functions are applied with the aim of improving performance when the training dataset is relatively small. Most systems that use AUC-based optimization techniques are limited to binary tasks, since that is the usual scope of the AUC metric. Moreover, labelling audio with more detailed taxonomies in which multiple options are possible is more complex, so the amount of labelled audio in some multiclass segmentation tasks is limited. As a natural extension, a generalization of the optimization techniques based on the binary AUC metric is proposed so that they can be applied with an arbitrary number of classes. Two different loss functions are introduced, formulated on the multiclass variations of the AUC metric proposed in the literature: one based on a one-versus-one approach and the other on a one-versus-rest approach.
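The one-versus-rest idea mentioned at the end of this abstract can be illustrated with a minimal sketch. This is not the thesis's formulation: the sigmoid relaxation, the `slope` parameter, and all function names below are assumptions chosen for illustration. The sketch computes the empirical AUC as the fraction of correctly ordered positive/negative score pairs, replaces the step function with a sigmoid to get a smooth loss, and averages that loss over classes, treating each class in turn as the positive one.

```python
import numpy as np


def auc(pos_scores, neg_scores):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def soft_auc_loss(pos_scores, neg_scores, slope=10.0):
    """Smooth surrogate for 1 - AUC (illustrative assumption).

    The step function on each pairwise difference is replaced by
    sigmoid(-slope * diff), which is near 0 for well-ordered pairs.
    """
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float(np.mean(1.0 / (1.0 + np.exp(slope * diff))))


def one_vs_rest_auc_loss(scores, labels, slope=10.0):
    """Average the binary surrogate over classes, one class versus the rest.

    scores: array of shape (n_samples, n_classes)
    labels: integer array of shape (n_samples,)
    """
    losses = []
    for c in range(scores.shape[1]):
        pos = scores[labels == c, c]  # scores for class c on its own samples
        neg = scores[labels != c, c]  # scores for class c on all other samples
        if len(pos) > 0 and len(neg) > 0:
            losses.append(soft_auc_loss(pos, neg, slope))
    return float(np.mean(losses))
```

A one-versus-one variant would instead loop over pairs of classes and compare the scores of samples from only those two classes.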

    Contextual Person Identification in Multimedia Data

    We propose methods to improve automatic person identification, regardless of the visibility of a face, by integrating multiple cues, including multiple modalities and contextual information. We propose a joint learning approach that uses contextual information from videos to improve learned face models. Further, we integrate additional modalities in a global fusion framework. We evaluate our approaches on a novel TV series data set consisting of over 100 000 annotated faces.

    Image Analysis Applications of the Maximum Mean Discrepancy Distance Measure

    The need to quantify distance between two groups of objects is prevalent throughout the signal processing world. The difference of group means computed using the Euclidean, or L2 distance, is one of the predominant distance measures used to compare feature vectors and groups of vectors, but many problems arise with it when high data dimensionality is present. Maximum mean discrepancy (MMD) is a recent unsupervised kernel-based pattern recognition method which may improve differentiation between two distinct populations over many commonly used methods such as the difference of means, when paired with the proper feature representations and kernels. MMD-based distance computation combines many powerful concepts from the machine learning literature, such as data distribution-leveraging similarity measures and kernel methods for machine learning. Due to this heritage, we posit that dissimilarity-based classification and changepoint detection using MMD can lead to enhanced separation between different populations. To test this hypothesis, we conduct studies comparing MMD and the difference of means in two subareas of image analysis and understanding: first, to detect scene changes in video in an unsupervised manner, and secondly, in the biomedical imaging field, using clinical ultrasound to assess tumor response to treatment. We leverage effective computer vision data descriptors, such as the bag-of-visual-words and sparse combinations of SIFT descriptors, and choose from an assessment of several similarity kernels (e.g. Histogram Intersection, Radial Basis Function) in order to engineer useful systems using MMD. Promising improvements over the difference of means, measured primarily using precision/recall for scene change detection, and k-nearest neighbour classification accuracy for tumor response assessment, are obtained in both applications.
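As a rough illustration of the distance measure this abstract describes, the sketch below computes the standard biased estimate of squared MMD between two sample sets using a Radial Basis Function kernel. It is a minimal example under stated assumptions, not the author's pipeline: the function names, the `gamma` bandwidth parameter, and the choice of the biased estimator are all illustrative.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of squared MMD between sample sets x and y.

    x: array of shape (n, d); y: array of shape (m, d).
    Equals mean k(x, x') + mean k(y, y') - 2 * mean k(x, y).
    """
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)
```

For changepoint detection of the kind the abstract mentions, one could evaluate `mmd2_biased` on feature vectors from adjacent windows of frames and flag positions where the value peaks; the window size and threshold would need to be chosen for the data at hand.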

    Intelligence artificielle: Les défis actuels et l'action d'Inria - Livre blanc Inria

    Livre blanc Inria N°01. Inria white papers look at major current challenges in informatics and mathematics and show actions conducted by our project-teams to address these challenges. This document is the first produced by the Strategic Technology Monitoring & Prospective Studies Unit. Thanks to a reactive observation system, this unit plays a lead role in supporting Inria to develop its strategic and scientific orientations. It also enables the institute to anticipate the impact of digital sciences on all social and economic domains. It has been coordinated by Bertrand Braunschweig with contributions from 45 researchers from Inria and from our partners. Special thanks to Peter Sturm for his precise and complete review. Thanks also to the STIP service of the Saclay – Île-de-France centre for the final proofreading of the French version.

    Action recognition in depth videos using nonparametric probabilistic graphical models

    Action recognition involves automatically labelling videos that contain human motion with action classes. It has applications in diverse areas such as smart surveillance, human computer interaction and content retrieval. The recent advent of depth sensing technology that produces depth image sequences has offered opportunities to solve the challenging action recognition problem. The depth images facilitate robust estimation of a human skeleton’s 3D joint positions and a high level action can be inferred from a sequence of these joint positions. A natural way to model a sequence of joint positions is to use a graphical model that describes probabilistic dependencies between the observed joint positions and some hidden state variables. A problem with these models is that the number of hidden states must be fixed a priori even though for many applications this number is not known in advance. This thesis proposes nonparametric variants of graphical models with the number of hidden states automatically inferred from data. The inference is performed in a full Bayesian setting by using the Dirichlet Process as a prior over the model’s infinite dimensional parameter space. This thesis describes three original constructions of nonparametric graphical models that are applied in the classification of actions in depth videos. Firstly, the action classes are represented by a Hidden Markov Model (HMM) with an unbounded number of hidden states. The formulation enables information sharing and discriminative learning of parameters. Secondly, a hierarchical HMM with an unbounded number of actions and poses is used to represent activities. The construction produces a simplified model for activity classification by using logistic regression to capture the relationship between action states and activity labels. Finally, the action classes are modelled by a Hidden Conditional Random Field (HCRF) with the number of intermediate hidden states learned from data. 
Tractable inference procedures based on Markov Chain Monte Carlo (MCMC) techniques are derived for all these constructions. Experiments with multiple benchmark datasets confirm the efficacy of the proposed approaches for action recognition

    Brain Computer Interfaces and Emotional Involvement: Theory, Research, and Applications

    This reprint is dedicated to the study of brain activity related to emotional and attentional involvement as measured by Brain–computer interface (BCI) systems designed for different purposes. A BCI system can translate brain signals (e.g., electric or hemodynamic brain activity indicators) into a command to execute an action in the BCI application (e.g., a wheelchair, the cursor on the screen, a spelling device or a game). These tools have the advantage of having real-time access to the ongoing brain activity of the individual, which can provide insight into the user’s emotional and attentional states by training a classification algorithm to recognize mental states. The success of BCI systems in contemporary neuroscientific research relies on the fact that they allow one to “think outside the lab”. The integration of technological solutions, artificial intelligence and cognitive science allowed and will allow researchers to envision more and more applications for the future. The clinical and everyday uses are described with the aim to invite readers to open their minds to imagine potential further developments

    Behavior quantification as the missing link between fields: Tools for digital psychiatry and their role in the future of neurobiology

    The great behavioral heterogeneity observed between individuals with the same psychiatric disorder and even within one individual over time complicates both clinical practice and biomedical research. However, modern technologies are an exciting opportunity to improve behavioral characterization. Existing psychiatry methods that are qualitative or unscalable, such as patient surveys or clinical interviews, can now be collected at a greater capacity and analyzed to produce new quantitative measures. Furthermore, recent capabilities for continuous collection of passive sensor streams, such as phone GPS or smartwatch accelerometer, open avenues of novel questioning that were previously entirely unrealistic. Their temporally dense nature enables a cohesive study of real-time neural and behavioral signals. To develop comprehensive neurobiological models of psychiatric disease, it will be critical to first develop strong methods for behavioral quantification. There is huge potential in what can theoretically be captured by current technologies, but this in itself presents a large computational challenge -- one that will necessitate new data processing tools, new machine learning techniques, and ultimately a shift in how interdisciplinary work is conducted. In my thesis, I detail research projects that take different perspectives on digital psychiatry, subsequently tying ideas together with a concluding discussion on the future of the field. I also provide software infrastructure where relevant, with extensive documentation. Major contributions include scientific arguments and proof of concept results for daily free-form audio journals as an underappreciated psychiatry research datatype, as well as novel stability theorems and pilot empirical success for a proposed multi-area recurrent neural network architecture.