123 research outputs found

    Multi-Speaker Tracking with Audio-Visual Information for Robot Perception (Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots)

    Robot perception plays a crucial role in human-robot interaction (HRI). The perception system provides the robot with information about its surroundings and enables it to give feedback. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where the people are, who is speaking, and what they are talking about. This thesis concentrates on the first two questions, namely speaker tracking and diarization. We use different modalities of the robot's perception system to achieve this goal. Just as seeing and hearing are for a human being, audio and visual information are the critical cues for a robot in a conversational scenario. Advances in computer vision and audio processing over the last decade have revolutionized robot perception. This thesis makes the following contributions. We first develop a variational Bayesian framework for tracking multiple objects. The framework yields closed-form, tractable solutions, which makes the tracking process efficient. It is first applied to visual multiple-person tracking. A birth and death process is built jointly with the framework to deal with the varying number of people in the scene. Furthermore, we exploit the complementarity of vision and robot motor information. On the one hand, the robot's active motion can be integrated into the visual tracking system to stabilize the tracking. On the other hand, visual information can be used to perform motor servoing. Audio and visual information are then combined in the variational framework to estimate the smooth trajectories of speaking people and to infer the acoustic status of each person: speaking or silent. In addition, we apply the model to acoustic-only speaker localization and tracking. Online dereverberation techniques are applied first, and their output is fed to the tracking system.
Finally, a variant of the acoustic speaker tracking model based on the von Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on datasets specific to each application.

    Audio Source Positioning Based on Angle of Arrival Measurements

    Position estimation is used in many contexts, from locating phones with GPS to locating boats with hydrophones. In this thesis we study estimating the position of an audio source from angle of arrival measurements. Several different filters can be applied to measured angles of arrival to deduce the position of the source; the filter chosen for this work is the particle filter. Even though the particle filter is computationally heavier than many other filters, modern computers can simulate hundreds of particles in a short time without much effort. We introduce the reader to the use of the particle filter in positioning, along with its theoretical and mathematical background and that of positioning in a more general sense. The data in this work were recorded either in an anechoic chamber or in a room with no special equipment installed to enhance audio quality. The measurements were made with a mobile device with four microphones. The audio source in the anechoic chamber is a loudspeaker playing speech or a person speaking and walking randomly in the room. If the data contain noise, it is played from loudspeakers in the same space as the source. Other data used in this work were recorded outdoors at a racing event where multiple cars passed the measurement device, and we also use generated data with multiple sources. The data are modeled as a mixture of a von Mises distribution and a uniform distribution. An important parameter of the von Mises distribution is κ, which describes the concentration of the distribution. In this work we show and prove a way to estimate κ with the maximum likelihood method. The results given by the particle filter depend on the chosen value of κ, on the chosen q-value, which controls the smoothness of the result, and on the measurement model.
Finally, we present and compare the results obtained by constant velocity and random walk models with several different q-values.
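
As a rough illustration of what maximum likelihood estimation of κ involves, the sketch below fits a plain von Mises distribution to a list of angles. It is not the thesis's derivation: the abstract models the data as a von Mises and uniform mixture, whereas this example assumes a pure von Mises sample and uses the textbook condition I1(κ)/I0(κ) = R̄, where R̄ is the mean resultant length. The function names `bessel_ratio` and `fit_von_mises` and the cap `kappa_max` are hypothetical choices made for this example.

```python
import math


def bessel_ratio(kappa, tol=1e-16):
    """Return A(kappa) = I1(kappa) / I0(kappa) from the power series
    of the modified Bessel functions of the first kind."""
    half_sq = (kappa / 2.0) ** 2
    i0_term, i1_term = 1.0, kappa / 2.0
    i0_sum, i1_sum = i0_term, i1_term
    k = 1
    while i0_term > tol * i0_sum:
        i0_term *= half_sq / (k * k)        # next term of the I0 series
        i1_term *= half_sq / (k * (k + 1))  # next term of the I1 series
        i0_sum += i0_term
        i1_sum += i1_term
        k += 1
    return i1_sum / i0_sum


def fit_von_mises(angles, kappa_max=100.0):
    """Maximum likelihood fit of a single von Mises distribution.

    Returns (mu, kappa). kappa is found by bisection on A(kappa) = R_bar
    and is capped at kappa_max (an assumption of this sketch).
    """
    n = len(angles)
    c = sum(math.cos(a) for a in angles) / n
    s = sum(math.sin(a) for a in angles) / n
    mu = math.atan2(s, c)      # ML estimate of the mean direction
    r_bar = math.hypot(c, s)   # mean resultant length, between 0 and 1
    lo, hi = 0.0, kappa_max
    for _ in range(60):        # A(kappa) increases monotonically in kappa
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < r_bar:
            lo = mid
        else:
            hi = mid
    return mu, 0.5 * (lo + hi)
```

Calling `fit_von_mises` on angles in radians returns the estimated mean direction in (-π, π] and the concentration. Handling the uniform component of the mixture would require an additional step, such as an EM loop, which is not shown here.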

    Selective attention and speech processing in the cortex

    In noisy and complex environments, human listeners must segregate the mixture of sound sources arriving at their ears and selectively attend to a single source, thereby solving a computationally difficult problem called the cocktail party problem. However, the neural mechanisms underlying these computations are still largely a mystery. Oscillatory synchronization of neuronal activity between cortical areas is thought to play a crucial role in facilitating information transmission between spatially separated populations of neurons, enabling the formation of functional networks. In this thesis, we seek to analyze and model the functional neuronal networks underlying attention to speech stimuli and find that the Frontal Eye Fields play a central 'hub' role in the auditory spatial attention network in a cocktail party experiment. We use magnetoencephalography (MEG) to measure neural signals with high temporal precision, while sampling from the whole cortex. However, several methodological issues arise when undertaking functional connectivity analysis with MEG data. Specifically, volume conduction of electrical and magnetic fields in the brain complicates interpretation of results. We compare several approaches through simulations, and analyze the trade-offs among various measures of neural phase-locking in the presence of volume conduction. We use these insights to study functional networks in a cocktail party experiment. We then construct a linear dynamical system model of neural responses to ongoing speech. Using this model, we are able to correctly predict which of two speakers is being attended by a listener. We then apply this model to data from a task where people were attending to stories with synchronous and scrambled videos of the speakers' faces to explore how the presence of visual information modifies the underlying neuronal mechanisms of speech perception.
This model allows us to probe neural processes as subjects listen to long stimuli, without the need for a trial-based experimental design. We model the neural activity with latent states, and model the neural noise spectrum and functional connectivity with multivariate autoregressive dynamics, along with impulse responses for external stimulus processing. We also develop a new regularized Expectation-Maximization (EM) algorithm to fit this model to electroencephalography (EEG) data.
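
The abstract does not name the phase-locking measures it compares. As an illustration only, the sketch below computes one widely used measure, the phase-locking value (PLV), from two sequences of instantaneous phases in radians. The function name `phase_locking_value` is a hypothetical choice for this example, and extracting the phases from raw signals (for instance with a band-pass filter and a Hilbert transform) is assumed to have been done beforehand.

```python
import cmath


def phase_locking_value(phases_a, phases_b):
    """PLV = |mean(exp(i * (phi_a - phi_b)))|.

    Returns a value between 0 (no consistent phase relation) and
    1 (constant phase difference across all samples).
    """
    if len(phases_a) != len(phases_b) or not phases_a:
        raise ValueError("phase sequences must be non-empty and of equal length")
    total = sum(cmath.exp(1j * (a - b)) for a, b in zip(phases_a, phases_b))
    return abs(total) / len(phases_a)
```

Because the PLV is high for any constant phase difference, including a lag of zero, it can be inflated by the instantaneous signal mixing that volume conduction produces. This is one reason why trade-offs between different phase-locking measures matter in this setting.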

    A Learning-Based EM Clustering for Circular Data with Unknown Number of Clusters

    Clustering is a method for analyzing grouped data. Circular data arise in various applications, such as wind directions and the departure directions of migrating birds or animals. The expectation-maximization (EM) algorithm on mixtures of von Mises distributions is widely used for clustering circular data. In general, the EM algorithm is sensitive to initial values and not robust to outliers, and it also requires the number of clusters to be given a priori. In this paper, we consider a learning-based schema for EM, and then propose a learning-based EM algorithm on mixtures of von Mises distributions for clustering grouped circular data. The proposed clustering method requires no initialization, is robust to outliers, and finds the number of clusters automatically. Several numerical and real data sets are used to compare the proposed algorithm with existing methods. Experimental results and comparisons demonstrate the effectiveness and superiority of the proposed learning-based EM algorithm.
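
For context, the standard EM baseline that the abstract contrasts itself with can be sketched as follows. This is not the proposed learning-based algorithm: the number of clusters is fixed by the length of `init_means`, and the result depends on those initial values, which are exactly the limitations the paper addresses. The function names, the closed-form approximation used for κ, and the cap on the mean resultant length are assumptions of this example.

```python
import math


def log_i0(kappa):
    """Log of the modified Bessel function I0, from its power series."""
    half_sq = (kappa / 2.0) ** 2
    term, total, k = 1.0, 1.0, 1
    while term > 1e-16 * total:
        term *= half_sq / (k * k)
        total += term
        k += 1
    return math.log(total)


def em_von_mises_mixture(angles, init_means, n_iter=50):
    """Plain EM for a mixture of von Mises distributions (angles in radians).

    Returns (weights, means, kappas), one entry per component.
    """
    k = len(init_means)
    means = list(init_means)
    kappas = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities, computed in the log domain for stability.
        log_norm = [math.log(max(w, 1e-300)) - math.log(2 * math.pi) - log_i0(kp)
                    for w, kp in zip(weights, kappas)]
        resp = []
        for a in angles:
            logs = [log_norm[j] + kappas[j] * math.cos(a - means[j]) for j in range(k)]
            top = max(logs)
            probs = [math.exp(v - top) for v in logs]
            z = sum(probs)
            resp.append([p / z for p in probs])
        # M-step: weighted circular mean and concentration for each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            c = sum(r[j] * math.cos(a) for r, a in zip(resp, angles))
            s = sum(r[j] * math.sin(a) for r, a in zip(resp, angles))
            weights[j] = nj / len(angles)
            means[j] = math.atan2(s, c)
            # Cap the mean resultant length so kappa stays finite (assumption).
            r_bar = min(math.hypot(c, s) / nj, 0.99)
            # Closed-form approximation to the ML estimate of kappa.
            kappas[j] = r_bar * (2.0 - r_bar ** 2) / (1.0 - r_bar ** 2)
    return weights, means, kappas
```

The returned means lie in (-π, π], so they should be compared with reference directions modulo 2π. Running the function from different `init_means` on the same data is a simple way to observe the sensitivity to initialization that the abstract mentions.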

    Variational Inference and Learning of Piecewise-linear Dynamical Systems

    Modeling the temporal behavior of data is of primary importance in many scientific and engineering fields. Baseline methods assume that both the dynamic and observation equations follow linear-Gaussian models. However, there are many real-world processes that cannot be characterized by a single linear behavior. Alternatively, it is possible to consider a piecewise-linear model which, combined with a switching mechanism, is well suited when several modes of behavior are needed. Nevertheless, switching dynamical systems are intractable because their computational complexity increases exponentially with time. In this paper, we propose a variational approximation of piecewise-linear dynamical systems. We provide full details of the derivation of two variational expectation-maximization algorithms, a filter and a smoother. We show that the model parameters can be split into two sets, static and dynamic parameters, and that the former can be estimated off-line together with the number of linear modes, that is, the number of states of the switching variable. We apply the proposed method to a visual tracking problem, namely head-pose tracking, and we thoroughly compare our algorithms with several state-of-the-art trackers.

    Connectionist systems for image processing and anomaly detection

    Integrated master's dissertation in Informatics Engineering. Artificial Intelligence (AI) and Data Science (DS) have become increasingly present in our daily lives, and the benefits they have brought to society in recent years are remarkable. The success of AI was driven by the adaptive capacity that machines gained, and it is closely related to their ability to learn. Connectionist systems, presented in the form of Artificial Neural Networks (ANNs), which are inspired by the human nervous system, are among the principal models that allow learning. These models are used in several areas, such as forecasting or classification problems, with increasingly satisfactory results.
One area in which this technology has excelled is Computer Vision (CV), allowing, for example, the location of objects in images and their correct identification. Anomaly Detection (AD) is another field where ANNs have been emerging as a technology for problem solving. In each area, different architectures are used according to the type of data and the problem to be solved. Combining image processing with the detection of anomalies in this type of data, there is a convergence of methodologies that use convolutional modules in architectures dedicated to AD. The main objective of this dissertation is to study the existing techniques in these domains, developing different model architectures and applying them to practical case studies in order to compare the results obtained with each approach. The main practical use case consists of monitoring road pavements using images to automatically identify degraded areas. For that, two software prototypes are proposed to gather and visualise the acquired data. Moreover, the study of ANN architectures to diagnose the asphalt condition through images is the central focus of this work. The methods tested for AD in images include a binary classifier network as a baseline, Autoencoders (AEs) and Variational Autoencoders (VAEs). For the latter two models, supervised and unsupervised practices are also compared, proving their utility in scenarios where no labelled data is available. In a supervised setting, the VAE model makes an excellent distinction between good and degraded pavement areas. When labelled data is not available, the best option is the AE, using the distribution of similarities of good pavement reconstructions to calculate the separation threshold, with both accuracy and precision above 94%. The full development process shows that it is possible to build an alternative solution that decreases operating costs relative to expensive commercial systems and improves usability compared with traditional solutions.
Additionally, two case studies demonstrate the versatility of connectionist systems in solving problems: in Mechanical Structural Design, enabling the modelling of displacement and pressure fields in reinforced plates, and in using AD to identify crowded places through crowd-sensing techniques.

    Bayesian Modelling of Functional Whole Brain Connectivity


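    Data Augmentation Techniques for Natural Language Processing Using Deep Learning-Based Generative Models (딥러닝 기반 생성 모델을 이용한 자연어처리 데이터 증강 기법)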

    Thesis (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. 이상구. Recent advances in the generation capability of deep learning models have spurred interest in utilizing deep generative models for unsupervised generative data augmentation (GDA). Generative data augmentation aims to improve the performance of a downstream machine learning model by augmenting the original dataset with samples generated from a deep latent variable model. It can therefore be regarded as a form of regularization carried out in the data space. This data augmentation approach is attractive to the natural language processing community, because (1) there is a shortage of text augmentation techniques that require little supervision and (2) resource scarcity is prevalent. In this dissertation, we explore the feasibility of exploiting deep latent variable models for data augmentation on three NLP tasks: sentence classification, spoken language understanding (SLU) and dialogue state tracking (DST). These represent NLP tasks of varying complexity and properties: SLU requires multi-task learning of text classification and sequence tagging, while DST requires the understanding of hierarchical and recurrent data structures. For each of the three tasks, we propose a task-specific latent variable model based on conditional, hierarchical and sequential variational autoencoders (VAEs) for multi-modal joint modeling of linguistic features and the relevant annotations. We conduct extensive experiments to statistically justify our hypothesis that deep generative data augmentation is beneficial for all subject tasks. Our experiments show that deep generative data augmentation is effective for the selected tasks, supporting the idea that the technique can potentially be utilized for a wider range of NLP tasks. Ablation and qualitative studies reveal deeper insight into the underlying mechanisms of generative data augmentation.
As a secondary contribution, we also shed light on the recurring posterior collapse phenomenon in autoregressive VAEs and, subsequently, propose novel techniques to reduce the model risk, which is crucial for proper training of complex VAE models, enabling them to synthesize better samples for data augmentation. In summary, this work intends to demonstrate and analyze the effectiveness of unsupervised generative data augmentation in NLP. Ultimately, our approach enables standardized adoption of generative data augmentation, which can be applied orthogonally to existing regularization techniques.