152 research outputs found

    Articulatory Information for Robust Speech Recognition

    Current Automatic Speech Recognition (ASR) systems fail to perform nearly as well as humans because they lack robustness to speech variability and noise contamination. The goal of this dissertation is to investigate these critical robustness issues, to put forth different ways of addressing them, and finally to present an ASR architecture based upon these robustness criteria. Acoustic variations adversely affect the performance of current phone-based ASR systems, in which speech is modeled as `beads-on-a-string', where the beads are the individual phone units. While phone units are distinctive in the cognitive domain, they vary in the physical domain, and their variation arises from a combination of factors including speech style and speaking rate; a phenomenon commonly known as `coarticulation'. Traditional ASR systems address such coarticulatory variations by using contextualized phone units such as triphones. Articulatory phonology accounts for coarticulatory variations by modeling speech as a constellation of constricting actions known as articulatory gestures. In such a framework, speech variations such as coarticulation and lenition are accounted for by gestural overlap in time and gestural reduction in space. To realize a gesture-based ASR system, articulatory gestures have to be inferred from the acoustic signal. At the initial stage of this research, a study was performed using synthetically generated speech to obtain a proof of concept that articulatory gestures can indeed be recognized from the speech signal. It was observed that having vocal tract constriction trajectories (TVs) as an intermediate representation facilitated gesture recognition from the speech signal. At present no natural speech database contains articulatory gesture annotation; hence an automated iterative time-warping architecture is proposed that can annotate any natural speech database with articulatory gestures and TVs. Two natural speech databases, X-ray microbeam and Aurora-2, were annotated; the former was used to train a TV-estimator and the latter was used to train a Dynamic Bayesian Network (DBN) based ASR architecture. The DBN architecture used two sets of observations: (a) acoustic features in the form of mel-frequency cepstral coefficients (MFCCs) and (b) TVs estimated from the acoustic speech signal. In this setup the articulatory gestures were modeled as hidden random variables, eliminating the need for explicit gesture recognition. Word recognition results using the DBN architecture indicate that articulatory representations not only help to account for coarticulatory variations but also significantly improve the noise robustness of ASR systems.
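    As a rough illustration of how the two observation streams feed such a DBN recogniser, here is a minimal sketch assuming librosa for MFCC extraction; the estimate_tvs function is a hypothetical placeholder for the trained TV-estimator, not the dissertation's actual model:

```python
# Minimal sketch: build the two observation streams fed to a DBN recogniser.
# Assumes librosa for MFCCs; `estimate_tvs` is a hypothetical placeholder for
# the trained TV-estimator described in the abstract.
import numpy as np
import librosa

def estimate_tvs(mfcc: np.ndarray) -> np.ndarray:
    """Placeholder TV-estimator mapping acoustic frames to 8 vocal-tract
    constriction trajectories; a real estimator would be a trained model."""
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((mfcc.shape[0], 8))
    return mfcc.T @ projection                           # (frames, 8)

sr = 16000
signal = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)    # 1 s synthetic tone
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            hop_length=160)              # (13, frames)
tvs = estimate_tvs(mfcc)                                  # (frames, 8)

# The DBN treats gestures as hidden variables; its observed nodes per frame
# are the concatenation of the acoustic and articulatory streams.
observations = np.hstack([mfcc.T, tvs])                   # (frames, 21)
print(observations.shape)
```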

    Robust learning of acoustic representations from diverse speech data

    Automatic speech recognition is increasingly applied to new domains. A key challenge is to robustly learn, update and maintain representations to cope with transient acoustic conditions. A typical example is broadcast media, for which speakers and environments may change rapidly, and available supervision may be poor. The concern of this thesis is to build and investigate methods for acoustic modelling that are robust to the characteristics and transient conditions embodied by such media. The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio with approximate labels, but training methods can be sensitive to label errors, and their use is therefore not trivial. State-of-the-art semi-supervised training makes effective use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid overfitting to poor supervision, but does not make use of the transcriptions. Existing approaches that do aim to make use of the transcriptions typically employ an algorithm to filter or combine the transcriptions with the recognition output from a seed model, but the final result does not encode uncertainty. We propose a method to combine the lattice output from a biased recognition pass with the transcripts, crucially preserving uncertainty in the lattice where appropriate. This substantially reduces the word error rate on a broadcast task. The second contribution is a method to factorise representations for speakers and environments so that they may be combined in novel combinations. In realistic scenarios, the speaker or environment transform at test time might be unknown, or there may be insufficient data to learn a joint transform. We show that in such cases, factorised, or independent, representations are required to avoid deteriorating performance. Using i-vectors, we factorise speaker or environment information using multi-condition training with neural networks. Specifically, we extract bottleneck features from networks trained to classify either speakers or environments. The resulting factorised representations prove beneficial when one factor is missing at test time, or when all factors are seen but not in the desired combination. The third contribution is an investigation of model adaptation in a longitudinal setting. In this scenario, we repeatedly adapt a model to new data, with the constraint that previous data becomes unavailable. We first demonstrate the effect of such a constraint, and show that using a cyclical learning rate may help. We then observe that these successive models lend themselves well to ensembling. Finally, we show that this constraint in an active learning setting may be detrimental to performance, and suggest combining active learning with semi-supervised training to avoid biasing the model. The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature extractor, known as SincNet. In contrast to traditional techniques that warp the filterbank frequencies in standard feature extraction, adapting SincNet parameters is more flexible and more readily optimised, whilst maintaining interpretability. On a task adapting from adult to child speech, we show that this layer is well suited for adaptation and is very effective given the small number of adapted parameters.
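    As an illustration of the final contribution, the sketch below (assuming PyTorch) shows a simplified SincNet-style filterbank layer in which only two cutoff parameters per filter are trainable, so that adaptation touches very few parameters; it is a rough stand-in, not the published SincNet implementation or the thesis's adaptation recipe:

```python
import torch
import torch.nn as nn

class SincFilterbank(nn.Module):
    """Simplified SincNet-style layer: each filter is a band-pass defined by
    two learnable cutoffs, so adapting the front end means updating only
    2 * n_filters parameters (a simplification of the published SincNet)."""
    def __init__(self, n_filters=40, kernel_size=129, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Initialise cutoffs roughly uniformly between 30 Hz and Nyquist.
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 100.0))
        t = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", t / sample_rate)         # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                   # x: (batch, 1, samples)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz), max=self.sample_rate / 2)
        # Band-pass kernel = difference of two windowed sinc low-pass filters.
        def lowpass(f):
            return 2 * f.unsqueeze(1) * torch.sinc(2 * f.unsqueeze(1) * self.t)
        kernels = (lowpass(high) - lowpass(low)) * self.window
        kernels = kernels.unsqueeze(1)                      # (n_filters, 1, k)
        return nn.functional.conv1d(x, kernels, padding=self.kernel_size // 2)

# During adaptation, only the cutoff parameters would be optimised while the
# rest of the acoustic model stays frozen.
layer = SincFilterbank()
optimiser = torch.optim.SGD([layer.low_hz, layer.band_hz], lr=1e-3)
out = layer(torch.randn(2, 1, 16000))                       # (2, 40, 16000)
print(out.shape)
```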

    Correspondence problems in computer vision : novel models, numerics, and applications

    Correspondence problems such as optic flow are among the fundamental problems in computer vision. Here, one aims at finding correspondences between the pixels in two (or more) images. The correspondences are described by a displacement vector field that is often found by minimising an energy (cost) function. In this thesis, we present several contributions to the energy-based solution of correspondence problems: (i) We start by developing a robust data term with a high degree of invariance under illumination changes. Then, we design an anisotropic smoothness term that complements the data term, thereby avoiding undesirable interference. Additionally, we propose a simple method for determining the optimal balance between the two terms. (ii) When discretising image derivatives that occur in our continuous models, we show that adapting one-sided upwind discretisations from the field of hyperbolic differential equations can be beneficial. To ensure a fast solution of the nonlinear system of equations that arises when minimising the energy, we use the recent fast explicit diffusion (FED) solver in an explicit gradient descent scheme. (iii) Finally, we present a novel application of modern optic flow methods in which we align exposure series used in high dynamic range (HDR) imaging. Furthermore, we show how the alignment information can be used in a joint super-resolution and HDR method.
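    To make the energy-minimisation idea concrete, here is a minimal Horn-Schunck-style sketch in NumPy: a brightness-constancy data term plus a homogeneous (not anisotropic) smoothness term, minimised by plain gradient descent. This is a textbook simplification, not the robust, illumination-invariant model or the FED solver developed in the thesis:

```python
import numpy as np

def horn_schunck_flow(img1, img2, alpha=1.0, iters=500, lr=0.05):
    """Estimate a dense flow field (u, v) by gradient descent on
    E = sum (Ix*u + Iy*v + It)^2 + alpha * (|grad u|^2 + |grad v|^2)."""
    ix = np.gradient(img1, axis=1)
    iy = np.gradient(img1, axis=0)
    it = img2 - img1
    u = np.zeros_like(img1)
    v = np.zeros_like(img1)
    for _ in range(iters):
        # Data term gradient (linearised brightness constancy).
        residual = ix * u + iy * v + it
        du = 2 * residual * ix
        dv = 2 * residual * iy
        # Smoothness term gradient via the 5-point Laplacian.
        lap_u = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
                 + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        lap_v = (np.roll(v, 1, 0) + np.roll(v, -1, 0)
                 + np.roll(v, 1, 1) + np.roll(v, -1, 1) - 4 * v)
        u -= lr * (du - 2 * alpha * lap_u)
        v -= lr * (dv - 2 * alpha * lap_v)
    return u, v

# Toy example: second frame is the first shifted by one pixel to the right.
frame1 = np.random.rand(64, 64)
frame2 = np.roll(frame1, 1, axis=1)
u, v = horn_schunck_flow(frame1, frame2)
print(u.mean(), v.mean())   # horizontal flow should be positive on average
```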

    Essays on Latent Variable Models and Roll Call Scaling

    This dissertation comprises three essays on latent variable models and Bayesian statistical methods for the study of American legislative institutions and the more general problems of measurement and model comparison. In the first paper, I explore the dimensionality of latent variables in the context of roll call scaling. The dimensionality of ideal points is an aspect of roll call scaling which has received significant attention due to its impact on both substantive and spatial interpretations of estimates. I find that previous evidence for unidimensional ideal points is a product of the Scree procedure. I propose a new varying dimensions model of legislative voting and a corresponding Bayesian nonparametric estimation procedure (BPIRT) that allows for probabilistic inference on the number of dimensions. Using this approach, I show that there is strong evidence for multidimensional ideal points in the U.S. Congress and that using only a single dimension misses much of the disagreement that occurs within parties. I reexamine theories of U.S. legislative voting and find that empirical evidence for these models is conditional on unidimensionality. In the second paper, I expand on the varying dimensions model of legislative voting and explore the role of group dependencies in legislative voting. Assumptions about independence of observations in the scaling model ignore the possibility that members of the voting body have shared incentives to vote as a group and lead to problems in estimating ideal points and corresponding latent dimensions. I propose a new ideal point model, clustered beta process IRT (C-BPIRT), that explicitly allows for group contributions in the underlying spatial model of voting. I derive a corresponding empirical model that uses flexible Bayesian nonparametric priors to estimate group effects in ideal points and the corresponding dimensionality of the ideal points. I apply this model to the 107th U.S. House (2001 - 2003) and the 88th U.S. House (1963 - 1965) and show how modeling group dynamics improves the estimation and interpretation of ideal points. Similarly, I show that existing methods of ideal point estimation produce results that are substantively misaligned with historical studies of the U.S. Congress. In the third and final paper, I dive into the more general problem of Bayesian model comparison and marginal likelihood computation. Various methods of computing the marginal likelihood exist, such as importance sampling or variational methods, but they frequently provide inaccurate results. I demonstrate that point estimates for the marginal likelihood achieved using importance sampling are inaccurate in settings where the joint posterior is skewed. I propose a light extension to the variational method that treats the marginal likelihood as a random variable and create a set of intervals on the marginal likelihood which do not share the same inaccuracies. I show that these new intervals, called kappa bounds, provide a computationally efficient and accurate way to estimate the marginal likelihood under arbitrarily complex Bayesian model specifications. I show the superiority of kappa bounds estimates of the marginal likelihood through a series of simulated and real-world data examples, including comparing measurement models that estimate latent variables from ordered discrete survey data.
    PhD, Political Science, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/163023/1/kamcal_1.pd
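    The sketch below (assuming NumPy and SciPy) illustrates the standard importance-sampling estimator of the marginal likelihood on a conjugate normal model where the exact value is available for comparison; the kappa-bound intervals proposed in the third paper are not reproduced here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Data from a normal likelihood with unknown mean mu (known sigma = 1) and a
# N(0, 1) prior on mu, so the marginal likelihood has a closed form to check
# the Monte Carlo estimate against.
y = rng.normal(0.5, 1.0, size=50)

def log_joint(mu):
    return (stats.norm.logpdf(mu, 0.0, 1.0)
            + stats.norm.logpdf(y, mu, 1.0).sum())

# Importance sampling with a normal proposal centred near the posterior mode
# and slightly wider than the posterior.
proposal = stats.norm(y.mean(), 0.2)
draws = proposal.rvs(size=20_000, random_state=2)
log_w = np.array([log_joint(m) for m in draws]) - proposal.logpdf(draws)
log_ml_is = np.logaddexp.reduce(log_w) - np.log(len(draws))

# Exact log marginal likelihood for this conjugate model (sigma = tau = 1).
n, ybar = len(y), y.mean()
log_ml_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * (y ** 2).sum()
                - 0.5 * np.log(n + 1) + (n * ybar) ** 2 / (2 * (n + 1)))
print(log_ml_is, log_ml_exact)
```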

    Generative Interpretation of Medical Images


    Morphology of the inner structures of the facial skeleton in Homo neanderthalensis and the case-study of the Neanderthal from Altamura (Bari, Italy)

    The PhD project aims to provide an accurate anatomical characterization of the facial regions (with a focus on the para-nasal areas) in the fossil human species Homo neanderthalensis, whose peculiar facial morphology is the subject of unresolved hypotheses about adaptation to climate and/or phylogenetic factors. Both may underlie the variability of Neanderthals and, more generally, of the human populations of the Middle and Upper Pleistocene of Europe, i.e. from around 800 to 11 thousand years ago (ka). In this timespan a differential development of a set of cranial features can be seen, which was summarized by J.J. Hublin and colleagues in the 'accretion model'. In this scenario, a Neanderthal specimen from Italy, known as the 'Altamura Man' and discovered in 1993 in the Lamalunga karst system in Apulia (southern Italy), represents a crucial subject of study because of its unique state of preservation and its antiquity, dated between 172 and 130 ka. The nearly complete skeleton is still preserved in situ for several reasons, among them its exceptional completeness, and it has therefore been the subject of a virtual paleoanthropology study aimed at the reconstruction and observation of facial structures that are often damaged or completely absent in the fossil record.

    Video-Based Environment Perception for Automated Driving using Deep Neural Networks

    Automated vehicles require highly accurate environment perception to drive safely and comfortably. At the same time, the perception algorithms must meet the real-time requirements of the application with the available computing power. Camera images are a very important source of information for automated vehicles. They contain more detail than data from other sensors such as lidar or radar and are often comparatively inexpensive. This makes it possible to equip an automated vehicle with a surround-view sensor setup without increasing the overall cost too much. In this work, we present an efficient and accurate approach to video-based environment perception for automated vehicles. It is based on deep learning and addresses the problems of object detection, object tracking, and semantic segmentation of camera images. We first propose a fast CNN architecture for simultaneous object detection and semantic segmentation. This architecture is scalable, so that accuracy can easily be traded for computation time by changing a single scaling factor. We then modify this architecture to predict embedding vectors for each detected object. These embedding vectors are used as an association metric for object tracking. They are also used for a novel non-maximum suppression algorithm, which we call FeatureNMS. FeatureNMS can achieve a higher recall in crowded scenes where the assumptions of the classical NMS algorithm do not hold. We subsequently extend our single-image CNN architecture to a multi-image architecture that takes two consecutive video frames as input. The multi-image architecture estimates the optical flow between the two frames within the neural network. This makes it possible to estimate a displacement vector between the frames for each detected object. These displacement vectors are also used as an association metric for object tracking. Finally, we present a simple tracking-by-detection approach that requires little computing power. It relies on a strong object detector and on the embedding and displacement vectors estimated by our CNN architecture. The high recall of the object detector leads to frequent detections of the tracked objects. Our discriminative association metrics, based on the embedding and displacement vectors, enable reliable assignment of new detections to existing tracks. These two components allow the use of a simple constant-velocity motion model together with a Kalman filter. The video-based environment perception methods presented here achieve good results on the challenging Cityscapes and BDD100K datasets. At the same time, they are computationally efficient and can meet the real-time requirements of the application. We successfully use the proposed architecture within the perception module of an automated test vehicle, where it has proven itself in practice.
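    As a rough illustration of the association step in such a tracking-by-detection pipeline, the sketch below (assuming NumPy and SciPy) matches detections to tracks by combining an embedding distance with a displacement-compensated position distance via the Hungarian algorithm; it is a simplified stand-in, not the thesis's actual CNN, NMS variant, or Kalman filter:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embeddings, track_positions,
              det_embeddings, det_positions,
              det_displacements, w_emb=1.0, w_pos=0.05, max_cost=1.5):
    """Match detections to tracks with the Hungarian algorithm.
    Cost = cosine distance of embeddings + weighted distance between the
    track position and the detection position shifted back by its predicted
    displacement (a stand-in for the constant-velocity prediction)."""
    te = track_embeddings / np.linalg.norm(track_embeddings, axis=1, keepdims=True)
    de = det_embeddings / np.linalg.norm(det_embeddings, axis=1, keepdims=True)
    emb_cost = 1.0 - te @ de.T                           # (n_tracks, n_dets)
    predicted = det_positions - det_displacements        # where the object was
    pos_cost = np.linalg.norm(track_positions[:, None] - predicted[None], axis=2)
    cost = w_emb * emb_cost + w_pos * pos_cost
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]

# Toy example: two tracks, two detections with small appearance noise.
rng = np.random.default_rng(0)
tracks_emb = rng.standard_normal((2, 16))
dets_emb = tracks_emb[::-1] + 0.05 * rng.standard_normal((2, 16))
tracks_pos = np.array([[10.0, 20.0], [100.0, 40.0]])
dets_pos = np.array([[102.0, 41.0], [12.0, 21.0]])       # detection order swapped
dets_disp = np.array([[2.0, 1.0], [2.0, 1.0]])
print(associate(tracks_emb, tracks_pos, dets_emb, dets_pos, dets_disp))
# Expected matches: track 0 -> detection 1, track 1 -> detection 0.
```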