7 research outputs found

    ROBUST SPEAKER RECOGNITION BASED ON LATENT VARIABLE MODELS

    Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions. Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel procedure for visualizing supervectors is presented, which gives qualitative insight into the information being captured. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the entries of the dictionary is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy. An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then to model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance.
These approaches lack closed-form solutions, and therefore are hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably with those of competing approaches. Also, our approach has closed-form solutions and scales gracefully to large datasets. Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.
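
The dictionary-based decomposition described in this abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: the abstract does not say how the supervector is encoded or how the dictionary is learned, so the ridge-regularised least-squares encoder, the random dictionaries, and the names `remove_distortion`, `speaker_atoms`, and `distortion_atoms` are all assumptions made for illustration.

```python
import numpy as np


def remove_distortion(supervector, speaker_atoms, distortion_atoms, ridge=1e-3):
    """Encode a supervector over [speaker_atoms | distortion_atoms] and
    return only the speaker part of the reconstruction."""
    dictionary = np.hstack([speaker_atoms, distortion_atoms])
    n_atoms = dictionary.shape[1]
    # Ridge-regularised least squares (an assumption; the abstract does not
    # specify the encoder): coef = (D^T D + ridge * I)^-1 D^T s
    gram = dictionary.T @ dictionary + ridge * np.eye(n_atoms)
    coef = np.linalg.solve(gram, dictionary.T @ supervector)
    n_speaker = speaker_atoms.shape[1]
    # Discard the contribution of the distortion atoms.
    return speaker_atoms @ coef[:n_speaker]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 20
    # 15 + 10 = 25 atoms in 20 dimensions, so the dictionary is overcomplete.
    speaker_atoms = rng.standard_normal((dim, 15))
    distortion_atoms = rng.standard_normal((dim, 10))
    supervector = rng.standard_normal(dim)
    cleaned = remove_distortion(supervector, speaker_atoms, distortion_atoms)
    print(cleaned.shape)
```

In this sketch the dictionaries are random placeholders; in the setting the abstract describes they would be learned from a large collection of data, with one subset of entries capturing speaker information and the other capturing distortions.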

    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data that is being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. In order to work with data at this scale, methods not only need to be effective on the underlying data, but also have to scale to larger data collections. This thesis focuses on developing scalable and effective methods targeted at different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that utilizes GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations in order to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is better utilized for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech ("utterances"). This thesis focuses on the text-independent framework, and three new recognition frameworks were developed for this problem. We proposed a kernelized Renyi distance based similarity scoring for speaker recognition.
While its performance is promising, it does not generalize well with limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
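
The analogy between kriging and Gaussian process regression mentioned in this abstract can be made concrete with a small CPU-only sketch of the GP posterior mean, used here to interpolate scattered observations onto a regular grid. It is an illustration, not the thesis's GPU framework: the squared-exponential kernel, the `length_scale` and `noise` values, and the function names `rbf_kernel` and `gp_interpolate` are assumptions.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of points (one per row)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_interpolate(x_obs, y_obs, x_grid, length_scale=1.0, noise=1e-6):
    """Posterior mean of a zero-mean GP at x_grid given scattered observations."""
    k_obs = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    k_cross = rbf_kernel(x_grid, x_obs, length_scale)
    # Solving this N x N system is the expensive step for large N.
    weights = np.linalg.solve(k_obs, y_obs)
    return k_cross @ weights


if __name__ == "__main__":
    # Scattered 1-D observation locations and values (made-up numbers).
    x_obs = np.array([[0.0], [1.0], [2.5], [4.0]])
    y_obs = np.array([1.0, -1.0, 0.5, 0.0])
    # Regular grid onto which the scattered data is interpolated.
    x_grid = np.linspace(0.0, 4.0, 9).reshape(-1, 1)
    print(gp_interpolate(x_obs, y_obs, x_grid))
```

The dense linear solve is what makes this approach costly as the number of observations grows, which is the kind of bottleneck the abstract's GPU acceleration is aimed at.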

    Analysis and resynthesis of polyphonic music

    This thesis examines applications of Digital Signal Processing to the analysis, transformation, and resynthesis of musical audio. First I give an overview of the human perception of music. I then examine in detail the requirements for a system that can analyse, transcribe, process, and resynthesise monaural polyphonic music. I then describe and compare the possible hardware and software platforms. After this I describe a prototype hybrid system that attempts to carry out these tasks using a method based on additive synthesis. Next I present results from its application to a variety of musical examples, and critically assess its performance and limitations. I then address these issues in the design of a second system based on Gabor wavelets. I conclude by summarising the research and outlining suggestions for future developments.
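
For readers unfamiliar with additive synthesis, the resynthesis idea behind the prototype system can be sketched as a sum of sinusoidal partials. This is a generic illustration rather than the thesis's system: the function name `additive_synth`, the fixed (non-time-varying) frequencies and amplitudes, and the example partials are assumptions.

```python
import math


def additive_synth(partials, duration, sample_rate=8000):
    """Sum sinusoidal partials given as (frequency_hz, amplitude) pairs.

    Returns a list of samples; frequencies and amplitudes are held constant
    over the whole duration, which is a simplification.
    """
    n_samples = int(duration * sample_rate)
    return [
        sum(
            amp * math.sin(2.0 * math.pi * freq * n / sample_rate)
            for freq, amp in partials
        )
        for n in range(n_samples)
    ]


if __name__ == "__main__":
    # A 220 Hz fundamental with two weaker harmonics (made-up amplitudes).
    note = additive_synth([(220.0, 1.0), (440.0, 0.5), (660.0, 0.25)], duration=0.5)
    print(len(note))
```

A polyphonic resynthesis would sum the partials of several notes at once, typically with amplitudes and frequencies that vary over time according to the preceding analysis stage.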

    "And, so today, a new season of American renewal has begun." A Critical Metaphor Analysis of NEWNESS in American Presidential Discourse.

    This book-length study examines the importance of the concept of NEWNESS in American presidential discourse, with a focus on the inaugural address as a major institutionalised form of public address on the American political calendar. Throughout its seven themed chapters, this doctoral thesis reconceptualises the inaugural as an epideictic genre with the dual function of cultural transmission and political renewal. In line with constructivist theorising on meaning, it is argued that the very concept of NEWNESS furnishes considerable ideological potential which is used strategically in the inaugural address to build a communicative memory. After setting the stage, in chapter 1, for an account of American political discourse that is shaped by the appropriation of newness to construct an identity as a ‘new nation’ of the ‘new world’, chapter 2 goes on to expose the interdisciplinary nature of political discourse in laying the foundations for a dynamic conception of meaning representation in discourse as the product of individual and collective mental processes. Chapter 3 moves on to track the development of the concept of NEWNESS against the backdrop of the generational cycles of American national history, focusing on the important notion of ‘covenant’. Covenant theology became one of the cornerstones of the American tradition of public oratory, which itself has been reported to be one of the main contributing factors to the rise of the ‘rhetorical presidency’. Chapters 4 to 6 form the centrepiece of the qualitative analysis of NEWNESS as a powerful lexical and conceptual resource in framing American presidential discourse. The conceptual domain of NEWNESS is fully fleshed out in chapter 4, which begins by outlining its conceptual history and subsequently highlights its importance as a cognitive cultural model. 
The analysis revealed that, when the concept is encoded in a language-specific format, for example as the adjective new, its meaning potential clusters around two major semantic subsets: newness may, on the one hand, be understood in terms of time and origin or may, on the other, be expressed in terms of kind and quality. Not only do these construals highlight the lexical concept’s polysemy, they also demonstrate that the meaning potential of new is grounded in temporal and spatial cognition. As such, the cognitive domain of NEWNESS is metaphorically structured and is conceptually linked to the concept of CHANGE. Cognitively, both are united in the complex system of the Event Structure Metaphor (ESM), whose central mapping is the conceptual metaphor CHANGE IS MOTION. In chapter 5, the inaugural address, which makes up the database of the present study, is introduced as a specific genre of presidential speech, including its main communicative functions, followed by a reconceptualisation of the inaugural address as an epideictic genre. Prior to the in-depth qualitative analysis of the entrenchment of NEWNESS in the American presidential inaugural address, the analytical procedures are established. The analysis in chapter 6 had two main objectives. The first was to examine the construal of NEWNESS in presidential discourse on the basis of all nineteenth-century and twentieth-century inaugural speeches. The second was to uncover the mechanisms by which this particular construal may contribute to the representation of ideological discourse structures on both the sentence and textual levels. Results showed that, while in nineteenth-century inaugurals the construal of NEWNESS chiefly serves to encode specific participant roles, such as the American nation as an agent, the sentential event schemas identified in the twentieth-century speeches tend to invoke the contrast between the categories OLD and NEW by means of antonym construal.
This patterning gives rise to antagonistic frames, which assign positive values to ‘new’ and negative values to ‘old’. The analysis of larger portions of discourse revealed that meaning representation in the inaugural address largely occurs on the basis of motion concepts, including the conceptual metaphor CHANGE IS MOTION as the central mapping of the ESM. As shown, these metaphorical frames help entrench ideological structures, which operate both locally and globally in presidential discourse by expressing opposing value judgements about the presence or absence of CHANGE and NEWNESS, respectively. The thesis concludes that the concept of NEWNESS provides a powerful conceptual frame for what may be considered the inaugural’s most crucial rhetorical function, that is, the transmission of core values of the American public philosophy or civil religion. Conceptual metaphor proved to be a suitable analytical tool for identifying the knowledge structures that underpin political cognition, entrenched as cognitive models in which information about beliefs, values and social practices is stored.