
    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data being collected has increased severalfold. One of the objectives of data collection is to learn useful underlying patterns. To work with data at this scale, methods need not only to be effective on the underlying data but also to scale to larger data collections. This thesis focuses on developing scalable and effective methods for different domains, geostatistics and speaker recognition in particular. Initially we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. We propose an improved numerical algorithm that uses GPU parallelization to further enhance the computational performance of kernel regression. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations to be practical. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is better utilized for enhanced computational performance. Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech - "utterances". This thesis focuses on the text-independent framework, and three new recognition frameworks were developed for this problem. We propose a kernelized Renyi distance based similarity scoring for speaker recognition.
While its performance is promising, it does not generalize well with limited training data and therefore does not compare well to state-of-the-art recognition systems. These systems compensate for the variability in the speech data due to the message, channel variability, noise and reverberation. State-of-the-art systems model each speaker as a mixture of Gaussians (GMM) and compensate for the variability (termed "nuisance"). We propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm is used to achieve a state-of-the-art speaker ID system that shows results competitive with the best systems reported on in NIST's 2010 Speaker Recognition Evaluation.
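
    The link the abstract draws between kriging and Gaussian process regression can be illustrated with a small CPU-only sketch. This is not the thesis's GPU implementation: the squared-exponential kernel, the zero-mean assumption (closest to simple kriging), and the names rbf_kernel and gp_interpolate are illustrative choices made here.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between rows of a (n, d) and b (m, d).
    # The kernel choice is an assumption; the abstract does not name one.
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_interpolate(x_obs, y_obs, x_grid, length_scale=1.0, noise=1e-6):
    # Zero-mean Gaussian process regression: posterior mean at x_grid
    # given scattered observations (x_obs, y_obs).
    K = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    L = np.linalg.cholesky(K)                       # K = L @ L.T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    K_star = rbf_kernel(x_grid, x_obs, length_scale)
    return K_star @ alpha

# Scattered 1-D observations interpolated onto a regular grid.
x_obs = np.array([[0.0], [1.0], [2.5], [4.0]])
y_obs = np.sin(x_obs[:, 0])
x_grid = np.linspace(0.0, 4.0, 9)[:, None]
y_grid = gp_interpolate(x_obs, y_obs, x_grid)
```

    The dense solve costs O(n^3) in the number of observations, which is the kind of expense that motivates the accelerations described in the abstract.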

    Learning An Invariant Speech Representation

    Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples as humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it onto each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features. Comment: CBMM Memo No. 022, 5 pages, 2 figures
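
    The projection-and-pooling step described above can be sketched in a few lines. The sketch assumes circular shifts as a stand-in for the transformations, since the abstract does not specify which ones the paper stores; the names orbit and signature, the histogram bin count, and the value range are likewise hypothetical.

```python
import numpy as np

def orbit(template, n_shifts):
    # Template orbit: the template under every transformation considered.
    # Circular shifts are an assumed stand-in for the real transformations.
    return np.stack([np.roll(template, k) for k in range(n_shifts)])

def signature(signal, templates, bins=8, value_range=(-1.0, 1.0)):
    # For each template, project the signal onto the full orbit (filtering)
    # and summarize the projections by a normalized histogram (pooling),
    # i.e. a one-dimensional empirical distribution.
    parts = []
    for t in templates:
        projections = orbit(t, len(t)) @ signal
        hist, _ = np.histogram(projections, bins=bins, range=value_range)
        parts.append(hist / len(projections))
    return np.concatenate(parts)

# Small integer-valued toy signals keep the dot products exact.
rng = np.random.default_rng(0)
templates = rng.integers(-3, 4, size=(2, 16)).astype(float)
x = rng.integers(-3, 4, size=16).astype(float)
sig = signature(x, templates, bins=8, value_range=(-150.0, 150.0))
sig_shifted = signature(np.roll(x, 5), templates, bins=8,
                        value_range=(-150.0, 150.0))
```

    Because shifting the signal only permutes its projections onto a full shift orbit, sig and sig_shifted coincide in this toy setting. With transformations that do not form a group, or with orbits that are only partially sampled, the representation would be only approximately invariant.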

    Representation, space and Hollywood Squares: Looking at things that aren't there anymore

    It has been argued that the human cognitive system is capable of using spatial indexes or oculomotor coordinates to relieve working memory load (Ballard, Hayhoe, Pook & Rao, 1997), track multiple moving items through occlusion (Scholl & Pylyshyn, 1999) or link incompatible cognitive and sensorimotor codes (Bridgeman & Huemer, 1998). Here we examine the use of such spatial information in memory for semantic information. Previous research has often focused on the role of task demands and the level of automaticity in the encoding of spatial location in memory tasks. We present five experiments where location is irrelevant to the task, and participants' encoding of spatial information is measured implicitly by their looking behavior during recall. In a paradigm developed from Spivey and Geng (submitted), participants were presented with pieces of auditory, semantic information as part of an event occurring in one of four regions of a computer screen. In front of a blank grid, they were asked a question relating to one of those facts. Under certain conditions it was found that during the question period participants made significantly more saccades to the empty region of space where the semantic information had been previously presented. Our findings are discussed in relation to previous research on memory and spatial location, the dorsal and ventral streams of the visual system, and the notion of a cognitive-perceptual system using spatial indexes to exploit the stability of the external world.