5 research outputs found

    Phonetic Temporal Neural Model for Language Identification

    Deep neural models, particularly the LSTM-RNN, have shown great potential for language identification (LID). However, phonetic information has been largely overlooked by most existing neural LID methods, even though it has been used very successfully in conventional phonetic LID systems. We present a phonetic temporal neural model for LID: an LSTM-RNN LID system that accepts as input phonetic features produced by a phone-discriminative DNN, rather than raw acoustic features. The new model is similar in spirit to traditional phonetic LID methods, but the phonetic knowledge it uses is much richer: it is at the frame level and encodes compact information about all phones. Our experiments on the Babel database and the AP16-OLR database demonstrate that the phonetic temporal neural approach is very effective and significantly outperforms existing acoustic neural models. It also outperforms the conventional i-vector approach on short utterances and in noisy conditions. (Comment: submitted to TASL.)
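The architecture described above (frame-level phonetic features from a phone-discriminative DNN, consumed by an LSTM that emits an utterance-level language posterior) can be sketched as follows. This is a minimal illustrative forward pass with hypothetical dimensions and random weights, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyLSTM:
    """Single-layer LSTM mapping a sequence of frame-level phonetic
    feature vectors to a posterior over languages (dimensions are
    hypothetical; a real system would train these weights)."""

    def __init__(self, d_in, d_hid, n_langs, seed=0):
        rng = np.random.default_rng(seed)
        # one stacked weight matrix for the four gates (i, f, o, g)
        self.W = rng.normal(0.0, 0.1, (4 * d_hid, d_in + d_hid))
        self.b = np.zeros(4 * d_hid)
        self.W_out = rng.normal(0.0, 0.1, (n_langs, d_hid))
        self.d_hid = d_hid

    def forward(self, frames):
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        for x in frames:  # one phonetic feature vector per frame
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        logits = self.W_out @ h          # utterance-level language scores
        e = np.exp(logits - logits.max())
        return e / e.sum()               # softmax posterior over languages
```

In the paper's setup the input frames would be DNN-derived phonetic features rather than raw acoustics; here they are just placeholder vectors.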

    Orchestration: the movement and vocal behavior of free-ranging Norwegian killer whales (Orcinus orca)

    Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology and the Woods Hole Oceanographic Institution, June 2008.

    Studying the social and cultural transmission of behavior among animals helps to identify patterns of interaction and information content flowing between individuals. Killer whales are likely to acquire traits culturally, given their population-specific feeding behaviors and group-distinctive vocal repertoires. I used digital tags to explore the contributions of individual Norwegian killer whales to group carousel feeding and the relationships between vocal and non-vocal activity. Periods of tail slapping to incapacitate herring during feeding were characterized by elevated movement variability, heightened vocal activity, and call types containing additional orientation cues. Tail slaps produced by tagged animals were identified using a rapid pitch change and occurred primarily within 20 m of the surface. Two simultaneously tagged animals maneuvered similarly when tail slapping within 60 s of one another, indicating that the position and composition of the herring ball influenced their behavior. Two types of behavioral sequence preceding the tight circling of carousel feeding were apparent. First, the animals engaged in periods of directional swimming; they were silent in 2 of 3 instances, suggesting they may have located other foraging groups by eavesdropping. Second, tagged animals made broad horizontal loops as they dove, in a manner consistent with corralling. All 4 of these occasions were accompanied by vocal activity, indicating that corralling and tail slapping may benefit from social communication. No significant relationship between call types and the actual movement measurements was found.

    Killer whale vocalizations traditionally have been classified into discrete call types. Using human speech processing techniques, I considered instead that calls are comprised of shared segments that can be recombined to form the stereotyped and variable repertoire. In a classification experiment, characterizing calls by the whole call, a set of unshared segments, or a set of shared segments yielded equivalent performance. The shared segments required less information to parse the same vocalizations, suggesting a more parsimonious system of representation. This closer examination of the movements and vocalizations of Norwegian killer whales, combined with future work on ontogeny and transmission, will inform our understanding of whether and how culture plays a role in achieving population-specific behaviors in this species.

    Funding sources: The Ocean Life Institute at WHOI and the National Geographic Society, the National Defense Science and Engineering Graduate Fellowship, a National Science Foundation Graduate Fellowship, the Academic Programs Office at WHOI, and Dennis McLaughlin at MIT.

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    In this Thesis, we investigated how to efficiently apply subspace Gaussian mixture modeling techniques to two speech technology problems: automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important such techniques in this Thesis is joint factor analysis (JFA). JFA is essentially a Gaussian mixture model in which the mean of each component is expressed as a sum of low-dimensional factors, each representing a different contribution to the speech signal.
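In the notation conventional for these models (the symbols below are standard in the literature, not spelled out in the abstract itself), the factorization of a Gaussian-mean supervector M reads:

```latex
% JFA: supervector M as the UBM mean m plus speaker (Vy), channel (Ux),
% and residual (Dz) contributions
M = m + V y + U x + D z
% i-Vector model: a single total-variability subspace T; w is the i-Vector
M = m + T w
```

The second equation is the "single subspace including all sources of variability" mentioned in the text: the low-dimensional vector w is the i-Vector representing one utterance.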
    This factorization makes it possible to compensate for undesired sources of variability, such as the channel. JFA was investigated both as a final classifier and as a feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. An i-Vector is thus a low-dimensional representation of a single utterance, and i-Vectors have proved to be a very powerful feature for many machine learning problems. We investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements of up to 50% with i-Vectors with respect to JFA. We found that the channel subspace of a JFA model also contains language information, whereas i-Vectors do not discard any language information; moreover, they help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution whose covariance matrix is shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced prosodic and formant information into the i-Vector system. Its performance was below that of the acoustic system, but the two were found to be complementary, and their fusion yielded up to a 20% relative improvement over the acoustic system alone. Given the success in LID, and the fact that i-Vectors capture all the information present in the data, we decided to use i-Vectors for another task: the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to rate the intelligibility of their patients objectively and consistently.
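The classifier described above (one Gaussian per language, covariance shared among languages) can be sketched as follows. This is a minimal illustrative version operating on generic feature vectors, with synthetic data standing in for i-Vectors; it is not the thesis's implementation:

```python
import numpy as np

def fit_shared_cov_gaussian(X, y):
    """Fit one mean per class and a single covariance matrix pooled
    across all classes (X: N x D features, y: N class labels)."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # pool the within-class scatter over all classes
    centered = np.vstack([X[y == c] - X[y == c].mean(axis=0) for c in classes])
    cov = centered.T @ centered / len(X)
    return classes, means, cov

def classify(x, classes, means, cov):
    """Pick the class with the highest Gaussian log-likelihood; with a
    shared covariance (and equal priors) this is a linear decision rule."""
    inv = np.linalg.inv(cov)
    scores = [-0.5 * (x - m) @ inv @ (x - m) for m in means]
    return classes[int(np.argmax(scores))]
```

Because the covariance is shared, no per-class covariance estimation is needed, which is part of why the method is simple and fast on small i-Vector sets.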
    In this case, the input features were extracted from short-term spectral information, and intelligibility was assessed from the i-Vectors calculated for a set of words uttered by the tested speaker. We found that performance was clearly much better when data from the person being evaluated were available for training. This limitation could be relaxed with larger training databases; however, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speech open to the research community. Finally, the same i-Vector-based architecture for intelligibility assessment was used to predict the accuracy that an automatic speech recognition (ASR) system would obtain with dysarthric speakers; the only difference between the two systems was the ground-truth label set used for training. Predicting the performance of an ASR system would increase the confidence of speech therapists in these systems and would diminish health-related costs. The results were not as satisfactory as in the previous case, probably because an ASR system is complex and its accuracy is very difficult to predict from acoustic information alone. Nonetheless, we believe this opens a door to an interesting research direction for both problems.

    Movement and vocal behavior of free-ranging Norwegian killer whales (Orcinus orca)

    Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Biology; and the Woods Hole Oceanographic Institution), 2008. Includes bibliographical references.

    Studying the social and cultural transmission of behavior among animals helps to identify patterns of interaction and information content flowing between individuals. Killer whales are likely to acquire traits culturally, given their population-specific feeding behaviors and group-distinctive vocal repertoires. I used digital tags to explore the contributions of individual Norwegian killer whales to group carousel feeding and the relationships between vocal and non-vocal activity. Periods of tail slapping to incapacitate herring during feeding were characterized by elevated movement variability, heightened vocal activity, and call types containing additional orientation cues. Tail slaps produced by tagged animals were identified using a rapid pitch change and occurred primarily within 20 m of the surface. Two simultaneously tagged animals maneuvered similarly when tail slapping within 60 s of one another, indicating that the position and composition of the herring ball influenced their behavior. Two types of behavioral sequence preceding the tight circling of carousel feeding were apparent. First, the animals engaged in periods of directional swimming; they were silent in 2 of 3 instances, suggesting they may have located other foraging groups by eavesdropping. Second, tagged animals made broad horizontal loops as they dove, in a manner consistent with corralling. All 4 of these occasions were accompanied by vocal activity, indicating that corralling and tail slapping may benefit from social communication. No significant relationship between call types and the actual movement measurements was found.

    Killer whale vocalizations traditionally have been classified into discrete call types. Using human speech processing techniques, I considered instead that calls are comprised of shared segments that can be recombined to form the stereotyped and variable repertoire. In a classification experiment, characterizing calls by the whole call, a set of unshared segments, or a set of shared segments yielded equivalent performance. The shared segments required less information to parse the same vocalizations, suggesting a more parsimonious system of representation. This closer examination of the movements and vocalizations of Norwegian killer whales, combined with future work on ontogeny and transmission, will inform our understanding of whether and how culture plays a role in achieving population-specific behaviors in this species.

    by Ari Daniel Shapiro. Ph.D.