36 research outputs found

    Speech recognition systems and Russian pronunciation variation in the context of VoiceInteraction

    The present thesis describes the work performed during an internship for the master's degree in Linguistics at VoiceInteraction, an international Artificial Intelligence (AI) company specializing in speech processing technologies. The goal of the internship was to study the phonetic characteristics of the Russian language through four main tasks: description of the phonetic-phonological inventory; validation of transcriptions of broadcast news; validation of a previously created lexicon composed of the ten thousand (10 000) most frequently observed words in a text corpus crawled from the websites of Russian reference newspapers; and integration of filled pauses into the Automatic Speech Recognizer (ASR). Initially, audio and text broadcast news media were collected from Russian-speaking regions (European Russia, Belarus, and the Caucasus region), featuring different varieties of Russian. The extracted data and the company's existing data were used to train the acoustic, pronunciation, and language models. The audio data was automatically processed in a proprietary platform and then revised by human annotators. Transcriptions produced automatically and reviewed by annotators were analyzed, and the most common errors were extracted to provide feedback to the community of annotators. The validation of transcriptions, along with the annotation of all disfluencies (previously left out), resulted in a decrease in Word Error Rate (WER) in most cases. In some cases (in European Russian transcriptions), WER increased: the models were not sufficiently effective at identifying the correct, potentially problematic words. Audio with overlapped speech, disfluencies, and acoustic events can also affect the WER. Since the model used to recognize the other varieties of Russian was trained only on European Russian, WER was high for Belarus and the Caucasus region.
The characterization of the Russian phonetic-phonological inventory and the construction of pronunciation rules for internal and external sandhi phenomena were carried out to validate the lexicon: the ten thousand most frequently observed words in a text corpus crawled from the websites of Russian reference newspapers were revised and modified in order to extract linguistic patterns for use in a statistical grapheme-to-phone (G2P) model. Two evaluations were conducted, one before the modifications to the lexicon and one after. Preliminary results without training the model show no significant change: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. However, we observed a slight improvement in the recognition of the most frequent words. In the future, we aim to extend the analysis to all 400 000 entries of the lexicon, analyze the types of errors produced, decrease the WER, and analyze the acoustic models as well. In this work, we also studied filled pauses, since we believe that research on filled pauses in Russian can improve VoiceInteraction's recognition system by reducing processing time and increasing quality. Filled pauses are marked in the transcriptions with "%". In Russian, according to the literature (Ten, 2015; Harlamova, 2008; Bogdanova-Beglarian & Baeva, 2018), these are %a [a], %am [am], %@ [ə], %@m [əm], %e [e], %ɨ [ɨ], %m [m], and %n [n]. In the speech data, two more filled pauses were found, namely %na [na] and %mna [mna], which, as far as we know, are not yet referenced in the literature. Finally, the work performed during the internship contributed to a European project, Artificial Intelligence and Advanced Data Analysis for Authority Agencies (AIDA).
The main goal of that project is to build a solution capable of automating the processing of the large amounts of data that Law Enforcement Agencies (LEAs) have to analyze in investigations of terrorism and cybercrime, using pioneering machine learning and artificial intelligence methods. VoiceInteraction's main contribution to the project was to apply ASR and validate the Russian transcriptions (religion-related content). To that end, all the tasks performed during the thesis were highly relevant and were applied within the scope of the AIDA project. Transcription analysis results from the AIDA project showed a high Out-of-Vocabulary (OOV) rate and a high substitution (SUBS) rate. Since the language model used in this project was adapted for broadcast content, religion-related words were left out. Function words were also incorrectly recognized, in most cases because of coarticulation with the previous or the following word.
The present thesis describes the work carried out during an internship in computational linguistics at VoiceInteraction, a speech processing technology company. Since the start of its activity, the company has been dedicated to developing its own technology in several areas of computational speech processing, among them speech synthesis, natural language processing, and automatic speech recognition, the last of which is the company's main business area. VoiceInteraction's automatic speech recognition technology uses hybrid models in combination with deep neural networks (DNNs), which, according to Lüscher et al. (2019), perform better than purely end-to-end models.
The main goal of the internship was to study the phonetics of Russian through four tasks: creating the phonetic-phonological inventory; validating transcriptions of news broadcasts; validating the previously created lexicon; and integrating filled pauses into the system. Initially, the main media outlets (audio and text) were collected, covering different varieties of Russian, namely those of European Russia, Belarus, and the Central Caucasus. In European Russia, Russian is the official language; in Belarus, it is one of the country's official languages; and in the Central Caucasus region, it is used as a lingua franca, since it was spoken in the Soviet Union and is still spoken today in the post-Soviet regions. The aim was to achieve the widest possible coverage of the Russian language, and so far it has only been possible to collect data for the varieties mentioned. The data extracted so far, together with the data the company already held, were used to train the acoustic, pronunciation, and language models. To process the audio data, it was loaded into the company's proprietary platform, Calligraphus, which, besides providing a transcription interface for human annotators, also suggests an automatic transcription of the same content in order to reduce the effort annotators spend on the task. Next, the transcriptions were analyzed to ensure that the annotation system created by VoiceInteraction had been followed, marking all speech disfluencies (phenomena characteristic of speech editing), such as prolongations, filled pauses, and repetitions, among others, and transcribing the speech as close to reality as possible.
Subsequently, the systematic errors were analyzed and exported, both to give human annotators guidance and suggestions for improvement and to improve the performance of the recognition system. After the validation of the transcriptions, together with the annotation of all disfluencies (previously left out), we observed a decrease in WER in most cases, as expected. In some cases, however, we observed an increase in WER. Despite the corrections made to the analyzed files, the models were not sufficiently effective at recognizing the correct, potentially problematic words. The high WER in audio of political debates is related to a higher frequency of overlapped speech and disfluencies (e.g., filled pauses, prolongations). The model used to recognize all varieties was trained only on European Russian, and so a high WER was also observed for the varieties of Belarus and the Caucasus region. From a perspective based on data collected by the company, a characterization and description of the Russian phonetic-phonological inventory was also carried out, along with the construction of pronunciation rules for internal and external sandhi phenomena (Shcherba, 1957; Litnevskaya, 2006; Lekant, 2007; Popov, 2014). The company already used, through a statistical G2P specific to Russian, a phonetic inventory for Russian corresponding to the literature cited above, but it had not yet been validated. It was possible to verify and correct it, based on the characterization of the phones in the Russian lexicon and on ecological data obtained from Russian speakers in diverse communicative situations. The validation of the phonetic-phonological inventory in turn allowed the validation of the Russian lexicon.
The lexicon was built on a set of characteristics (e.g., the grapheme in unstressed position is pronounced as the phone [I] and in stressed position as [i]; the grapheme in word-final position is pronounced as [- voiced], [f]; among other characteristics) and was organized by frequency of use. In total, the ten thousand (10 000) most frequent Russian words were checked, based on statistics from the analysis of content in a repository of news articles previously collected from Russian-language reference newspapers. The recognition system was evaluated before and after the modification of the ten thousand most frequently occurring words in the lexicon: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. The preliminary results, without training the model, show no significant change; however, we observed a slight improvement in the recognition of the most frequent words, such as function words, acronyms, verbs, and nouns, among others. Based on these results and on the rules created from the correction of the ten thousand words, we intend in future to extend them to the whole lexicon, which comprises four hundred thousand (400 000) entries. After the validation of the transcriptions and the lexicon, and based on the literature, it was also possible to analyze Russian filled pauses for integration into the recognition system. The interest in including filled pauses in the automatic recognizer was mainly due to these mechanisms being difficult to identify automatically and liable to be substituted or to affect adjacent sequences. According to the company's annotation system, filled pauses are marked in the transcription with the percent sign, %.
The Russian filled pauses found in the literature were %a [a], %am [am] (Rose, 1998; Ten, 2015), %@ [ə], %@m [əm] (Bogdanova-Beglarian & Baeva, 2018), %e [e], %ɨ [ɨ], %m [m], and %n [n] (Harlamova, 2008). In the audio data available on the platform, besides the filled pauses mentioned, two more were found, namely %na [na] and %mna [mna], which, as far as we know, have not yet been described in the literature. At present, all of the filled pauses mentioned are already part of the automatic speech recognition models for Russian. The work carried out during the internship, that is, the validation of the company's existing data, was applied to the European project AIDA (Artificial Intelligence and Advanced Data Analysis for Authority Agencies). The main goal of that project is to create a solution capable of detecting possible cybercrime and terrorism offences using machine learning methods. VoiceInteraction's main contribution to the project was applying ASR and validating the Russian transcriptions (religion-related content). To that end, all the tasks carried out during the thesis were highly relevant and were applied within the AIDA project. The results of validating the project's transcriptions showed a high Out-of-Vocabulary (OOV) rate and a high substitution (SUBS) rate. Since the language model used in this project was adapted to news content, religion-related words were not in it. In addition, function words were incorrectly recognized, in most cases because of coarticulation with the preceding or following word.
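
The WER figures quoted in this abstract are word-level error rates. As a minimal sketch, assuming the standard definition (substitutions plus deletions plus insertions, divided by the number of reference words) rather than VoiceInteraction's own scoring tool, which the abstract does not describe, WER can be computed from a word-level edit distance. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the thesis.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and exact string match per word;
    real scoring tools usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four words.
print(word_error_rate("a b c d", "a x c"))
```

Because the denominator is the reference length, insertions can push WER above 100%, and a change such as 19.85% to 19.97% is best read as a difference in percentage points.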

    Detecting early signs of dementia in conversation

    Dementia can affect a person's speech, language and conversational interaction capabilities. The early diagnosis of dementia is of great clinical importance. Recent studies using the qualitative methodology of Conversation Analysis (CA) demonstrated that communication problems may be picked up during conversations between patients and neurologists and that this can be used to differentiate between patients with Neuro-degenerative Disorders (ND) and those with non-progressive Functional Memory Disorder (FMD). However, conducting manual CA is expensive and difficult to scale up for routine clinical use. This study introduces an automatic approach for processing such conversations which can help in identifying the early signs of dementia and distinguishing them from the other clinical categories (FMD, Mild Cognitive Impairment (MCI), and Healthy Control (HC)). The dementia detection system starts with a speaker diarisation module to segment an input audio file (determining who talks when). The segmented files are then passed to an automatic speech recogniser (ASR) to transcribe the utterances of each speaker. Next, the feature extraction unit extracts a number of features (CA-inspired, acoustic, lexical and word vector) from the transcripts and audio files. Finally, a classifier is trained on these features to determine the clinical category of the input conversation. Moreover, we investigate replacing the neurologist in the conversation with an Intelligent Virtual Agent (IVA) that asks similar questions. We show that despite differences between the IVA-led and the neurologist-led conversations, the results achieved by the IVA are as good as those gained by the neurologists. Furthermore, the IVA can be used for administering more standard cognitive tests, like verbal fluency tests, and for producing automatic scores, which can then boost the performance of the classifier.
The final blind evaluation of the system shows that the classifier can identify early signs of dementia with an acceptable level of accuracy and robustness (considering both sensitivity and specificity).

    Enhancing Listening and Spoken Skills in Spanish Connected Speech for Anglophones

    Native speech is directed towards native listeners, not designed for comprehension and analysis by language learners. Speed of delivery, or economy of effort, produces a speech signal to which the native listener can assign the correct words. There are no discrete words in the speech signal itself; therefore, learners often face a linguistic barrier in dealing with the local spoken language. The creation, development and application of the Dynamic Spanish Speech Corpus (DSSC) facilitated an empirically based appreciation of speaking speed and prosody as obstacles to intelligibility for learners of Spanish. "Duologues" are natural, relaxed dialogues recorded in such a manner that each interlocutor's performance can be studied in isolation, thus avoiding problems normally caused by cross-talk and back-channelling. They made possible the identification of the key phonetic features of informal native-native dialogue and, ultimately, the creation of high-quality assets and research data based on natural (unscripted) dialogues recorded at industry audio standards. These assets were used in this study, which involved documenting productive and receptive intelligibility problems when L2 users are exposed to the Spanish speech of native speakers. The aim was to observe where intelligibility problems occur and to determine the reasons for this, based on effects of the first language of the subjects, and other criteria, such as number of years learning/using Spanish, previous exposure to spoken Spanish and gender. This was achieved by playing recorded extracts (snippets) from the DSSC to which a time-scaling tool was applied.

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The proceedings of the MAVEBA Workshop, which is held every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images, as a support to clinical diagnosis and classification of vocal pathologies.

    Robust speaker diarization for meetings

    This doctoral thesis presents research carried out in the area of speaker diarization for meeting rooms. It studies the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings, where more than one microphone is usually available for processing. The bulk of the research was done during a two-year stay at the International Computer Science Institute (ICSI, Berkeley, California). Speaker diarization has been widely studied for the domain of radio and television recordings. Most of the proposed systems use some kind of hierarchical clustering of the data into acoustic groups, where neither the optimal number of speakers nor their identities are known beforehand. A very commonly used method is called "bottom-up clustering", in which many acoustic groups of data are defined initially and then merged iteratively until the optimal number of groups is reached, subject to a stopping criterion. All these systems are based on the analysis of a single input channel, which prevents their direct application to meetings. Moreover, many of these algorithms need to train models or tune system parameters using external data, which makes them hard to apply to data different from that used for adaptation. The implementation proposed in this thesis aims to solve the problems mentioned above. It takes as its starting point the existing ICSI speaker diarization system based on bottom-up clustering. First, the available recording channels are processed to obtain a single audio channel of higher quality, along with information about the position of the speakers.
Then a speech/silence detection system that requires no prior training is implemented, and the resulting speech segments are processed with an improved version of the single-channel speaker diarization system. This system has been modified to use speaker position information (when available), and algorithms have been adapted or newly created so that the system obtains as much information as possible directly from the acoustic signal, making it less dependent on development data. The resulting system is flexible and can be used in any type of meeting room, whatever the number of microphones or their position. The system also requires no training data at all, which makes it simpler to adapt to different types of data or application domains. Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted, with excellent results, to the NIST RT05s and RT06s Rich Transcription evaluations for meetings, where they were evaluated on data from two different subdomains (lectures and meetings). In addition, experiments using all available data from the RT evaluations are carried out to demonstrate the viability of the proposed algorithms for this task.
This thesis presents research on speaker diarization for meeting rooms. It looks into the algorithms and the implementation of an offline speaker segmentation and clustering system for meeting recordings, where more than one microphone is usually available. The main research and system implementation were done while visiting the International Computer Science Institute (ICSI, Berkeley, California) for a period of two years. Speaker diarization is a well-studied topic in the domain of broadcast news recordings.
Most of the proposed systems involve some sort of hierarchical clustering of the data into clusters, where the optimum number of speakers and their identities are unknown a priori. A very commonly used method is called bottom-up clustering, where multiple initial clusters are iteratively merged until the optimum number of clusters is reached, according to some stopping criterion. Such systems are based on a single-channel input, which does not allow direct application to the meetings domain. Although some efforts have been made to adapt such systems to multichannel data, at the start of this thesis no effective implementation had been proposed. Furthermore, many of these speaker diarization algorithms involve some sort of model training or parameter tuning using external data, which impedes their usability with data different from what they have been adapted to. The implementation proposed in this thesis works towards solving the aforementioned problems. Taking the existing hierarchical bottom-up mono-channel speaker diarization system from ICSI, it first uses flexible acoustic beamforming to extract speaker location information and obtain a single enhanced signal from all available microphones. It then implements train-free speech/non-speech detection on that signal and processes the resulting speech segments with an improved version of the mono-channel speaker diarization system. This system has been modified to use speaker location information (when available), and several algorithms have been adapted or newly created to adapt the system behavior to each particular recording by obtaining information directly from the acoustics, making it less dependent on the development data. The resulting system is flexible to any meeting room layout regarding the number of microphones and their placement. It is train-free, making it easy to adapt to different sorts of data and domains of application.
Finally, it takes a step forward in the use of parameters that are more robust to changes in the acoustic data. Two versions of the system were submitted with excellent results to the RT05s and RT06s NIST Rich Transcription evaluations for meetings, where data from two different subdomains (lectures and conferences) was evaluated. In addition, experiments using the RT datasets from all meetings evaluations were run to test the different proposed algorithms, showing their suitability for the task.
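
The bottom-up clustering loop described in this abstract (start with many clusters, merge the closest pair repeatedly, stop when a criterion is met) can be illustrated with a toy sketch. This is not the ICSI system: the abstract does not specify its cluster models, merge metric, or stopping rule, so the sketch swaps in 1-D points, centroid distance, and a fixed distance threshold. The names `bottom_up_cluster` and `stop_distance` are hypothetical.

```python
from itertools import combinations


def bottom_up_cluster(points, stop_distance):
    """Agglomerative (bottom-up) clustering of 1-D points.

    Assumptions for illustration only: each point starts as its own
    cluster, the merge metric is the distance between centroids, and
    merging stops once the closest pair is farther apart than
    `stop_distance`.
    """
    clusters = [[p] for p in points]

    def centroid(cluster):
        return sum(cluster) / len(cluster)

    def gap(pair):
        i, j = pair
        return abs(centroid(clusters[i]) - centroid(clusters[j]))

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(combinations(range(len(clusters)), 2), key=gap)
        if gap((i, j)) > stop_distance:
            break  # stopping criterion met: no pair is close enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical data: three well-separated groups.
print(bottom_up_cluster([0.0, 0.1, 5.0, 5.2, 9.9], stop_distance=1.0))
```

In a diarization setting the points would be speech segments and the distance a model-based comparison between them; the loop structure is the part this sketch is meant to show.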

    Towards Cognizant Hearing Aids: Modeling of Content, Affect and Attention


    Robust Automatic Transcription of Lectures

    The automatic transcription of talks, lectures, and presentations is becoming increasingly important, and it is what makes possible applications such as automatic speech translation, automatic speech summarization, and targeted information retrieval in audio data, and thus easier access to content in digital libraries. Ideally, such a system works with a microphone that frees the speaker from having to wear one, which is the focus of this work.

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications that are able to operate in real-world environments, such as mobile communication services and smart homes.

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) came into being in 1999 from the strongly felt need to share know-how, objectives and results between areas that until then had seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the newborn to the adult and elderly. Over the years the initial topics have grown and spread into other fields of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy. This edition celebrates twenty-two years of uninterrupted and successful research in the field of voice analysis.