
164 research outputs found

Prosodic modules for speech recognition and understanding in VERBMOBIL

Author: Batliner Anton
Hess Wolfgang
Kießling Andreas
Kompe Ralf
Nöth Elmar
Petzold Anja
Reyelt Matthias
Strom Volker
Publication venue: Sonstige Einrichtungen. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz
Publication date: 01/01/1996

Within VERBMOBIL, a large project on spoken language research in Germany, two modules for detecting and recognizing prosodic events have been developed. One module operates on speech signal parameters and the word hypothesis graph, whereas the other module, designed for a novel, highly interactive architecture, uses only speech signal parameters as its input. Phrase boundaries, sentence modality, and accents are detected. In spontaneous dialogs, recognition rates reach up to 82.5% for accents and up to 91.7% for phrase boundaries.

CiteSeerX

Speech recognition systems and Russian pronunciation variation in the context of VoiceInteraction

Author: Havras Anna
Publication date: 15/03/2023

The present thesis describes the work performed during an internship for the master's degree in Linguistics at VoiceInteraction, an international Artificial Intelligence (AI) company specializing in speech processing technologies. The goal of the internship was to study the phonetic characteristics of the Russian language through four main tasks: description of the phonetic-phonological inventory; validation of transcriptions of broadcast news; validation of a previously created lexicon composed of the ten thousand (10 000) most frequently observed words in a text corpus crawled from Russian reference newspapers' websites; and integration of filled pauses into the Automatic Speech Recognizer (ASR). Initially, audio and text broadcast news media were collected from Russian-speaking regions (European Russia, Belarus, and the Caucasus region), featuring different varieties of Russian. The extracted data and the company's existing data were used to train the acoustic, pronunciation, and language models. The audio data was automatically processed in a proprietary platform and then revised by human annotators. Transcriptions produced automatically and reviewed by annotators were analyzed, and the most common errors were extracted to provide feedback to the community of annotators. The validation of transcriptions, along with the annotation of all disfluencies (which had previously been left out), decreased the Word Error Rate (WER) in most cases. In some cases (in European Russian transcriptions), WER increased: the models were not effective enough at identifying the correct, potentially problematic words. Audio with overlapped speech, disfluencies, and acoustic events can also raise the WER. Because the model used to recognize the other varieties of Russian was trained only on European Russian, WER was high for Belarus and the Caucasus region.
The characterization of the Russian phonetic-phonological inventory and the construction of pronunciation rules for internal and external sandhi phenomena were performed in order to validate the lexicon. The ten thousand most frequently observed words in a text corpus crawled from Russian reference newspapers' websites were revised and modified so that linguistic patterns could be extracted for use in a statistical grapheme-to-phone (G2P) model. Two evaluations were conducted, one before the modifications to the lexicon and one after. Preliminary results, obtained without training the model, show no significant change: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. However, we observed a slight improvement for the most frequent words. In the future, we aim to extend the analysis of the lexicon to all 400 000 entries (the total lexicon size), analyze the types of errors produced, decrease the WER, and analyze the acoustic models as well. In this work, we also studied filled pauses, since we believe that research on filled pauses in Russian can improve VoiceInteraction's recognition system by reducing processing time and increasing quality. Filled pauses are marked in the transcriptions with “%”. In Russian, according to the literature (Ten, 2015; Harlamova, 2008; Bogdanova-Beglarian & Baeva, 2018), these are %a [a], %am [am], %@ [ə], %@m [əm], %e [e], %ɨ [ɨ], %m [m], and %n [n]. In the speech data, two more filled pauses were found, namely %na [na] and %mna [mna], which, as far as we know, are not yet referenced in the literature. Finally, the work performed during the internship contributed to a European project, Artificial Intelligence and Advanced Data Analysis for Authority Agencies (AIDA).
The main goal of the AIDA project is to build a solution capable of automating the processing of the large amounts of data that Law Enforcement Agencies (LEAs) have to analyze in terrorism and cybercrime investigations, using pioneering machine learning and artificial intelligence methods. VoiceInteraction's main contribution to the project was to apply ASR and validate the transcriptions of Russian (religion-related content). All the tasks performed during the thesis were highly relevant to this work and were applied within the scope of the AIDA project. Transcription analysis results from the AIDA project showed a high Out-of-Vocabulary (OOV) rate and a high substitution (SUBS) rate. Since the language model used in this project was adapted for broadcast content, religion-related words were left out. In addition, function words were incorrectly recognized, in most cases because of coarticulation with the previous or the following word.

This thesis describes the work carried out during an internship in computational linguistics at VoiceInteraction, a speech processing technology company. Since it began operating, the company has devoted itself to developing its own technology in several areas of computational speech processing, among them speech synthesis, natural language processing, and automatic speech recognition, the last of which is the company's main line of business. VoiceInteraction's automatic speech recognition technology uses hybrid models in combination with deep neural networks (DNNs), which, according to Lüscher et al. (2019), perform better than end-to-end models alone.
The main goal of the internship was to study the phonetics of Russian through four tasks: creating the phonetic-phonological inventory; validating transcriptions of news broadcasts; validating the previously created lexicon; and integrating filled pauses into the system. Initially, material from the main media outlets (audio and text) was collected, representing different varieties of Russian, namely those of European Russia, Belarus, and the Central Caucasus. In European Russia, Russian is the official language; in Belarus, Russian is one of the country's official languages; and in the Central Caucasus region, Russian is used as a lingua franca, since it was spoken in the Soviet Union and is still spoken today in the post-Soviet regions. The aim was to cover as much of the Russian language as possible, but so far it has only been possible to collect data for the varieties mentioned. The data extracted so far, together with the data the company already held, were used to train the acoustic, pronunciation, and language models. To process the audio data, the recordings were loaded into the company's proprietary platform, Calligraphus, which provides a transcription interface for human annotators and also suggests an automatic transcription of the same content in order to reduce the annotators' effort. The transcriptions were then analyzed to ensure that the annotation scheme created by VoiceInteraction had been followed, marking all speech disfluencies (phenomena characteristic of speech editing), such as prolongations, filled pauses, and repetitions, among others, and transcribing the speech as faithfully as possible.
Next, the systematic errors were analyzed and exported in order to give the human annotators guidance and suggestions for improvement and, in turn, to improve the performance of the recognition system. After the transcriptions were validated and all disfluencies annotated (previously they had been left out), we observed a decrease in WER in most cases, as expected. In some cases, however, we observed an increase in WER. Despite the corrections made to the files analyzed, the models were not effective enough at recognizing the correct, potentially problematic words. The high WER in audio of political debates is related to a higher frequency of overlapping speech and disfluencies (e.g., filled pauses, prolongations). The model used to recognize all varieties was trained only on European Russian, and so a high WER was also observed for the varieties of Belarus and the Caucasus region. Working from data collected by the company, a characterization and description of the Russian phonetic-phonological inventory was also carried out, along with the construction of pronunciation rules for internal and external sandhi phenomena (Shcherba, 1957; Litnevskaya, 2006; Lekant, 2007; Popov, 2014). The company already used, through a statistical G2P specific to Russian, a phonetic inventory for Russian that corresponded to the literature cited above, but it had not yet been validated. It was possible to check and correct it based on the characterization of the phones in the Russian lexicon and on ecological data obtained from Russian speakers in a variety of communicative situations. Validating the phonetic-phonological inventory in turn allowed the Russian lexicon to be validated.
The lexicon was built on a set of features (e.g., the grapheme in unstressed position is pronounced as the phone [I] and in stressed position as [i]; the grapheme in word-final position is pronounced as [- voiced], that is, [f]; among other features) and was organized by frequency of use. In total, the ten thousand (10 000) most frequent Russian words were checked, based on statistics from the analysis of content in a repository of news articles previously collected from Russian-language reference newspapers. The recognition system was evaluated before and after the ten thousand most frequent words in the lexicon were modified: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. The preliminary results, obtained without training the model, show no significant change; however, we observed a slight improvement in the recognition of the most frequent words, such as function words, acronyms, verbs, and nouns, among others. Building on these results and on the rules created from the correction of the ten thousand words, we intend in the future to extend them to the whole lexicon, which comprises four hundred thousand (400 000) entries. After the transcriptions and the lexicon were validated, it was also possible, based on the literature, to analyze Russian filled pauses for integration into the recognition system. The interest in including pauses in the automatic recognizer stemmed mainly from the fact that these mechanisms are difficult to identify automatically and may be substituted or may affect adjacent sequences. Under the company's annotation scheme, filled pauses are marked in the transcription with the percent sign, %.
The Russian filled pauses found in the literature were %a [a], %am [am] (Rose, 1998; Ten, 2015), %@ [ə], %@m [əm] (Bogdanova-Beglarian & Baeva, 2018), %e [e], %ɨ [ɨ], %m [m], and %n [n] (Harlamova, 2008). In the audio data available on the platform, two more filled pauses were found in addition to those mentioned, namely %na [na] and %mna [mna], which, as far as we know, have not yet been described in the literature. At present, all of the filled pauses mentioned are already part of the automatic speech recognition models for Russian. The work carried out during the internship, that is, the validation of the company's existing data, was applied to the European project AIDA (Artificial Intelligence and Advanced Data Analysis for Authority Agencies). The main goal of that project is to create a solution capable of detecting possible cybercrime and terrorism offenses using machine learning methods. VoiceInteraction's main contribution to the project was applying ASR and validating the transcriptions of Russian (religion-related content). To that end, all the tasks performed during the thesis were highly relevant and were applied within the AIDA project. The results of validating the project's transcriptions showed a high out-of-vocabulary (OOV) rate and a high substitution (SUBS) rate. Because the language model used in this project was adapted to news content, religion-related words were absent from it. In addition, function words were incorrectly recognized, in most cases because of coarticulation with the preceding or the following word.
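The WER figures quoted in this abstract (19.85% before the lexicon changes, 19.97% after) are ratios of word-level edit operations to reference length. As a rough illustration of how such a figure is typically computed, here is a minimal sketch of the standard Levenshtein-based definition. It is not VoiceInteraction's scoring tool; the function name and the toy sentences are assumptions made for the example.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy example (hypothetical strings): one substitution and one deletion
# against a four-word reference.
print(word_error_rate("a b c d", "a x c"))

# The change reported in the abstract, in percentage points.
print(round(19.97 - 19.85, 2))
```

Because WER divides by the reference length only, insertions can push it above 100%, which is one reason small differences such as the one reported here are usually read alongside the substitution, deletion, and insertion counts.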

Universidade de Lisboa: Repositório.UL

Pause behaviour within reformulations and the proficiency level of second language learners of English

Author: Malgorzata Korko
Simon A. Williams
Publication venue: Cambridge University Press (CUP)
Publication date: 14/05/2019

This research reports on a quantitative analysis of the combination of two types of disfluency, reformulations and pauses, in the speech of lower intermediate and advanced speakers of English as a second language (L2). The present study distinguishes between corrections and false starts within the category of reformulations as well as between silent and filled pauses. It focuses on the extent to which pause behavior within reformulations varies according to the stage of L2 development and the type of reformulation used. An analysis was made of 56 L2 speakers’ 2-min monologues. The results showed that lower intermediate and advanced speakers differed in the frequency of silent pauses inserted in corrections but not in their frequency in false starts. This suggests that false starts depend less on proficiency level and may reflect temporary problems with conceptual encoding or extralinguistic factors that contribute to the efficacy of L2 production rather than difficulties with linguistic processing per se. The frequency of silent pauses, rather than silent pause duration or the frequency and duration of filled pauses, appeared to be the only marker differentiating false starts from corrections across the two proficiency groups.

Kysyvän funktion vaikutus spontaanin ja luetun suomen intonaatioon

Author: Anttila Hanna
Publication venue: Helsingin yliopisto
Publication date: 01/01/2008

Goals: This study aims to map the effect of interrogative function on the intonation of spontaneous and read Finnish. Earlier research shows that the most prominent feature in Finnish question intonation is an appeal to the listener. Question word questions typically start with a high peak which is followed by falling intonation. In yes/no questions, F0 remains at a high level until the word carrying sentence stress and then falls. Final rises are mainly found in intonation clichés such as "Ai mitä?" ("What?"). These earlier results are based on read speech and enacted dialogues. In this study, questions and statements found in spontaneous dialogues were compared. These utterances were also compared with read versions of the same utterances. Fundamental frequency values were compared using a mixed model. Contours were also grouped using auditory and visual inspection. Thus it was possible to compare frequencies of contour types according to utterance type and speech style. The position of questions in the F0 distribution of the whole material was also investigated in this study. Method: The material consisted of four spontaneous dialogues and their read versions. The speakers were young adults from the Helsinki metropolitan area, four females and four males. The whole material was first divided into broad dialogue function categories arising from the material, and F0 curves were calculated for each category. After this, 277 questions and 244 statements were selected for closer inspection. Values reflecting F0 distribution and contour shape were measured from the F0 contours of these utterances. A mixed model was used to analyse the differences. Utterance type, question type, speech style and speaker gender were used as fixed effects. The frequencies of F0 contour types were compared using a chi-square test. Additional material in this study came from eight young female speakers in central Finland.
Results and conclusions: In the mixed model analysis, significant differences were found both between questions and statements and between spontaneous and read speech. Generally, utterance type affected the variables reflecting contour type while speech style affected the variables reflecting F0 distribution. The effect of question type was not clearly visible. In read speech the contours resembled earlier results more closely. Speakers had different strategies for differentiating between questions and statements. In the whole material, F0 was slightly higher in questions than in statements. The effect of dialectal background could be seen in the contour types. The results show that interrogative function affects intonation in both spontaneous and read Finnish.

Goals: The purpose of the study is to determine how interrogative function affects the intonation of spontaneous and read Finnish. Earlier studies show that the most prominent feature of Finnish question intonation is an appeal to the listener. Question-word questions typically have a high initial peak, after which the fundamental frequency falls. By contrast, in -kO (yes/no) questions the fundamental frequency stays high up to the word carrying sentence stress and falls only after it. A final rise occurs mainly in fixed expressions such as "Ai mitä?" Earlier results are based on read speech and acted dialogues. The study compared questions and statements found in spontaneous speech with each other. A second point of comparison was the same utterances as read speech. Fundamental frequency values measured from the utterances were compared using a statistical multilevel model. In addition, the contours were classified into types on the basis of auditory and visual inspection. This made it possible to compare the frequencies of contour types by utterance type and speech style. The study also examined the position of questions in the fundamental frequency distribution of the whole material. Methods: The research material consisted of four dialogues and read renditions of the transcribed turns.
The speakers were young adults from the Helsinki metropolitan area. Each gender was represented by four speakers. First, the whole material was divided into broad, data-driven dialogue function categories, for which fundamental frequency curves were calculated in their entirety. Then 277 questions and 244 statements were selected for closer study. Fundamental frequency curves were calculated for these utterances, and summary measures describing distribution and shape were taken from them. A statistical multilevel model was used to look for factors explaining the differences in these measurements. Utterance type, question type, speech style, and speaker gender were used as explanatory variables. The occurrence of contour types was compared using a chi-square test. The supplementary material was read speech from eight female speakers from central Finland. Results and conclusions: In the multilevel modeling, significant differences were found both between questions and statements and between spontaneous and read speech. Utterance type affected contour shape in particular, while speech style affected the fundamental frequency distribution. The effect of question type was not clearly visible in material of this size. In read speech, the contours more clearly resembled the results of earlier studies. Different speakers had different ways of distinguishing questions from statements. Across the whole material, fundamental frequency was slightly higher in questions than in statements. The effect of dialect background showed up as a different distribution of contour types among the central Finnish speakers. The results show that interrogative function affects intonation in both spontaneous and read Finnish.

Helsingin yliopiston digitaalinen arkisto

A TUTORIAL ON FORMANT-BASED SPEECH SYNTHESIS FOR THE DOCUMENTATION OF CRITICALLY ENDANGERED LANGUAGES

Author: Koffi Ettien
Petzold Mark
Publication venue: The Repository at St. Cloud State
Publication date: 24/03/2022

Smaller languages, that is, those spoken by 5,000 people or fewer, are dying at an alarming rate (Krauss 1992). Many are disappearing without having been studied acoustically. The methodology discussed in this paper can help build formant-based speech synthesis systems for the documentation and revitalization of these languages. Developing Text-to-Speech (TTS) functionalities for use in smart devices can breathe new life into dying languages (Crystal 2000). In the first tutorial on this topic, Koffi (2020) explained how the Arpabet transcription system can be expanded for use in African languages and beyond. In the present tutorial, Koffi and Petzold lay the foundations for formant-based speech synthesis patterned after Klatt (1980) and Klatt and Klatt (1990). Betine (ISO 639-3: eot), a critically endangered language in Côte d’Ivoire, West Africa, is used to illustrate the processes involved in building a speech synthesis system from the ground up for moribund languages. The steps include constructing a language model, a speaker model, a software model, and an intonation model, extracting relevant acoustic phonetic data, and coding them. Ancillary topics such as text normalization, downsampling, and bandwidth calculations are also discussed.
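For readers unfamiliar with the Klatt (1980) approach the abstract refers to, the core building block is a second-order digital resonator whose coefficients are set by a formant frequency and its bandwidth. The sketch below illustrates that standard resonator only; it is not the tutorial's own code, and the function names and the example values (a 500 Hz formant with a 60 Hz bandwidth at a 10 kHz sampling rate) are assumptions chosen for illustration, not measurements from Betine.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Coefficients A, B, C of y[n] = A*x[n] + B*y[n-1] + C*y[n-2]."""
    t = 1.0 / sample_rate_hz  # sampling period in seconds
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter a list of samples through a single formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = 0.0  # previous output, y[n-1]
    y2 = 0.0  # output before that, y[n-2]
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Hypothetical usage: pass a unit impulse through one resonator to obtain
# a decaying oscillation near the chosen formant frequency.
impulse = [1.0] + [0.0] * 99
response = resonate(impulse, freq_hz=500.0, bandwidth_hz=60.0, sample_rate_hz=10000.0)
print(len(response))
```

In a cascade formant synthesizer, several such resonators (one per formant) are chained so that the output of one feeds the next. A wider bandwidth makes each resonance decay faster, which is why bandwidth values matter alongside formant frequencies.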

St. Cloud State University