    Speech recognition systems and Russian pronunciation variation in the context of VoiceInteraction

    This thesis describes the work performed during an internship for the master's degree in Linguistics at VoiceInteraction, an international Artificial Intelligence (AI) company specializing in speech processing technologies. The goal of the internship was to study the phonetic characteristics of Russian through four main tasks: description of the phonetic-phonological inventory; validation of transcriptions of broadcast news; validation of a previously created lexicon composed of the ten thousand (10 000) most frequently observed words in a text corpus crawled from the websites of Russian reference newspapers; and integration of filled pauses into the Automatic Speech Recognizer (ASR). Initially, audio and text broadcast news media were collected from Russian-speaking regions (European Russia, Belarus, and the Caucasus region), featuring different varieties of Russian. The extracted data and the company's existing data were used to train the acoustic, pronunciation, and language models. The audio data was automatically processed in a proprietary platform and then revised by human annotators. Transcriptions produced automatically and reviewed by annotators were analyzed, and the most common errors were extracted to provide feedback to the community of annotators. The validation of transcriptions, along with the annotation of all disfluencies (previously left out), resulted in a decrease in Word Error Rate (WER) in most cases. In some cases (in European Russian transcriptions), WER increased: the models were not sufficiently effective at recognizing the correct, potentially problematic, words. Audio with overlapped speech, disfluencies, and acoustic events can also affect WER. Because the model used to recognize the other varieties of Russian was trained only on European Russian, WER was high for Belarus and the Caucasus region.
The Russian phonetic-phonological inventory was characterized and pronunciation rules for internal and external sandhi phenomena were constructed in order to validate the lexicon: the ten thousand most frequently observed words in a text corpus crawled from the websites of Russian reference newspapers were revised and modified to extract linguistic patterns for use in a statistical Grapheme-to-Phone (G2P) model. Two evaluations were conducted, one before and one after the modifications to the lexicon. Preliminary results, obtained without training the model, show no significant change: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. However, we observed a slight improvement in the recognition of the most frequent words. In the future, we aim to extend the analysis of the lexicon to all 400 000 entries (total lexicon size), analyze the types of errors produced, decrease the WER, and analyze the acoustic models as well. In this work, we also studied filled pauses, since we believe that research on filled pauses in Russian can improve VoiceInteraction's recognition system by reducing processing time and increasing quality. Filled pauses are marked in the transcriptions with "%". In Russian, according to the literature (Ten, 2015; Harlamova, 2008; Bogdanova-Beglarian & Baeva, 2018), these are %a [a], %am [am], %@ [ə], %@m [əm], %e [e], %ɨ [ɨ], %m [m], and %n [n]. In the speech data, two more filled pauses were found, namely %na [na] and %mna [mna], which, as far as we know, are not yet referenced in the literature. Finally, the work performed during the internship contributed to a European project, Artificial Intelligence and Advanced Data Analysis for Authority Agencies (AIDA).
The main goal of the AIDA project is to build a solution capable of automating the processing of the large amounts of data that Law Enforcement Agencies (LEAs) have to analyze when investigating terrorism and cybercrime, using pioneering machine learning and artificial intelligence methods. VoiceInteraction's main contribution to the project was to apply ASR and validate the transcriptions of Russian religious-related content. All the tasks performed during the thesis were relevant to, and applied within, the scope of the AIDA project. Transcription analysis results from the AIDA project showed a high Out-of-Vocabulary (OOV) rate and a high substitution (SUBS) rate. Since the language model used in this project was adapted for broadcast content, the religious-related words were left out. Function words were also incorrectly recognized, in most cases because of coarticulation with the preceding or following word.
This thesis describes the work carried out during an internship in computational linguistics at VoiceInteraction, a speech processing technology company. Since it began operating, the company has developed its own technology in several areas of computational speech processing, including speech synthesis, natural language processing, and automatic speech recognition, the last of which is the company's main business area. VoiceInteraction's automatic speech recognition technology uses hybrid models in combination with deep neural networks (DNNs), which, according to Lüscher et al. (2019), perform better than end-to-end models alone.
The main goal of the internship was to study the phonetics of Russian through four tasks: creating the phonetic-phonological inventory; validating transcriptions of news broadcasts; validating the previously created lexicon; and integrating filled pauses into the system. Initially, material from the main media outlets (audio and text) was collected, covering different varieties of Russian, namely those of European Russia, Belarus, and the Central Caucasus. In European Russia, Russian is the official language; in Belarus, it is one of the country's official languages; and in the Central Caucasus region it is used as a lingua franca, since it was spoken in the Soviet Union and is still spoken in post-Soviet regions today. The aim was the widest possible coverage of Russian, but so far it has only been possible to collect data for the varieties mentioned. The data extracted so far, together with the data already held by the company, were used to train the acoustic, pronunciation, and language models. The audio data were loaded into the company's proprietary platform, Calligraphus, which provides a transcription interface for human annotators and also suggests an automatic transcription of the same content in order to reduce the annotators' effort. The transcriptions were then analyzed to ensure that the annotation system created by VoiceInteraction had been followed, marking all speech disfluencies (phenomena characteristic of speech editing), such as prolongations, filled pauses, and repetitions, and transcribing the speech as faithfully as possible.
Next, systematic errors were analyzed and exported, both to give human annotators guidance and suggestions for improvement and to improve the performance of the recognition system. After validating the transcriptions and annotating all disfluencies (previously left out), we observed a decrease in WER in most cases, as expected. In some cases, however, WER increased: despite the corrections made to the analyzed files, the models were not effective enough at recognizing the correct, potentially problematic, words. The high WER in audio of political debates is related to a higher frequency of overlapping speech and disfluencies (e.g., filled pauses, prolongations). The model used to recognize all varieties was trained only on European Russian, so a high WER was also observed for the varieties of Belarus and the Caucasus region. Based on data collected by the company, the Russian phonetic-phonological inventory was also characterized and described, and pronunciation rules were built for internal and external sandhi phenomena (Shcherba, 1957; Litnevskaya, 2006; Lekant, 2007; Popov, 2014). The company already used, through a statistical G2P specific to Russian, a phonetic inventory for Russian consistent with the literature cited above, but it had not yet been validated. It was verified and corrected on the basis of the characterization of the phones in the Russian lexicon and of ecological data obtained from Russian speakers in a variety of communicative situations. Validating the phonetic-phonological inventory in turn allowed the Russian lexicon to be validated.
The lexicon was built on the basis of a set of features (e.g., the grapheme in unstressed position is pronounced as the phone [I] and in stressed position as [i]; the grapheme in word-final position is pronounced as [-voiced], [f]; among other features) and was organized by frequency of use. In total, the ten thousand (10 000) most frequent Russian words were checked, based on statistics from the analysis of a repository of news articles previously collected from Russian-language reference newspapers. The recognition system was evaluated before and after the modification of the ten thousand most frequent words in the lexicon: 19.85% WER before the modifications and 19.97% WER after, a difference of 0.12 percentage points. The preliminary results, obtained without training the model, show no significant change; however, we observed a slight improvement in the recognition of the most frequent words, such as function words, acronyms, verbs, and nouns. Based on these results and on the rules derived from correcting the ten thousand words, we intend in the future to extend the rules to the whole lexicon, which comprises four hundred thousand (400 000) entries. After the validation of the transcriptions and the lexicon, and drawing on the literature, Russian filled pauses were also analyzed for integration into the recognition system. The interest in including filled pauses in the automatic recognizer stems mainly from the fact that they are difficult to identify automatically and may be substituted or may affect adjacent sequences. According to the company's annotation system, filled pauses are marked in the transcription with the percent sign, %.
The Russian filled pauses found in the literature were %a [a], %am [am] (Rose, 1998; Ten, 2015), %@ [ə], %@m [əm] (Bogdanova-Beglarian & Baeva, 2018), %e [e], %ɨ [ɨ], %m [m], and %n [n] (Harlamova, 2008). In the audio data available on the platform, two more were found in addition to those mentioned, namely %na [na] and %mna [mna], which, as far as we know, have not yet been described in the literature. At present, all the filled pauses mentioned are part of the automatic speech recognition models for Russian. The work carried out during the internship, that is, the validation of the company's existing data, was applied to the European project AIDA (Artificial Intelligence and Advanced Data Analysis for Authority Agencies). The main goal of that project is to create a solution capable of detecting possible cybercrime and terrorism offences using machine learning methods. VoiceInteraction's main contribution to the project was the application of ASR and the validation of transcriptions of Russian religious-related content. All the tasks performed during the thesis were relevant to, and applied within, the scope of the AIDA project. The results of validating the project's transcriptions showed a high Out-of-Vocabulary (OOV) rate and a high substitution (SUBS) rate. Since the language model used in this project was adapted to news content, it did not contain the religion-related words. Function words were also incorrectly recognized, in most cases because of coarticulation with the preceding or following word.
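The WER figures quoted in this abstract can be read as a word-level edit distance divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming whitespace tokenization and equal cost for substitutions, deletions, and insertions. The abstract does not describe the scoring tool actually used at VoiceInteraction, so the function name `word_error_rate` and the example strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace and all three error
    types (substitution, deletion, insertion) cost 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))
# A filled pause marked with "%" counts as a word: dropping it is a deletion.
# The transliterated words here are made up for illustration.
print(word_error_rate("%a privet mir", "privet mir"))
```

Under this definition, annotating filled pauses in the reference makes every unrecognized pause count as an error, which is one plausible reason why adding them to the recognizer's vocabulary matters for the reported scores.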

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    Détection et caractérisation des régions d'erreurs dans des transcriptions de contenus multimédia : application à la recherche des noms de personnes

    In this article, we propose to detect and characterize error regions in automatic transcriptions of multimedia content. The simultaneous detection and characterization of error regions can be seen as a sequence labeling task, for which we compare sequential approaches (segmentation followed by classification) with an integrated approach. We compare the performance of our system on two different corpora while varying the training data. We are particularly interested in errors on person names, information that is essential in many information extraction applications. The results confirm the value of a learning-based method that exploits the context in which errors occur.
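To make the sequence labeling framing concrete, the sketch below encodes error regions as one tag per transcribed word, so that detection (where a region starts and ends) and characterization (what kind of error it is) are expressed in a single label sequence. The BIO tagging scheme, the function name `to_bio`, and the labels `NAME` and `OTHER` are assumptions chosen for illustration; the abstract does not state which tag set or encoding the authors used.

```python
from typing import List, Optional


def to_bio(word_labels: List[Optional[str]]) -> List[str]:
    """Convert per-word error types into BIO tags.

    word_labels holds None for a correctly transcribed word and an error
    type string (hypothetical labels, e.g. "NAME") for an erroneous one.
    Limitation of this sketch: two adjacent regions of the same type are
    merged into one region.
    """
    tags = []
    previous = None
    for label in word_labels:
        if label is None:
            tags.append("O")            # outside any error region
        elif label == previous:
            tags.append(f"I-{label}")   # continues the current region
        else:
            tags.append(f"B-{label}")   # starts a new region
        previous = label
    return tags


# Five words: a two-word person-name error region, then a one-word region
# of another type.
print(to_bio([None, "NAME", "NAME", None, "OTHER"]))
```

An integrated approach in the sense of the abstract would predict such combined tags directly, whereas a sequential approach would first predict the region boundaries and then assign a type to each region.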

    Accessing spoken interaction through dialogue processing

    Written language is one of our primary means of documenting our lives, achievements, and environment. Our capabilities to record, store, and retrieve audio, still pictures, and video are undergoing a revolution and may support, supplement, or even replace written documentation of human communication such as meetings. This technology enables us to record information that would otherwise be lost, lower the cost of documentation, and enhance high-quality documents with original audiovisual material. Indexing the audio material is the key technology for realizing those benefits.
This work presents effective alternatives to keyword-based indices which restrict the search space and may in part be calculated with very limited resources. Indexing speech documents can be done at various levels. Stylistically, a document belongs to a certain database, which can be determined automatically with high accuracy using very simple features. The resulting search space reduction factor is on the order of 4-10, while topic classification yielded a factor of 18 in a news domain. Since documents can be very long, they need to be segmented into topical regions. A new probabilistic segmentation framework as well as new features (speaker initiative and style) give results comparable to or better than traditional keyword-based methods. At the topical segment level, activities (storytelling, discussing, planning, ...) can be detected using a neural network, though with limited accuracy; even human annotators do not annotate them very reliably. A maximum search space reduction factor of 6 is theoretically possible on the databases used. A topical classification of these regions was attempted on one database; the detection accuracy for that index, however, was very low. At the utterance level, dialogue acts such as statements, questions, and backchannels (aha, yeah, ...) are recognized using a novel discriminatively trained Hidden Markov Model (HMM) procedure. The procedure can be extended to recognize short sequences such as question/answer pairs, so-called dialogue games. Dialogue acts and games are useful for building classifiers for speaking style. Similarly, a user may remember a certain dialogue act sequence and search for it in a graphical representation. In a study with very pessimistic assumptions, users were able to pick one out of four similar and equiprobable meetings with an accuracy of about 43% using graphical activity information. Dialogue acts may be useful in this situation as well, but the sample size did not allow final conclusions to be drawn.
However, the user study fails to show any effect for detailed basic features such as formality or speaker identity.

    Articulatory features for conversational speech recognition

    Multimedia Retrieval

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Automatic speech recognition of Cantonese-English code-mixing utterances.

    Chan Yeuk Chi Joyce. Thesis (M.Phil.), Chinese University of Hong Kong, 2005. Includes bibliographical references. Abstracts in English and Chinese.
    Chapter 1: Introduction
      1.1 Background
      1.2 Previous Work on Code-switching Speech Recognition
        1.2.1 Keyword Spotting Approach
        1.2.2 Translation Approach
        1.2.3 Language Boundary Detection
      1.3 Motivations of Our Work
      1.4 Methodology
      1.5 Thesis Outline
      1.6 References
    Chapter 2: Fundamentals of Large Vocabulary Continuous Speech Recognition for Cantonese and English
      2.1 Basic Theory of Speech Recognition
        2.1.1 Feature Extraction
        2.1.2 Maximum a Posteriori (MAP) Probability
        2.1.3 Hidden Markov Model (HMM)
        2.1.4 Statistical Language Modeling
        2.1.5 Search Algorithm
      2.2 Word Posterior Probability (WPP)
      2.3 Generalized Word Posterior Probability (GWPP)
      2.4 Characteristics of Cantonese
        2.4.1 Cantonese Phonology
        2.4.2 Variation and Change in Pronunciation
        2.4.3 Syllables and Characters in Cantonese
        2.4.4 Spoken Cantonese vs. Written Chinese
      2.5 Characteristics of English
        2.5.1 English Phonology
        2.5.2 English with Cantonese Accents
      2.6 References
    Chapter 3: Code-mixing and Code-switching Speech Recognition
      3.1 Introduction
      3.2 Definition
        3.2.1 Monolingual Speech Recognition
        3.2.2 Multilingual Speech Recognition
        3.2.3 Code-mixing and Code-switching
      3.3 Conversation in Hong Kong
        3.3.1 Language Choice of Hong Kong People
        3.3.2 Reasons for Code-mixing in Hong Kong
        3.3.3 How Does Code-mixing Occur?
      3.4 Difficulties for Code-mixing - Specific to Cantonese-English
        3.4.1 Phonetic Differences
        3.4.2 Phonology difference
        3.4.3 Accent and Borrowing
        3.4.4 Lexicon and Grammar
        3.4.5 Lack of Appropriate Speech Corpus
      3.5 References
    Chapter 4: Data Collection
      4.1 Data Collection
        4.1.1 Corpus Design
        4.1.2 Recording Setup
        4.1.3 Post-processing of Speech Data
      4.2 A Baseline Database
        4.2.1 Monolingual Spoken Cantonese Speech Data (CUMIX)
      4.3 References
    Chapter 5: System Design and Experimental Setup
      5.1 Overview of the Code-mixing Speech Recognizer
        5.1.1 Bilingual Syllable / Word-based Speech Recognizer
        5.1.2 Language Boundary Detection
        5.1.3 Generalized Word Posterior Probability (GWPP)
      5.2 Acoustic Modeling
        5.2.1 Speech Corpus for Training of Acoustic Models
        5.2.2 Features Extraction
        5.2.3 Variability in the Speech Signal
        5.2.4 Language Dependency of the Acoustic Models
        5.2.5 Pronunciation Dictionary
        5.2.6 The Training Process of Acoustic Models
        5.2.7 Decoding and Evaluation
      5.3 Language Modeling
        5.3.1 N-gram Language Model
        5.3.2 Difficulties in Data Collection
        5.3.3 Text Data for Training Language Model
        5.3.4 Training Tools
        5.3.5 Training Procedure
        5.3.6 Evaluation of the Language Models
      5.4 Language Boundary Detection
        5.4.1 Phone-based LBD
        5.4.2 Syllable-based LBD
        5.4.3 LBD Based on Syllable Lattice
      5.5 Integration of the Acoustic Model Scores, Language Model Scores and Language Boundary Information
        5.5.1 Integration of Acoustic Model Scores and Language Boundary Information
        5.5.2 Integration of Modified Acoustic Model Scores and Language Model Scores
        5.5.3 Evaluation Criterion
      5.6 References
    Chapter 6: Results and Analysis
      6.1 Speech Data for Development and Evaluation
        6.1.1 Development Data
        6.1.2 Testing Data
      6.2 Performance of Different Acoustic Units
        6.2.1 Analysis of Results
      6.3 Language Boundary Detection
        6.3.1 Phone-based Language Boundary Detection
        6.3.2 Syllable-based Language Boundary Detection (SYL LB)
        6.3.3 Language Boundary Detection Based on Syllable Lattice (BILINGUAL LBD)
        6.3.4 Observations
      6.4 Evaluation of the Language Models
        6.4.1 Character Perplexity
        6.4.2 Phonetic-to-text Conversion Rate
        6.4.3 Observations
      6.5 Character Error Rate
        6.5.1 Without Language Boundary Information
        6.5.2 With Language Boundary Detector SYL LBD
        6.5.3 With Language Boundary Detector BILINGUAL-LBD
        6.5.4 Observations
      6.6 References
    Chapter 7: Conclusions and Suggestions for Future Work
      7.1 Conclusion
        7.1.1 Difficulties and Solutions
      7.2 Suggestions for Future Work
        7.2.1 Acoustic Modeling
        7.2.2 Pronunciation Modeling
        7.2.3 Language Modeling
        7.2.4 Speech Data
        7.2.5 Language Boundary Detection
      7.3 References
    Appendix A: Code-mixing Utterances in Training Set of CUMIX
    Appendix B: Code-mixing Utterances in Testing Set of CUMIX
    Appendix C: Usage of Speech Data in CUMIX

    Feature extraction and event detection for automatic speech recognition

