Comparative study of Arabic and French statistical language models
In this paper, we propose a comparative study of statistical language models of Arabic and French. The objective of this study is to understand how to better model both Arabic and French. Several experiments using different smoothing techniques have been carried out. For French, trigram models are most appropriate whatever the smoothing technique used. For Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more efficient. Tests are carried out with comparable corpora and vocabularies in terms of size.
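As a rough illustration of the Witten-Bell smoothing the abstract refers to, the sketch below (my own illustration, not code from the paper; the bigram scope and function names are assumptions) interpolates an observed bigram estimate with a unigram backoff, weighting by the number of distinct continuation types seen after each history:

```python
from collections import Counter, defaultdict

def train_witten_bell(tokens):
    """Bigram model with Witten-Bell smoothing, backing off to unigrams."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    hist_count = Counter(h for (h, _) in zip(tokens, tokens[1:]))
    followers = defaultdict(set)            # distinct types seen after each history
    for (h, w) in bigrams:
        followers[h].add(w)
    total = sum(unigrams.values())
    vocab = set(tokens)

    def p_unigram(w):
        return unigrams[w] / total

    def p_bigram(w, h):
        c_h, t_h = hist_count[h], len(followers[h])
        if c_h == 0:                        # unseen history: pure backoff
            return p_unigram(w)
        lam = c_h / (c_h + t_h)             # Witten-Bell interpolation weight
        return lam * bigrams[(h, w)] / c_h + (1 - lam) * p_unigram(w)

    return p_bigram, vocab
```

The interpolation weight gives more mass to the backoff distribution for histories with many distinct continuations, which is the intuition behind Witten-Bell discounting.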
Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information
This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that allow using data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.
An Investigation on Cognitive-Linguistic Skills of English-Chinese Bilingual Learners with and without Dyslexia in Singapore
This thesis investigates dyslexia and the cognitive-linguistic skills, namely phonological awareness,
orthographic knowledge, morphological awareness and rapid naming, of bilingual learners in
Singapore whose first language is English and second language is Chinese. The two main research aims
are to investigate whether the English-Chinese bilingual learners with dyslexia diagnosed only in
English are weaker than their typical counterparts in reading and all cognitive-linguistic skills in both
languages or either language, and to investigate which cognitive-linguistic skills are strong predictors
of reading in each language. Results show that the bilingual learners with dyslexia performed
significantly worse than their typical counterparts in reading and all cognitive-linguistic skills in both
languages, although their dyslexia was diagnosed only in English. All English cognitive-linguistic skills
were found to be predictive of English word reading, with morphological awareness and orthographic
knowledge playing unique predictive roles after rapid naming and phonological awareness were
controlled. However, only rapid naming and morphological awareness were found to be predictive of
Chinese word reading. These results suggest that dyslexia may manifest differently in the reading and
cognitive-linguistic skills of English and Chinese in English-Chinese bilingual learners, given the two
different predictive models with different empirically and theoretically supported orderings of
cognitive-linguistic skills as predictors of reading development in the two languages. The difference in
the unique contributions of the four cognitive-linguistic skills underlying reading development in the
two languages may reflect differences in language structure and instruction.
Keywords: dyslexia, bilingualism, English reading, Chinese reading, cognitive-linguistic skill
Spoken term detection ALBAYZIN 2014 evaluation: overview, systems, results, and discussion
The electronic version of this article is the complete one and can be found online at http://dx.doi.org/10.1186/s13636-015-0063-8. Spoken term detection (STD) aims at retrieving data from a speech repository given a textual representation of the search term. It is currently receiving much interest due to the large volume of multimedia information. STD differs from automatic speech recognition (ASR) in that ASR is interested in all the terms/words that appear in the speech data, whereas STD focuses on a selected list of search terms that must be detected within the speech data. This paper presents the systems submitted to the STD ALBAYZIN 2014 evaluation, held as part of the ALBAYZIN 2014 evaluation campaign within the context of the IberSPEECH 2014 conference. This is the first STD evaluation that deals with the Spanish language. The evaluation consists of retrieving the speech files that contain the search terms, indicating their start and end times within the appropriate speech file, along with a score value that reflects the confidence given to the detection of the search term. The evaluation is conducted on a Spanish spontaneous speech database, which comprises a set of talks from workshops and amounts to about 7 h of speech. We present the database, the evaluation metrics, the systems submitted to the evaluation, the results, and a detailed discussion. Four different research groups took part in the evaluation. Evaluation results show reasonable performance for a moderate out-of-vocabulary term rate. This paper compares the systems submitted to the evaluation and presents an in-depth analysis based on several search term properties (term length, in-vocabulary/out-of-vocabulary terms, single-word/multi-word terms, and in-language/foreign terms). This work has been partly supported by project CMC-V2 (TEC2012-37585-C02-01) from the Spanish Ministry of Economy and Competitiveness. This research was also funded by the European Regional Development Fund and the Galician Regional Government (GRC2014/024, "Consolidation of Research Units: AtlantTIC Project", CN2012/160).
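To make the detection output the evaluation asks for concrete, here is a minimal sketch (my own illustration, not the evaluation's actual tooling; field and function names are assumptions) of a detection record carrying the required fields, plus the hard YES/NO decision step applied at a score threshold:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    term: str         # textual representation of the search term
    audio_file: str   # speech file said to contain the term
    start: float      # start time within the file, in seconds
    end: float        # end time within the file, in seconds
    score: float      # confidence the system assigns to the detection

def decide(detections, threshold):
    """Turn scored detections into hard YES/NO decisions at a threshold."""
    return [(d, d.score >= threshold) for d in detections]
```

Systems are typically compared by sweeping this threshold, trading missed detections against false alarms.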
Pronunciation modelling in end-to-end text-to-speech synthesis
Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve
high-quality naturalness scores without extensive processing of text-input. Since S2S
models have been proposed for multiple stages of the TTS pipeline, the field has focused
on condensing the pipeline toward End-to-End (E2E) TTS, where a waveform
is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS
in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation
(lexicon-lookup and/or G2P modelling) could be implicitly learnt in a text-encoder
during training. The benefits of a learned text encoding include improved modelling
of phonetic context, which makes the contextual linguistic features traditionally used in
TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar
naturalness scores with text- or phone-input (e.g. as in [4]). Successful modelling
of phonetic context has led some to question the benefit of using phone- instead of
text-input altogether (see [5]).
The use of text-input brings into question the value of the pronunciation lexicon
in E2E-TTS. Without phone-input, an S2S encoder learns an implicit grapheme-to-phoneme
(G2P) model from text-audio pairs during training. With common datasets
for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates
compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation
is difficult for some words (e.g. foreign words and proper names) since
the knowledge to disambiguate their pronunciations may not be provided by the local
grapheme context and may require knowledge beyond that contained in sentence-level
text-audio sequences. When test stimuli were selected according to G2P difficulty,
increased mispronunciations in E2E-TTS with text-input were observed. Following
the proposed benefits of subword decomposition in S2S modelling in other language
tasks (e.g. neural machine translation), the effects of morphological decomposition
were investigated on pronunciation modelling. Learning of the French post-lexical
phenomenon liaison was also evaluated.
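As a toy illustration of why a lexicon remains useful as a pronunciation guide for words whose pronunciation is not recoverable from local grapheme context (my own sketch; the entries and ARPAbet-style phone symbols are invented, and the rule-based fallback merely stands in for a learned implicit G2P model):

```python
# Toy lexicon: entries and phone symbols are illustrative only.
LEXICON = {
    "cat":   ["K", "AE", "T"],
    "suite": ["S", "W", "IY", "T"],   # not predictable from graphemes alone
}

def naive_g2p(word):
    """Crude letter-to-phone rules, standing in for an implicit learned G2P."""
    rules = {"c": "K", "a": "AE", "t": "T", "s": "S",
             "u": "UW", "i": "IY", "e": "EH"}
    return [rules.get(ch, ch.upper()) for ch in word.lower()]

def phonetise(word):
    """Prefer the lexicon; fall back to rules only for out-of-vocabulary words."""
    return LEXICON.get(word.lower()) or naive_g2p(word)
```

For "suite", the grapheme-driven fallback produces a spelling pronunciation, while the lexicon supplies the correct phones, mirroring the failure mode described above for foreign words and proper names.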
With the goal of an inexpensive, large-scale evaluation of pronunciation modelling,
the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility
was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge
was conducted. ASR reliably found significant differences between systems similar to
those found by paid listeners under controlled conditions in English. An analysis of transcriptions for
words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR
Transformer model used proved unreliable on these words, transcribing them
either as homophones or simply incorrectly.
A further evaluation of representation mixing in Tacotron finds
pronunciation correction is possible when mixing text- and phone-inputs. The thesis
concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a
pronunciation guide since it can provide assurances that G2P generalisation cannot
- …