
    A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition

    We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.

    The diachronic emergence of retroflex segments in three languages

    The present study shows that although retroflex segments can be considered articulatorily marked, there are perceptual reasons why languages introduce this class into their phoneme inventory. This observation is illustrated with the diachronic developments of retroflexes in Norwegian (North Germanic), Nyawaygi (Australian) and Minto-Nenana (Athapaskan). The developments in these three languages are modelled in a perceptually oriented phonological theory, since traditional articulatorily-based features cannot deal with such processes.

    How speaker tongue and name source language affect the automatic recognition of spoken names

    In this paper the automatic recognition of person names and geographical names uttered by native and non-native speakers is examined in an experimental set-up. The major aim was to improve our understanding of how well, and under which circumstances, previously proposed methods of multilingual pronunciation modeling and multilingual acoustic modeling contribute to better name recognition in a cross-lingual context. To arrive at a meaningful interpretation of the results, we categorized each language according to the amount of exposure a native speaker is expected to have had to it. After interpreting our results, we also tried to answer the question of how much further improvement might be attainable with a more advanced pronunciation modeling technique which we plan to develop.

    Towards an automatic speech recognition system for use by deaf students in lectures

    According to the Royal National Institute for Deaf People there are nearly 7.5 million hearing-impaired people in Great Britain. Human-operated machine transcription systems, such as Palantype, achieve low word error rates in real-time. The disadvantage is that they are very expensive to use because of the difficulty in training operators, making them impractical for everyday use in higher education. Existing automatic speech recognition systems also achieve low word error rates, the disadvantage being that they only work for read speech in a restricted domain. Moving a system to a new domain requires a large amount of relevant data for training acoustic and language models. The adopted solution makes use of an existing continuous speech phoneme recognition system as a front-end to a word recognition sub-system. The sub-system generates a lattice of word hypotheses using dynamic programming with robust parameter estimation obtained using evolutionary programming. Sentence hypotheses are obtained by parsing the word lattice using a beam search and contributing knowledge consisting of anti-grammar rules, which check the syntactic incorrectness of word sequences, and word frequency information. On an unseen spontaneous lecture taken from the Lund Corpus and using a dictionary containing 2,637 words, the system achieved 81.5% words correct with 15% simulated phoneme error, and 73.1% words correct with 25% simulated phoneme error. The system was also evaluated on 113 Wall Street Journal sentences.
The achievements of the work are a domain-independent method, using the anti-grammar, to reduce the word lattice search space whilst allowing normal spontaneous English to be spoken; a system designed to allow integration with new sources of knowledge, such as semantics or prosody, providing a test-bench for determining the impact of different knowledge upon word lattice parsing without the need for the underlying speech recognition hardware; and the robustness of the word lattice generation using parameters that withstand changes in vocabulary and domain.
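The anti-grammar idea above, penalizing hypotheses whose word sequences are known to be syntactically incorrect during a beam search over the lattice, can be sketched as follows. This is a minimal illustration, not the thesis system: the lattice encoding, the costs, and the bigram form of the anti-rules are all simplifying assumptions.

```python
import heapq

def parse_lattice(lattice, anti_rules, beam_width=5):
    """Beam-search a word lattice, penalising hypotheses whose last
    bigram matches an anti-grammar rule (a known-bad word sequence)."""
    # lattice: node -> list of (word, cost, next_node); node 0 is the start,
    # nodes with no outgoing edges are final. Lower cost is better.
    beam = [(0.0, 0, [])]              # (accumulated cost, node, word history)
    complete = []
    while beam:
        candidates = []
        for cost, node, words in beam:
            edges = lattice.get(node, [])
            if not edges:              # reached a final node
                complete.append((cost, words))
                continue
            for word, score, nxt in edges:
                # penalise transitions that violate an anti-grammar rule
                penalty = 1.0 if words and (words[-1], word) in anti_rules else 0.0
                candidates.append((cost + score + penalty, nxt, words + [word]))
        beam = heapq.nsmallest(beam_width, candidates)   # prune to beam width
    return min(complete)[1] if complete else []
```

The anti-rules act as soft constraints: a bad bigram inflates the path cost rather than deleting the hypothesis outright, so the search degrades gracefully when all paths are imperfect.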

    Voice technology and BBN

    The following research was discussed: (1) speech signal processing; (2) automatic speech recognition; (3) continuous speech understanding; (4) speaker recognition; (5) speech compression; (6) subjective and objective evaluation of speech communication systems; (7) measurement of the intelligibility and quality of speech when degraded by noise or other masking stimuli; (8) speech synthesis; (9) instructional aids for second-language learning and for training of the deaf; and (10) investigation of speech correlates of psychological stress. Experimental psychology, control systems, and human factors engineering, which are often relevant to the proper design and operation of speech systems, are also described.

    Do (and say) as I say: Linguistic adaptation in human-computer dialogs

    © Theodora Koulouri, Stanislao Lauria, and Robert D. Macredie. This article has been made available through the Brunel Open Access Publishing Fund. There is strong research evidence showing that people naturally align to each other’s vocabulary, sentence structure, and acoustic features in dialog, yet little is known about how the alignment mechanism operates in the interaction between users and computer systems, let alone how it may be exploited to improve the efficiency of the interaction. This article provides an account of lexical alignment in human–computer dialogs, based on empirical data collected in a simulated human–computer interaction scenario. The results indicate that alignment is present, resulting in the gradual reduction and stabilization of the vocabulary-in-use, and that it is also reciprocal. Further, the results suggest that when system and user errors occur, the development of alignment is temporarily disrupted and users tend to introduce novel words to the dialog. The results also indicate that alignment in human–computer interaction may have a strong strategic component and is used as a resource to compensate for less optimal (visually impoverished) interaction conditions. Moreover, lower alignment is associated with less successful interaction, as measured by user perceptions. The article distills the results of the study into design recommendations for human–computer dialog systems and uses them to outline a model of dialog management that supports and exploits alignment through mechanisms for in-use adaptation of the system’s grammar and lexicon.
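One crude way to quantify the lexical alignment described above is to measure, per exchange, how much of the user's vocabulary has already been used by the system. This is a hypothetical proxy metric, not the measure used in the article:

```python
def alignment_scores(system_turns, user_turns):
    """Per-exchange fraction of the user's words already used by the
    system in the dialog so far -- a crude proxy for lexical alignment.
    Assumes the system speaks first in each exchange."""
    system_vocab = set()
    scores = []
    for system, user in zip(system_turns, user_turns):
        system_vocab |= set(system.lower().split())   # accumulate system vocabulary
        words = set(user.lower().split())
        scores.append(len(words & system_vocab) / len(words) if words else 0.0)
    return scores
```

A rising score over the course of a dialog would be consistent with the vocabulary-in-use converging; a sudden drop after an error would match the novel-word behaviour the study reports.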

    Modelo acústico de língua inglesa falada por portugueses (Acoustic model of the English language spoken by Portuguese speakers)

    Master's project report in Computer Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. In the context of robust speech recognition based on Hidden Markov Models (HMMs), this work describes methodologies and experiments aimed at recognizing foreign speakers. Speech recognition necessarily involves acoustic models, which reflect how we pronounce and articulate a language by modelling the sequence of sounds emitted during speech. This modelling rests on minimal speech segments, the phones, for which there are sets of symbols/alphabets representing their pronunciation; the representation of these symbols, and their articulation and pronunciation, are studied in articulatory and acoustic phonetics. Words can be described by analysing their constituent units, the phones. A speech recognizer interprets the input signal, the speech, as a sequence of coded symbols. To do so, the signal is split into observations of roughly 10 milliseconds each, reducing the analysis window to an interval within which the characteristics of a sound segment do not vary. Acoustic models give the probability that a given observation corresponds to a given entity; it is therefore through models of the vocabulary entities to be recognized that these sound fragments can be reassembled. The models developed in this work are based on HMMs, so called because they build on Markov chains (Markov, 1856-1922): sequences of states in which each state is conditioned on its predecessor. Applied to our domain, this means building a set of models, one for each class of sounds to be recognized, which are trained on training data.
The data are audio files and their word-level transcriptions, so that each transcription can be decomposed into phones and aligned with each sound in the corresponding audio file. Using a state model, in which each state represents an observation or described speech segment, the data are regrouped to build increasingly reliable statistical models representing the speech entities of a given language. Recognition of foreign speakers, whose pronunciation differs from the language the recognizer was built for, can be a serious problem for a recognizer's accuracy. This variation can be even more problematic than dialectal variation within a language, because it depends on each speaker's knowledge of the foreign language. Using a small amount of audio from foreign speakers to train new acoustic models, several experiments were carried out with corpora of Portuguese speakers speaking English, of European Portuguese, and of English. Initially, the behaviour of the native-English and native-Portuguese models was explored separately on the test corpora (a native test set and a non-native test set). Next, another model was trained using as a combined training corpus the audio of Portuguese speakers of English together with that of native English speakers. A further experiment used adaptation techniques, namely MLLR (Maximum Likelihood Linear Regression). This technique adapts an initial model to a particular speaker characteristic, in this case the foreign accent: from a small amount of data representing the characteristic to be modelled, it computes a set of transformations that are applied to the model being adapted.
Phonetic modelling was also explored, studying how the foreign speaker pronounces the foreign language, in this case a Portuguese speaker speaking English. This study was carried out with the help of a linguist, who defined a phone set, the result of mapping the English phone inventory onto Portuguese, representing English as spoken by Portuguese speakers of a given prestige group. Given the great variability of pronunciations, this group had to be defined according to the speakers' literacy level. The study was then used to train a new model on the corpora of Portuguese speakers of English and of native Portuguese speakers, yielding a native-Portuguese recognizer in which English terms can also be recognized. Within the speech recognition theme, this project also addressed the collection of European Portuguese corpora and the compilation of a European Portuguese lexicon. For corpus acquisition, the author was involved in extracting and preparing telephone speech data for the later training of new European Portuguese acoustic models. The lexicon was compiled with a semi-automatic incremental method: pronunciations were generated automatically for batches of ten thousand words, each batch was reviewed and corrected by a linguist, and each reviewed batch was then used to improve the automatic pronunciation-generation rules. The tremendous growth of technology has increased the need to integrate spoken language technologies into our daily applications, providing easy and natural access to information. These applications are of different natures, with different user interfaces.
Besides voice-enabled Internet portals or tourist information systems, automatic speech recognition systems can be used in the home, where the TV and other appliances could be voice controlled, discarding keyboard or mouse interfaces, or in mobile phones and palm-sized computers for hands-free and eyes-free operation. The development of these systems faces several known difficulties. One of them concerns the recognizer's accuracy in dealing with non-native speakers with different phonetic pronunciations of a given language. A non-native accent can be more problematic than a dialect variation of the language. This mismatch depends on the individual's speaking proficiency and mother tongue. Consequently, when the speaker's native language is not the same as the one used to train the recognizer, there is a considerable loss in recognition performance. In this thesis, we examine the problem of non-native speech in a speaker-independent, large-vocabulary recognizer for which a small amount of non-native data was used in training. Several experiments were performed using Hidden Markov models trained with speech corpora containing European Portuguese native speakers, English native speakers, and English spoken by European Portuguese native speakers. Initially, the behaviour of a native-English model and a non-native English speakers' model was explored. Then, using different corpus weights for the English native speakers and the English spoken by Portuguese speakers, a model was trained as a pool of accents. Among adaptation techniques, the Maximum Likelihood Linear Regression (MLLR) method was used. It was also explored how European Portuguese speakers pronounce the English language, studying the correspondences between the phone sets of the foreign and target languages. The result was a new phone set, a consequence of the mapping between the English and Portuguese phone sets.
Then a new model was trained with data of English spoken by Portuguese speakers and Portuguese native data. Concerning the speech recognition subject, this work had two other purposes: collecting Portuguese corpora and supporting the compilation of a Portuguese lexicon, adopting methods and algorithms to generate phonetic pronunciations automatically. The collected corpora were processed in order to train acoustic models to be used in the Exchange 2007 domain, namely in Outlook Voice Access.
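The MLLR adaptation mentioned above shifts the Gaussian means of a speaker-independent model toward a target accent via an affine transform, mu' = A·mu + b, estimated from a small adaptation set. A minimal sketch of the idea, assuming a single regression class, no variance adaptation, and a plain least-squares estimate in place of the full maximum-likelihood solution:

```python
import numpy as np

def estimate_transform(means, adapted_means):
    """Least-squares estimate of an affine transform (A, b) mapping
    speaker-independent Gaussian means onto accent-specific targets.
    Simplified stand-in for MLLR: one regression class only."""
    n, d = means.shape
    X = np.hstack([means, np.ones((n, 1))])       # extended means [mu; 1]
    W, *_ = np.linalg.lstsq(X, adapted_means, rcond=None)
    return W[:d].T, W[d]                          # A (d x d), b (d,)

def apply_transform(means, A, b):
    """Shift every Gaussian mean: mu' = A @ mu + b."""
    return means @ A.T + b
```

Because only (A, b) are estimated rather than every Gaussian parameter, a few minutes of accented speech can move the whole model, which matches the thesis's setting of having only a small amount of non-native data.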