
    Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation

    Over the past few years, speech recognition performance on tasks ranging from isolated digit recognition to conversational speech has improved dramatically. Performance on limited recognition tasks in noise-free environments is comparable to that achieved by human transcribers. This advancement in automatic speech recognition technology, along with the increasing compute power of mobile devices, the standardization of communication protocols, and the explosion in the popularity of mobile devices, has created interest in flexible voice interfaces for mobile devices. However, speech recognition performance degrades dramatically in mobile environments, which are inherently noisy. Considerable recent effort has gone into front ends based on advanced noise-robust approaches. The primary objective of this thesis was to analyze the performance of two advanced front ends, referred to as the QIO and MFA front ends, on a speech recognition task based on the Wall Street Journal database. Though the advanced front ends achieve a significant improvement over an industry-standard baseline front end, this improvement is not operationally significant. Further, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end-specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Finally, we also show that mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.
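
    For reference, the "relative" improvements quoted above are relative reductions in word error rate (WER). A minimal sketch of the arithmetic, using hypothetical WER values that are not taken from the thesis:

        def relative_improvement(baseline_wer: float, improved_wer: float) -> float:
            """Relative reduction in word error rate, as a percentage."""
            return 100.0 * (baseline_wer - improved_wer) / baseline_wer

        # Hypothetical WERs chosen only to illustrate the arithmetic:
        # a QIO WER of 14.5% and an MFA WER of 13.1% give ~9.7% relative.
        print(relative_improvement(14.5, 13.1))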

    Recognizing GSM Digital Speech

    The Global System for Mobile (GSM) environment poses three main problems for automatic speech recognition (ASR) systems: noisy scenarios, source coding distortion, and transmission errors. The first has already received much attention; however, source coding distortion and transmission errors must be explicitly addressed. In this paper, we propose an alternative front-end for speech recognition over GSM networks, specially conceived to be effective against source coding distortion and transmission errors. Specifically, we suggest extracting the recognition feature vectors directly from the encoded speech (i.e., the bitstream) instead of decoding it and subsequently extracting the feature vectors. This approach offers two significant advantages. First, the recognition system is only affected by the quantization distortion of the spectral envelope, avoiding the other sources of distortion introduced by the encoding-decoding process. Second, when transmission errors occur, our front-end is more effective since it is not affected by errors in bits allocated to the excitation signal. We have considered the half-rate and full-rate standard codecs and compared the proposed front-end with the conventional approach in two ASR tasks, namely speaker-independent isolated digit recognition and speaker-independent continuous speech recognition. In general, our approach outperforms the conventional procedure for a variety of simulated channel conditions, and the disparity increases as the network conditions worsen.
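
    The core idea, deriving spectral-envelope features from codec parameters rather than from resynthesized speech, can be sketched with the standard LPC-to-cepstrum recursion. This is an illustration of the general bitstream-based approach, not the paper's exact front-end; the coefficients are assumed to have already been dequantized from the bitstream:

        import numpy as np

        def lpc_to_cepstrum(a: np.ndarray, n_ceps: int) -> np.ndarray:
            """Standard LPC-to-cepstrum recursion.

            a: LPC coefficients a_1..a_p of the predictor
               A(z) = 1 - sum_k a_k z^{-k} (assumed dequantized
               from the codec bitstream; sign conventions vary).
            """
            p = len(a)
            c = np.zeros(n_ceps)
            for n in range(1, n_ceps + 1):
                acc = a[n - 1] if n <= p else 0.0
                for k in range(1, n):
                    if n - k <= p:
                        acc += (k / n) * c[k - 1] * a[n - k - 1]
                c[n - 1] = acc
            return c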

    Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages

    This paper provides an overall introduction to our Automatic Speech Recognition (ASR) systems for Southeast Asian languages. As not much existing work has been carried out on such regional languages, a few difficulties must be addressed before building the systems: limited speech and text resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia and Thai as examples to illustrate the strategies for collecting the various resources required for building ASR systems. Published at the 2017 IEEE International Conference on Orange Technologies (ICOT 2017).

    Analysis of Android Device-Based Solutions for Fall Detection

    Falls are a major cause of health and psychological problems, as well as hospitalization costs, among older adults. Thus, the investigation of automatic Fall Detection Systems (FDSs) has received special attention from the research community during the last decade. In this area, the widespread popularity, decreasing price, computing capabilities, built-in sensors, and multiplicity of wireless interfaces of Android-based devices (especially smartphones) have fostered the adoption of this technology for wearable and inexpensive fall detection architectures. This paper presents a critical and thorough analysis of existing fall detection systems based on Android devices. The review systematically classifies and compares the proposals in the literature according to criteria such as system architecture, employed sensors, detection algorithm, and response in case of a fall alarm. The study emphasizes the analysis of the evaluation methods employed to assess the effectiveness of the detection process. The review reveals the complete lack of a reference framework for validating and comparing the proposals. In addition, the study shows that most research works do not evaluate the actual applicability of Android devices (with limited battery and computing resources) to fall detection solutions. Ministerio de Economía y Competitividad TEC2013-42711-
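
    Many of the surveyed detection algorithms reduce to thresholding the accelerometer magnitude: a fall appears as a free-fall dip followed by a sharp impact peak. A minimal sketch of this common scheme follows; the thresholds and window length are illustrative, not taken from any surveyed system:

        import math

        GRAVITY = 9.81                        # m/s^2
        FREE_FALL_THRESHOLD = 0.4 * GRAVITY   # illustrative value
        IMPACT_THRESHOLD = 2.5 * GRAVITY      # illustrative value
        MAX_GAP_SAMPLES = 50                  # impact must follow the dip closely

        def detect_fall(samples):
            """samples: iterable of (ax, ay, az) accelerometer readings in m/s^2."""
            dip_at = None
            for i, (ax, ay, az) in enumerate(samples):
                magnitude = math.sqrt(ax * ax + ay * ay + az * az)
                if magnitude < FREE_FALL_THRESHOLD:
                    dip_at = i  # candidate free-fall phase
                elif dip_at is not None and magnitude > IMPACT_THRESHOLD:
                    if i - dip_at <= MAX_GAP_SAMPLES:
                        return True  # dip followed by impact: report a fall
            return False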

    In Car Audio

    This chapter presents implementations of advanced in-car audio applications. The system is composed of three main applications addressing the in-car listening and communication experience. Starting from a high-level description of the algorithms, several implementations at different levels of hardware abstraction are presented, along with empirical results on both the design process undergone and the performance achieved.

    Acoustic model of English spoken by Portuguese speakers (Modelo acústico de língua inglesa falada por portugueses)

    Master's project in Informatics Engineering, presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. In the context of robust speech recognition based on Hidden Markov Models (HMMs), this work describes methodologies and experiments aimed at the recognition of foreign speakers. Speech recognition necessarily involves acoustic models, which reflect the way we pronounce and articulate a language, modelling the sequence of sounds emitted during speech. This modelling rests on minimal speech segments, the phones, for which there are sets of symbols/alphabets representing their pronunciation. Articulatory and acoustic phonetics study the representation of these symbols and their articulation and pronunciation. Words can be described by analysing the units that constitute them, the phones. A speech recognizer interprets the input signal, the speech, as a sequence of coded symbols. To this end, the signal is fragmented into observations of roughly 10 milliseconds each, reducing the analysis window to the interval in which the characteristics of a sound segment do not vary. Acoustic models give us the probability that a given observation corresponds to a given entity; it is therefore through models of the vocabulary entities to be recognized that these sound fragments can be reassembled. The models developed in this work are based on HMMs, so called because they build on Markov chains (1856-1922): sequences of states in which each state is conditioned on its predecessor. In our domain, this means building a set of models, one for each class of sounds to be recognized, trained on data consisting of audio files and their word-level transcriptions, so that each transcription can be decomposed into phones and aligned with each sound in the corresponding audio file. Using a state model in which each state represents an observation or described speech segment, the data are regrouped to create increasingly reliable statistical models representing the speech entities of a given language. Recognition of foreign speakers, whose pronunciation differs from the language the recognizer was designed for, can be a major problem for a recognizer's accuracy. This variation can be even more problematic than dialectal variation within a language, because it depends on each speaker's knowledge of the foreign language. Using a small amount of audio from foreign speakers to train new acoustic models, several experiments were carried out with corpora of Portuguese speakers speaking English, of European Portuguese, and of English. Initially, the behaviour of native English and native Portuguese models was explored separately when tested against the test corpora (native and non-native test sets). Next, another model was trained using simultaneously, as training corpus, the audio of Portuguese speakers speaking English and that of native English speakers. A further experiment employed adaptation techniques, namely Maximum Likelihood Linear Regression (MLLR), which adapts an initial model to a given speaker characteristic, in this case the foreign accent: with a small amount of data representing the characteristic to be modelled, the technique computes a set of transformations that are applied to the model being adapted. The field of phonetic modelling was also explored, studying how a foreign speaker pronounces the foreign language, in this case a Portuguese speaker speaking English. This study was carried out with the help of a linguist, who defined a set of phones, the result of mapping the English phone inventory onto the Portuguese one, representing the English spoken by Portuguese speakers of a given prestige group; given the great variability of pronunciations, this group had to be defined according to the speakers' level of literacy. The study was later used to create a new model trained with the corpora of Portuguese speakers speaking English and of native Portuguese speakers, yielding a native Portuguese recognizer in which the recognition of English terms is possible. Within the theme of speech recognition, this project also covered the collection of corpora for European Portuguese and the compilation of a European Portuguese lexicon. In corpus acquisition, the author was involved in extracting and preparing telephone speech data for the subsequent training of new European Portuguese acoustic models. The European Portuguese lexicon was compiled with a semi-automatic incremental method: the pronunciations of batches of 10,000 words were generated automatically, each batch was revised and corrected by a linguist, and each revised batch was then used to improve the automatic pronunciation-generation rules.
    The tremendous growth of technology has increased the need to integrate spoken language technologies into our daily applications, providing easy and natural access to information. These applications are of different natures, with different user interfaces. Besides voice-enabled Internet portals or tourist information systems, automatic speech recognition systems can be used in the home, where TVs and other appliances could be voice controlled, discarding keyboard or mouse interfaces, or in mobile phones and palm-sized computers for hands-free and eyes-free manipulation. The development of these systems faces several known difficulties. One of them concerns recognizer accuracy when dealing with non-native speakers, whose phonetic pronunciations of a given language differ; the non-native accent can be more problematic than a dialect variation within the language. This mismatch depends on the individual's speaking proficiency and mother tongue. Consequently, when the speaker's native language is not the same as the one used to train the recognizer, there is a considerable loss in recognition performance. In this thesis, we examine the problem of non-native speech in a speaker-independent, large-vocabulary recognizer for which a small amount of non-native data was used for training. Several experiments were performed using Hidden Markov models trained with speech corpora containing European Portuguese native speakers, English native speakers, and English spoken by European Portuguese native speakers. Initially, the behaviour of a native English model and of a non-native English speakers' model was explored. Then, using different corpus weights for the English native speakers and the English spoken by Portuguese speakers, a model was trained as a pool of accents. Among adaptation techniques, the Maximum Likelihood Linear Regression method was used. We also explored how European Portuguese speakers pronounce the English language, studying the correspondences between the phone sets of the foreign and target languages; the result was a new phone set, a consequence of the mapping between the English and Portuguese phone sets. A new model was then trained with data from English spoken by Portuguese speakers and Portuguese native data. Concerning speech recognition, this work had two further purposes: collecting Portuguese corpora and supporting the compilation of a Portuguese lexicon, adopting methods and algorithms to generate phonetic pronunciations automatically. The collected corpora were processed in order to train acoustic models to be used in the Exchange 2007 domain, namely in Outlook Voice Access.
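
    At the core of the MLLR adaptation mentioned above is an affine transform of the Gaussian means, mu' = A*mu + b, estimated from a small amount of adaptation data. A minimal sketch of applying an already-estimated transform (the maximum-likelihood estimation step itself is omitted, and the values below are stand-ins):

        import numpy as np

        def apply_mllr_means(means: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
            """Apply an MLLR mean transform mu' = A @ mu + b to every Gaussian mean.

            means: (n_gaussians, dim) matrix of HMM state output means.
            A, b:  transform estimated from a small amount of adaptation
                   (here, non-native accented) data.
            """
            return means @ A.T + b

        # Illustrative use with 39-dimensional MFCC-style features
        # and stand-in values for the model and the transform.
        dim = 39
        means = np.random.randn(1000, dim)
        A = np.eye(dim) + 0.01 * np.random.randn(dim, dim)
        b = 0.1 * np.random.randn(dim)
        adapted = apply_mllr_means(means, A, b)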

    A Robust Architecture For Human Language Technology Systems

    Early human language technology systems were designed in a monolithic fashion. As these systems became more complex, this design became untenable. In its place, the concept of distributed processing evolved, wherein the monolithic structure was decomposed into a number of functional components that could interact through a common protocol. This distributed framework was readily accepted by the research community and has been the cornerstone of advances in cutting-edge human language technology prototype systems. The Defense Advanced Research Projects Agency (DARPA) Communicator program has been highly successful in implementing this approach. The program has fueled the design and development of impressive human language technology applications. Its distributed framework has offered numerous benefits to the research community, including reduced prototype development time, sharing of components across sites, and provision of a standard evaluation platform. It has also enabled the development of client-server applications with complex inter-process communication between modules. However, this latter feature, though beneficial, introduces complexities that reduce overall system robustness to failure. In addition, the ability to handle multiple users and multiple applications from a common interface is not innately supported. This thesis describes enhancements to the original Communicator architecture that address robustness issues and provide a multi-user, multi-application environment by enabling automated server startup, error detection, and correction. Extensive experimentation and analysis were performed to measure the improvements in robustness due to these enhancements. A 7.2% improvement in robustness was achieved on the address querying task, the most complex task in the human language technology system.
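
    The automated server startup and error detection described above can be pictured as a watchdog loop that respawns failed components. This is a minimal sketch under assumed component commands, not the Communicator implementation itself:

        import subprocess
        import time

        # Hypothetical component commands; the real hub and servers differ.
        SERVERS = {
            "recognizer": ["./recognizer_server", "--port", "9001"],
            "dialog":     ["./dialog_server", "--port", "9002"],
        }

        def watchdog(poll_seconds: float = 1.0):
            """Start every server, then restart any that exits."""
            procs = {name: subprocess.Popen(cmd) for name, cmd in SERVERS.items()}
            while True:
                for name, proc in procs.items():
                    if proc.poll() is not None:  # process exited: error detected
                        print(f"{name} died with code {proc.returncode}; restarting")
                        procs[name] = subprocess.Popen(SERVERS[name])
                time.sleep(poll_seconds)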

    Adaptable dialogue architecture and runtime engine (AdaRTE): A framework for rapid prototyping of health dialog systems

    Spoken dialog systems have been increasingly employed to provide ubiquitous telephone access to information and services for the non-Internet-connected public. They have been successfully applied in the health care context; however, speech technology requires a considerable development investment. The advent of VoiceXML reduced the proliferation of incompatible dialog formalisms, at the expense of adding even more complexity. This paper introduces a novel architecture for dialog representation and interpretation, AdaRTE, which allows developers to lay out dialog interactions through a high-level formalism offering both declarative and procedural features. AdaRTE's aim is to provide a ground for deploying complex and adaptable dialogs while allowing experimentation with, and incremental adoption of, innovative speech technologies. It enhances augmented transition networks with dynamic behavior and drives multiple back-end realizers, including VoiceXML. It has been especially targeted at the health care context because of the great scale involved and the need to reduce the barrier to widespread adoption of dialog systems.
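
    An augmented transition network with dynamic behavior, the formalism AdaRTE builds on, can be pictured as states whose arcs carry a condition and an action over a mutable dialog context. A minimal, hypothetical sketch; AdaRTE's actual formalism is richer and additionally drives back-end realizers such as VoiceXML:

        # Each arc: (condition, action, next_state). Conditions and actions
        # take the dialog context, which actions may update at run time.
        network = {
            "ask_symptom": [
                (lambda ctx: "symptom" in ctx,
                 lambda ctx: ctx.update(prompt="How long have you had it?"),
                 "ask_duration"),
                (lambda ctx: True,
                 lambda ctx: ctx.update(prompt="What symptom are you having?"),
                 "ask_symptom"),
            ],
            "ask_duration": [
                (lambda ctx: "duration" in ctx,
                 lambda ctx: ctx.update(prompt="Thank you."),
                 "done"),
            ],
        }

        def step(state: str, ctx: dict) -> str:
            """Follow the first arc whose condition holds; run its action."""
            for condition, action, next_state in network[state]:
                if condition(ctx):
                    action(ctx)
                    return next_state
            return state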