1,514 research outputs found
Towards Natural Human Control and Navigation of Autonomous Wheelchairs
Approximately 2.2 million people in the United States depend on a wheelchair to assist with their mobility. Often times, the wheelchair user can maneuver around using a conventional joystick. Visually impaired or wheelchair patients with restricted hand mobility, such as stroke, arthritis, limb injury, Parkinson’s, cerebral palsy or multiple sclerosis, prevent them from using traditional joystick controls. The resulting mobility limitations force these patients to rely on caretakers to perform everyday tasks. This minimizes the independence of the wheelchair user. Modern day speech recognition systems can be used to enhance user experiences when using electronic devices. By expanding the motorized wheelchair control interface to include the detection of user speech commands, the independence is given back to the mobility impaired. A speech recognition interface was developed for a smart wheelchair. By integrating navigation commands with a map of the wheelchair’s surroundings, the wheelchair interface is more natural and intuitive to use. Complex speech patterns are interpreted for users to command the smart wheelchair to navigate to specified locations within the map. Pocketsphinx, a speech toolkit, is used to interpret the vocal commands. A language model and dictionary were generated based on a set of possible commands and locations supplied to the speech recognition interface. The commands fall under the categories of speed, directional, or destination commands. Speed commands modify the relative speed of the wheelchair. Directional commands modify the relative direction of the wheelchair. Destination commands require a known location on a map to navigate to. The completion of the speech input processer and the connection between wheelchair components via the Robot Operating System make map navigation possible
Robotic user interface enabled interactive dialogue with intelligent spaces
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (leaves 81-86).Users can communicate with ubiquitous computing environments by natural means such as voice communication. However, users of the Intelligent Room at MIT CSAIL, a ubiquitous environment, have reported dissatisfaction communicating with the room due to the absence of a focal point and the room's inability to hold a dialogue. To enrich the user's interactive experience, we integrated a Robotic User Interface to the room, and augmented the room's natural language system to enable it to hold dialogues with users. The robotic teddy bear serves two purposes. First, it acts as the focal point of the room which users can address. Second, it enables the room to physically communicate with users by robotic gestures. We also incorporated a book recommendation system to illustrate the room's new ability to converse with users. These enhancements have heightened user experience in communicating with the Intelligent Room, as indicated by our user study.by Rubaiyat Khan.M.Eng
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among the other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate on approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts were based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks.Comment: PhD Thesis Unitn, 201
Contextual awareness, messaging and communication in nomadic audio environments
Thesis (M.S.)--Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1998.Includes bibliographical references (p. 119-122).Nitin Sawhney.M.S
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose
is to extract either one or more target speech signals, respectively, from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance
Narrated guided tour following and interpretation by an autonomous wheelchair
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student submitted PDF version of thesis.Includes bibliographical references (p. 79-81).This work addresses the fundamental problem of how a robot acquires local knowledge about its environment. The domain that we are concerned with is a speech-commandable robotic wheelchair operating in a home/special care environment, capable of navigating autonomously to a verbally-specified location in the environment. We address this problem by incorporating a narrated guided tour following capability into the autonomous wheelchair. In our method, a human gives a narrated guided tour through the environment, while the wheelchair follows. The guide carries out a continuous dialogue with the wheelchair, describing the names of the salient locations in and around his/her immediate vicinity. The wheelchair constructs a metrical map of the environment, and based on the spatial structure and the locations of the described places, segments the map into a topological representation with corresponding tagged locations. This representation of the environment allows the wheelchair to interpret and implement high-level navigation commands issued by the user. To achieve this capability, our system consists of an autonomous wheelchair, a person- following module allowing the wheelchair to track and follow the tour guide as s/he conducts the tour, a simultaneous localization and mapping module to construct the metric gridmap, a spoken dialogue manager to acquire semantic information about the environment, a map segmentation module to bind the metrical and topological representations and to relate tagged locations to relevant nodes, and a navigation module to utilize these representations to provide speech-commandable autonomous navigation.by Sachithra Madhawa Hemachandra.S.M
Bio-motivated features and deep learning for robust speech recognition
Mención Internacional en el título de doctorIn spite of the enormous leap forward that the Automatic Speech
Recognition (ASR) technologies has experienced over the last five years
their performance under hard environmental condition is still far from
that of humans preventing their adoption in several real applications.
In this thesis the challenge of robustness of modern automatic speech
recognition systems is addressed following two main research lines.
The first one focuses on modeling the human auditory system to
improve the robustness of the feature extraction stage yielding to novel
auditory motivated features. Two main contributions are produced.
On the one hand, a model of the masking behaviour of the Human
Auditory System (HAS) is introduced, based on the non-linear filtering
of a speech spectro-temporal representation applied simultaneously
to both frequency and time domains. This filtering is accomplished
by using image processing techniques, in particular mathematical
morphology operations with an specifically designed Structuring Element
(SE) that closely resembles the masking phenomena that take
place in the cochlea. On the other hand, the temporal patterns of
auditory-nerve firings are modeled. Most conventional acoustic features
are based on short-time energy per frequency band discarding
the information contained in the temporal patterns. Our contribution
is the design of several types of feature extraction schemes based on
the synchrony effect of auditory-nerve activity, showing that the modeling
of this effect can indeed improve speech recognition accuracy in
the presence of additive noise. Both models are further integrated into
the well known Power Normalized Cepstral Coefficients (PNCC).
The second research line addresses the problem of robustness in
noisy environments by means of the use of Deep Neural Networks
(DNNs)-based acoustic modeling and, in particular, of Convolutional
Neural Networks (CNNs) architectures. A deep residual network
scheme is proposed and adapted for our purposes, allowing Residual
Networks (ResNets), originally intended for image processing tasks,
to be used in speech recognition where the network input is small
in comparison with usual image dimensions. We have observed that
ResNets on their own already enhance the robustness of the whole system
against noisy conditions. Moreover, our experiments demonstrate
that their combination with the auditory motivated features devised
in this thesis provide significant improvements in recognition accuracy
in comparison to other state-of-the-art CNN-based ASR systems
under mismatched conditions, while maintaining the performance in
matched scenarios.
The proposed methods have been thoroughly tested and compared
with other state-of-the-art proposals for a variety of datasets and
conditions. The obtained results prove that our methods outperform
other state-of-the-art approaches and reveal that they are suitable for
practical applications, specially where the operating conditions are
unknown.El objetivo de esta tesis se centra en proponer soluciones al problema
del reconocimiento de habla robusto; por ello, se han llevado a cabo
dos líneas de investigación.
En la primera líınea se han propuesto esquemas de extracción de características novedosos, basados en el modelado del comportamiento
del sistema auditivo humano, modelando especialmente los fenómenos
de enmascaramiento y sincronía. En la segunda, se propone mejorar
las tasas de reconocimiento mediante el uso de técnicas de
aprendizaje profundo, en conjunto con las características propuestas.
Los métodos propuestos tienen como principal objetivo, mejorar la
precisión del sistema de reconocimiento cuando las condiciones de
operación no son conocidas, aunque el caso contrario también ha sido
abordado.
En concreto, nuestras principales propuestas son los siguientes:
Simular el sistema auditivo humano con el objetivo de mejorar
la tasa de reconocimiento en condiciones difíciles, principalmente
en situaciones de alto ruido, proponiendo esquemas de
extracción de características novedosos.
Siguiendo esta dirección, nuestras principales propuestas se detallan a continuación:
• Modelar el comportamiento de enmascaramiento del sistema
auditivo humano, usando técnicas del procesado de
imagen sobre el espectro, en concreto, llevando a cabo el
diseño de un filtro morfológico que captura este efecto.
• Modelar el efecto de la sincroní que tiene lugar en el nervio
auditivo.
• La integración de ambos modelos en los conocidos Power
Normalized Cepstral Coefficients (PNCC).
La aplicación de técnicas de aprendizaje profundo con el objetivo
de hacer el sistema más robusto frente al ruido, en particular
con el uso de redes neuronales convolucionales profundas, como
pueden ser las redes residuales.
Por último, la aplicación de las características propuestas en
combinación con las redes neuronales profundas, con el objetivo
principal de obtener mejoras significativas, cuando las condiciones
de entrenamiento y test no coinciden.Programa Oficial de Doctorado en Multimedia y ComunicacionesPresidente: Javier Ferreiros López.- Secretario: Fernando Díaz de María.- Vocal: Rubén Solera Ureñ
- …