
    Razvoj akustičkog modela hrvatskog jezika pomoću alata HTK

    This paper presents the development of an acoustic model of the Croatian language for automatic speech recognition (ASR). Continuous speech recognition is performed by means of Hidden Markov Models (HMMs) implemented with the HMM Toolkit (HTK). In order to adapt HTK to the target language, a novel algorithm for Croatian language transcription (CLT) has been developed. It is based on phonetic assimilation rules that are applied within uttered words. Phonetic questions for state tying of different triphone models have also been developed. An automated system for training and evaluating acoustic models has been developed and integrated with a new graphical user interface (GUI). The targeted applications of this ASR system are stress inoculation training (SIT) and virtual reality exposure therapy (VRET). Adaptability of the model to a closed set of speakers is important for such applications, and the paper investigates the applicability of the HTK tool in typical scenarios. Robustness of the tool to a new language was tested in matched conditions by training an English model in parallel as a baseline. Ten native Croatian speakers participated in the experiments. Encouraging results achieved with the developed Croatian model are reported.
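
    The CLT step is described only as a set of phonetic assimilation rules applied within words. As a rough, purely illustrative sketch of what one such rule-based pass might look like (the rule table, the handling of digraphs, and the example words below are assumptions, not the actual CLT algorithm from the paper), consider regressive voicing assimilation of obstruents:

        # Simplified illustration of one Croatian assimilation rule
        # (regressive voicing assimilation); NOT the paper's CLT algorithm.
        DEVOICE = {"b": "p", "d": "t", "g": "k", "z": "s", "ž": "š", "đ": "ć"}
        VOICE = {v: k for k, v in DEVOICE.items()}
        VOICELESS_TRIGGERS = {"c", "č", "ć", "f", "h", "k", "p", "s", "š", "t"}

        def assimilate_voicing(word: str) -> str:
            """Apply regressive voicing assimilation within one word."""
            chars = list(word.lower())
            # Walk right to left so each consonant sees the already-assimilated
            # consonant to its right.
            for i in range(len(chars) - 2, -1, -1):
                cur, nxt = chars[i], chars[i + 1]
                if nxt in DEVOICE and cur in VOICE:
                    chars[i] = VOICE[cur]      # voiceless obstruent becomes voiced
                elif nxt in VOICELESS_TRIGGERS and cur in DEVOICE:
                    chars[i] = DEVOICE[cur]    # voiced obstruent becomes voiceless
            return "".join(chars)

        print(assimilate_voicing("vrabca"))  # -> "vrapca"  (vrabac + genitive ending)
        print(assimilate_voicing("svatba"))  # -> "svadba"  (svat + -ba)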

    Towards Natural Human Control and Navigation of Autonomous Wheelchairs

    Approximately 2.2 million people in the United States depend on a wheelchair for mobility. Often, the wheelchair user can maneuver using a conventional joystick. However, visual impairment or conditions that restrict hand mobility, such as stroke, arthritis, limb injury, Parkinson's disease, cerebral palsy or multiple sclerosis, prevent many users from operating traditional joystick controls. The resulting mobility limitations force these patients to rely on caretakers to perform everyday tasks, which reduces the independence of the wheelchair user. Modern speech recognition systems can be used to enhance the user experience with electronic devices. By expanding the motorized wheelchair control interface to include the detection of user speech commands, independence is given back to the mobility impaired. A speech recognition interface was developed for a smart wheelchair. By integrating navigation commands with a map of the wheelchair's surroundings, the wheelchair interface is more natural and intuitive to use. Complex speech patterns are interpreted so that users can command the smart wheelchair to navigate to specified locations within the map. Pocketsphinx, a speech recognition toolkit, is used to interpret the vocal commands. A language model and dictionary were generated from the set of possible commands and locations supplied to the speech recognition interface. The commands fall into three categories: speed, directional, and destination commands. Speed commands modify the relative speed of the wheelchair, directional commands modify its relative direction, and destination commands require a known location on the map to navigate to. The completion of the speech input processor and the connection between wheelchair components via the Robot Operating System make map navigation possible.
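
    The abstract groups commands into speed, directional, and destination categories but does not give the grammar or dispatch logic. A hedged sketch of how a recognized phrase might be routed into those categories follows; the phrases, map locations, and the dispatch function are illustrative assumptions, not the thesis implementation, and the actual system uses Pocketsphinx output and ROS messages that are not shown here.

        # Illustrative dispatcher for recognized wheelchair commands.
        SPEED_WORDS = {"faster": 0.2, "slower": -0.2, "stop": 0.0}               # relative speed change
        DIRECTION_WORDS = {"left": 90, "right": -90, "forward": 0, "back": 180}  # heading change, degrees
        KNOWN_LOCATIONS = {"kitchen": (3.2, 1.5), "bedroom": (7.0, 4.1)}         # hypothetical map goals (x, y)

        def dispatch(utterance: str):
            """Classify a phrase as a speed, directional, or destination command."""
            for word in utterance.lower().split():
                if word in SPEED_WORDS:
                    return ("speed", SPEED_WORDS[word])
                if word in DIRECTION_WORDS:
                    return ("direction", DIRECTION_WORDS[word])
                if word in KNOWN_LOCATIONS:
                    return ("destination", KNOWN_LOCATIONS[word])
            return ("unknown", None)

        print(dispatch("go to the kitchen"))   # ('destination', (3.2, 1.5))
        print(dispatch("turn left"))           # ('direction', 90)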

    Automatic Speech recognition, with large vocabulary, robustness, independence of speaker and multilingual processing

    Throughout this work, the large-vocabulary continuous speech recognition system Julius is used together with the Hidden Markov Model Toolkit (HTK). The main characteristics of the Julius system are described, and the system itself was modified. First, the theory of speech signal recognition is presented. Experiments are carried out with hidden Markov model adaptation and with K-fold cross-validation. Speech recognition results after acoustic adaptation to a specific speaker (and after creating language models specific to a demonstration scenario for the system) showed an 86.39% sentence accuracy for the Dutch acoustic models. The same data show a 94.44% semantic sentence accuracy.
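
    K-fold cross-validation is mentioned without details of the protocol. A minimal sketch of the split itself is shown below; the utterance identifiers and the value of K are assumptions, and the HTK/Julius training and scoring steps that would run on each fold are not shown.

        # Minimal K-fold split over a list of utterance identifiers.
        import random

        def k_fold_splits(items, k=5, seed=0):
            """Yield (train, test) lists for K-fold cross-validation."""
            items = list(items)
            random.Random(seed).shuffle(items)
            folds = [items[i::k] for i in range(k)]
            for i in range(k):
                test = folds[i]
                train = [x for j, fold in enumerate(folds) if j != i for x in fold]
                yield train, test

        # Hypothetical utterance IDs; each fold would be scored with the adapted models.
        utterances = [f"utt_{n:03d}" for n in range(20)]
        for fold_no, (train, test) in enumerate(k_fold_splits(utterances)):
            print(fold_no, len(train), len(test))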

    Error handling in multimodal voice-enabled interfaces of tour-guide robots using graphical models

    Mobile service robots are going to play an increasing role in human society. Voice-enabled interaction with service robots becomes very important if such robots are to be deployed in real-world environments and accepted by the vast majority of potential human users. The research presented in this thesis addresses the problem of integrating speech recognition into an interactive voice-enabled interface of a service robot, in particular a tour-guide robot. The task of a tour-guide robot is to engage visitors to mass exhibitions (users) in dialogue, providing the services it is designed for (e.g. exhibit presentations) within a limited time. In managing tour-guide dialogues, extracting the user goal (intention) behind a request for a particular service at each dialogue state is the key issue. Under mass-exhibition conditions, speech recognition errors are inevitable because of noisy speech and uncooperative users with no prior experience of robots. Such errors can jeopardize user goal identification, and wrongly identified user goals can lead to communication failures. Therefore, to reduce the risk of such failures, methods for detecting and compensating for communication failures in human-robot dialogue are needed. During the short-term interaction with visitors, the interpretation of the user goal at each dialogue state can be improved by combining speech recognition in the speech modality with information from other available robot modalities. The methods presented in this thesis exploit probabilistic models for fusing information from speech and auxiliary modalities of the robot for user goal identification and communication failure detection. To compensate for detected communication failures, we investigate multimodal methods for recovery from communication failures. To model the process of modality fusion, taking into account the uncertainties in the information extracted from each input modality during human-robot interaction, we use the probabilistic framework of Bayesian networks. Bayesian networks are graphical models that represent a joint probability function over a set of random variables. They are used to model the dependencies among variables associated with the user goals, modality-related events (e.g. the event of user presence, inferred from the laser scanner modality of the robot), and observed modality features providing evidence in favor of these modality events. Bayesian networks are used to calculate posterior probabilities over the possible user goals at each dialogue state. These probabilities serve as a basis for deciding whether the user goal is valid, i.e. whether it can be mapped onto a tour-guide service (e.g. exhibit presentation) or is undefined, signaling a possible communication failure. The Bayesian network can also be used to elicit probabilities over the modality events, revealing information about the possible cause of a communication failure. Introducing new user goal aspects (e.g. new modality events and related features) that provide auxiliary information for detecting communication failures makes the design process cumbersome, calling for a systematic approach to Bayesian network modelling. Generally, introducing new variables for user goal identification in the Bayesian networks can lead to complex and computationally expensive models. In order to make the design process more systematic and modular, we adapt principles from the theory of grounding in human communication.
    When people communicate, they resolve understanding problems in a collaborative joint effort of providing evidence of common shared knowledge (grounding). We use Bayesian network topologies, tailored to limited computational resources, to model a state-based grounding model that fuses information from three different input modalities (laser, video and speech) to infer possible grounding states. These grounding states are associated with modality events showing whether the user is present in range for communication, whether the user is attending to the interaction, whether the speech modality is reliable, and whether the user goal is valid. The state-based grounding model is used to compute probabilities that intermediary grounding states have been reached. This serves as a basis for detecting whether the user has reached the final grounding state, or whether a repair dialogue sequence is needed. In the case of a repair dialogue sequence, the tour-guide robot can exploit the multiple available modalities along with speech. For example, if the user has failed to reach the grounding state related to her/his presence in range for communication, the robot can use its move modality to search for and attract the attention of visitors. When speech recognition is detected to be unreliable, the robot can offer the alternative use of the buttons modality in the repair sequence. Given the probability of each grounding state and the dialogue sequence that can be executed in the next dialogue state, a tour-guide robot has different preferences over the possible dialogue continuations. If the possible dialogue sequences at each dialogue state are defined as actions, the principle of maximum expected utility (MEU) provides an explicit way of selecting an action, based on the action's utility, given the evidence about the user goal at each dialogue state. Decision networks, constructed as graphical models based on Bayesian networks, are proposed to perform MEU-based decisions, incorporating the utility of the actions to be chosen at each dialogue state by the tour-guide robot. These action utilities are defined taking the tour-guide task requirements into account. The proposed graphical models for user goal identification and dialogue error handling in human-robot dialogue are evaluated in experiments with multimodal data. These data were collected during the operation of the tour-guide robot RoboX at the Autonomous System Lab of EPFL and at the Swiss National Exhibition in 2002 (Expo.02). The evaluation experiments use component- and system-level metrics for technical (objective) and user-based (subjective) evaluation. On the component level, the technical evaluation is done by calculating accuracies as objective measures of the performance of the grounding model and of the resulting user goal identification in dialogue. The benefit of the proposed error handling framework is demonstrated by comparing the accuracy of a baseline interactive system, employing only speech recognition for user goal identification, with that of a system equipped with multimodal grounding models for error handling.
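
    The abstract describes computing posterior probabilities over grounding states and then choosing the dialogue continuation by maximum expected utility, but gives no network structure or utility values. The sketch below shows only the generic MEU computation with toy numbers; the states, posterior, actions, and utilities are invented for illustration and are not taken from the thesis.

        # Toy MEU action selection over grounding states.
        # In the thesis the posterior would come from a Bayesian network fusing
        # laser, video and speech evidence; here it is hard-coded.
        posterior = {
            "goal_valid": 0.55,
            "goal_undefined": 0.30,
            "user_absent": 0.15,
        }

        utilities = {  # U(action, state), assumed values
            "present_exhibit": {"goal_valid": 10, "goal_undefined": -2, "user_absent": -5},
            "ask_to_repeat":   {"goal_valid": 2,  "goal_undefined": 6,  "user_absent": -3},
            "search_for_user": {"goal_valid": -4, "goal_undefined": 0,  "user_absent": 8},
        }

        def expected_utility(action):
            return sum(posterior[s] * utilities[action][s] for s in posterior)

        for action in utilities:
            print(f"{action}: EU = {expected_utility(action):.2f}")
        print("MEU action:", max(utilities, key=expected_utility))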

    Design of hardware architectures for HMM–based signal processing systems with applications to advanced human-machine interfaces

    In this thesis a new approach is described for the development of human-computer interfaces, in particular for pattern recognition systems based on Hidden Markov Models (HMMs). The research started from the development of techniques for the realization of natural-language speech recognition systems, with the HMM chosen as the main algorithmic tool for building the system. After the early work, the goal was extended to the development of a hardware architecture providing a reconfigurable tool that can be used in any pattern recognition task, not only in speech recognition. The work is thus focused on the development of dedicated hardware architectures, but new results have also been obtained at the application level on the classification of electroencephalographic (EEG) signals with HMMs. First, a system-level architecture has been developed that is applicable to any HMM-based pattern recognition system and has been conceived so that it can operate as a stand-alone system. A flexible and completely reconfigurable hardware HMM processor was then described in VHDL and successfully simulated; a parallel array of these processors constitutes the core processing block of the developed architecture. Based on the VHDL design, two suitable FPGA-based fast-prototyping platforms were selected as targets for implementation tests. Different configurations of parallel HMM processor arrays were mapped onto the target FPGAs, and the solutions offering the best balance between performance and hardware resource utilization were selected for further analysis. A software HMM-based pattern recognition system was chosen as the reference for verifying the functionality of the implemented architectures, and a set of tests was designed to check that the hardware behaves according to the initial specifications. The implemented versions of the system were compared with the reference software on the basis of the test results, and the behavior was found to match the required functionality. Finally, the implementation of the parallel HMM processor array was applied to two real-world applications: a speech recognition task and a brain-computer interface task based on EEG signals. In both cases the architecture proved functionally suitable and powerful enough to handle the task without problems. The application of hardware processing to speech recognition opens new perspectives in the design of such systems because of the considerable gain in execution time, while the application to EEG processing introduces a new approach to the classification of these signals and shows how interfaces based on the classification of spontaneous thought could be developed in the future. The work started with this thesis can evolve in many directions. Effort could be spent on the full implementation of the proposed architecture as a stand-alone reconfigurable system for accelerating any kind of HMM-based pattern recognition task. The potential performance of such a system could open the way to highly complex real-time classifiers, and thus to the realization of truly multimodal interfaces, with a wide range of applications, from space systems to assistive systems for people with disabilities.
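
    The abstract describes the hardware HMM processor only at the architectural level. As a point of reference for the per-state recursion such a processor has to accelerate, here is a minimal log-domain Viterbi decoder in Python; the model sizes and parameter values are assumptions for illustration and are unrelated to the thesis implementation.

        # Minimal log-domain Viterbi decoder:
        #   delta_t(j) = max_i [ delta_{t-1}(i) + log a_ij ] + log b_j(o_t)
        import numpy as np

        def viterbi(log_A, log_B, log_pi):
            """log_A: (N,N) transition log-probs, log_B: (T,N) observation
            log-likelihoods, log_pi: (N,) initial log-probs."""
            T, N = log_B.shape
            delta = np.empty((T, N))
            back = np.zeros((T, N), dtype=int)
            delta[0] = log_pi + log_B[0]
            for t in range(1, T):
                scores = delta[t - 1][:, None] + log_A   # scores[i, j]: transition i -> j
                back[t] = scores.argmax(axis=0)
                delta[t] = scores.max(axis=0) + log_B[t]
            path = [int(delta[-1].argmax())]
            for t in range(T - 1, 0, -1):
                path.append(int(back[t, path[-1]]))
            return path[::-1], float(delta[-1].max())

        # Toy 3-state left-to-right model with 5 observation frames (assumed numbers).
        rng = np.random.default_rng(0)
        log_A = np.log(np.array([[0.6, 0.4, 0.0],
                                 [0.0, 0.7, 0.3],
                                 [0.0, 0.0, 1.0]]) + 1e-12)
        log_B = np.log(rng.dirichlet(np.ones(3), size=5))  # stand-in emission scores
        log_pi = np.log(np.array([1.0, 1e-12, 1e-12]))
        print(viterbi(log_A, log_B, log_pi))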

    Speech Recognition

    Chapters in the first part of the book cover the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals, methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition, such as speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, like mobile communication services and smart homes.

    Ses veya Arayüz Yardımı ile Kontrol Edilebilen Mobil Robot Kol Tasarımı

    In parallel with the rapid development of robot technology, mobile vehicle, robot arm and speech processing technologies have also advanced rapidly. Among the most important requirements expected of robots in this technological development are safety, problem solving and speed. In this study, the aim was to mount a robot arm on a mobile vehicle and to make these systems operate more efficiently and quickly by controlling them through a designed interface and a speech system. In line with these aims, a mobile vehicle carrying a robot arm was first designed, and then both the mobile vehicle and the robot arm were made controllable through the interface as well as through voice commands. When the test results were examined, it was observed that control with voice commands was more efficient than control through the interface.

    A novel lip geometry approach for audio-visual speech recognition

    By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world in recent years to incorporate lip movements into speech recognition; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of the relationships between visual and speech information, specifically using lip geometry information because of its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate the visual and speech modalities. This thesis makes several contributions. First, it presents a new method to extract lip geometry features using a combination of a skin colour filter, a border-following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared with existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique has been developed that adapts to dynamic differences in the way words are uttered by speakers, determining the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual or speech modality is chosen by measuring the quality of the audio through kurtosis and skewness analysis, driven by white-noise confusion. Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise-ratio conditions using the NOISEX-92 dataset.
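
    The decision fusion strategy is described only as selecting a modality by measuring audio quality with kurtosis and skewness. A hedged sketch of that idea is shown below; the thresholds, the helper name, and the stand-in signals are assumptions, and the statistics are computed with SciPy rather than with the thesis implementation.

        # Illustrative audio-quality gate for audio-visual decision fusion.
        # Clean speech amplitudes are strongly super-Gaussian (high kurtosis);
        # heavy additive noise pushes the distribution toward Gaussian.
        import numpy as np
        from scipy.stats import kurtosis, skew

        def choose_modality(audio, kurt_threshold=1.0, skew_threshold=0.5):
            """Return 'audio' if the waveform statistics look clean, else 'visual'."""
            k = kurtosis(audio, fisher=True)   # excess kurtosis: ~0 for a Gaussian signal
            s = abs(skew(audio))
            return "audio" if (k > kurt_threshold or s > skew_threshold) else "visual"

        rng = np.random.default_rng(1)
        speech_like = rng.laplace(size=16000)  # super-Gaussian stand-in for clean speech
        noise_like = rng.normal(size=16000)    # Gaussian stand-in for noise-dominated audio
        print(choose_modality(speech_like))    # expected: 'audio'
        print(choose_modality(noise_like))     # expected: 'visual'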

    Tools for expressive gesture recognition and mapping in rehearsal and performance

    Thesis (S.M.) -- Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2010. By Elena Naomi Jessop, S.M. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 97-101). As human movement is an incredibly rich mode of communication and expression, performance artists working with digital media often use performers' movement and gestures to control and shape that digital media as part of a theatrical, choreographic, or musical performance. In my own work, I have found that strong, semantically-meaningful mappings between gesture and sound or visuals are necessary to create compelling performance interactions. However, the existing systems for developing mappings between incoming data streams and output media have extremely low-level concepts of "gesture." The actual programming process focuses on low-level sensor data, such as the voltage values of a particular sensor, which limits the user in his or her thinking process, requires users to have significant programming experience, and loses the expressive, meaningful, and metaphor-rich content of the movement. To remedy these difficulties, I have created a new framework and development environment for gestural control of media in rehearsal and performance, allowing users to create clear and intuitive mappings in a simple and flexible manner by using high-level descriptions of gestures and of gestural qualities. This approach, the Gestural Media Framework, recognizes continuous gesture and translates Laban Effort Notation into the realm of technological gesture analysis, allowing for the abstraction and encapsulation of sensor data into movement descriptions. As part of the evaluation of this system, I choreographed four performance pieces that use this system throughout the performance and rehearsal process to map dancers' movements to manipulation of sound and visual elements. This work has been supported by the MIT Media Laboratory.
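
    The Gestural Media Framework is described as abstracting raw sensor data into higher-level movement-quality descriptions, in the spirit of Laban Effort Notation, which then drive media. The sketch below only illustrates that layering; the quality measures, formulas, thresholds, and sound parameters are invented and are not the framework's actual design.

        # Illustrative two-layer mapping: sensor samples -> movement qualities -> sound parameters.
        import math

        def movement_qualities(positions, dt=0.01):
            """Derive crude energy- and suddenness-like descriptors from a 1-D position trace."""
            velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]
            accels = [(b - a) / dt for a, b in zip(velocities, velocities[1:])]
            energy = sum(v * v for v in velocities) / max(len(velocities), 1)
            suddenness = max((abs(a) for a in accels), default=0.0)
            return {"energy": energy, "suddenness": suddenness}

        def map_to_sound(qualities):
            """Map movement qualities to hypothetical sound parameters."""
            volume = min(1.0, qualities["energy"] / 1000.0)                 # stronger movement -> louder
            brightness = 1.0 - math.exp(-qualities["suddenness"] / 5000.0)  # more sudden -> brighter
            return {"volume": volume, "brightness": brightness}

        trace = [0.0, 0.01, 0.05, 0.2, 0.5, 0.55, 0.56]  # made-up position samples
        print(map_to_sound(movement_qualities(trace)))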