    Semi-hidden Markov models for visible light communication channels

    A dissertation submitted to the Faculty of Engineering and the Built Environment, University of the Witwatersrand, Johannesburg, in fulfillment of the requirements for the degree of Master of Science in Engineering, Johannesburg, 2018.

    Visible Light Communication (VLC) is an emerging field in optical wireless communication that uses light emitting diodes (LEDs) for data transmission. LEDs are being widely adopted both indoors and outdoors due to their low cost, long lifespan and high efficiency. Furthermore, LEDs can be modulated to provide both illumination and wireless communication, and there is potential for VLC to be incorporated into future smart lighting systems. One of the current challenges in VLC is dealing with noise and interference, including interference from other dimmed, Pulse-Width Modulated (PWM) LEDs; other noise sources include natural light from the sun and artificial light from non-modulating light sources. Modelling these channels is one of the first steps towards understanding them and eventually designing techniques for mitigating the effects of noise and interference. This dissertation presents a semi-hidden Markov model, known as the Fritchman model, that discretely models the errors introduced by noise and interference in on-off keying modulated VLC channels. Models have been developed for both indoor and outdoor environments and can be used for VLC simulations and for designing error mitigation techniques. Results show that certain channels can be modelled more accurately than others. The experimental error distributions give insight into the impact that PWM interference has on VLC channels, which can assist the development of error control codes and interference avoidance techniques in standalone VLC systems, as well as in systems where VLC and smart lighting coexist. The models developed can also be used to simulate VLC channels under different channel conditions.
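
    As a rough illustration of the Fritchman model named above, the minimal Python sketch below simulates a binary error sequence from a semi-hidden Markov chain with two error-free states and one error state. The transition probabilities are illustrative assumptions, not values fitted to any measured VLC channel.

```python
import numpy as np

def simulate_fritchman(A, n_bits, error_state, seed=None):
    """Generate a binary error sequence: 1 = bit error, 0 = error-free."""
    rng = np.random.default_rng(seed)
    K = A.shape[0]
    state = rng.integers(K)
    errors = np.empty(n_bits, dtype=np.uint8)
    for t in range(n_bits):
        errors[t] = 1 if state == error_state else 0  # emit per state class
        state = rng.choice(K, p=A[state])             # Markov transition
    return errors

# Example 3-state model: states 0 and 1 are error-free, state 2 is the
# error state. The matrix is a hand-picked placeholder.
A = np.array([[0.990, 0.005, 0.005],
              [0.020, 0.970, 0.010],
              [0.300, 0.200, 0.500]])
e = simulate_fritchman(A, 100_000, error_state=2, seed=0)
print("simulated bit error rate:", e.mean())
```

    In practice the transition matrix would be estimated from measured error sequences (for example with Baum-Welch style fitting) rather than hand-picked as here.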

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and applications that operate in real-world environments, such as mobile communication services and smart homes.
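
    As a pointer to the feature-extraction front end the first part covers, here is a minimal sketch using the third-party librosa library (an assumption, not the book's own toolkit); the file name and parameter values are placeholders.

```python
import librosa

# Load a mono waveform at 16 kHz; "utterance.wav" is a placeholder path.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs over 25 ms windows (n_fft=400 samples) with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)
delta = librosa.feature.delta(mfcc)     # first-order dynamic features
print(mfcc.shape, delta.shape)          # (13, n_frames) each
```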

    Hidden Markov Models

    Hidden Markov Models (HMMs), although known for decades, have become widely used in recent years and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environmental protection and engineering. I hope that readers will find this book useful and helpful for their own research.
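
    For readers new to the topic, the forward algorithm sketched below is the basic likelihood computation underlying most of the applications listed above; the toy parameters are illustrative only.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return P(obs | model) for a discrete-emission HMM.
    pi:  (K,)   initial state probabilities
    A:   (K, K) transition matrix, A[i, j] = P(state j | state i)
    B:   (K, M) emission matrix, B[i, o] = P(symbol o | state i)
    obs: sequence of integer observation symbols
    """
    alpha = pi * B[:, obs[0]]          # initialise with the first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and absorb next symbol
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.5],
              [0.1, 0.9]])
print(forward(pi, A, B, [0, 1, 1]))
```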

    Spoken command recognition for robotics

    In this thesis, I investigate spoken command recognition technology for robotics. While high robustness is expected, the distant and noisy conditions in which the system has to operate make the task very challenging. Unlike commercial systems, which all rely on a "wake-up" word to initiate the interaction, the pipeline proposed here directly detects and recognizes commands from the continuous audio stream. In order to keep the task manageable despite low-resource conditions, I propose to focus on a limited set of commands, trading off flexibility of the system against robustness.

    Domain and speaker adaptation strategies based on a multi-task regularization paradigm are first explored. More precisely, two different methods are proposed which rely on a tied loss function that penalizes the distance between the outputs of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% on the frame error rate (FER) of the task-independent network is obtained, this carried over only partially to the phone error rate (PER), with a 1.5% improvement. Similarly, a second method explored the parallel training of the canonical network with a privileged model having access to i-vectors. This method proved less effective, with only a 1.2% improvement on the FER.

    In order to make the developed technology more accessible, I also investigated the use of a sequence-to-sequence (S2S) architecture for command classification. The use of an attention-based encoder-decoder model reduced the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in this context. In order to improve the flexibility of the trained system, I also explored strategies for few-shot learning, which make it possible to extend the set of commands with minimal data requirements. Retraining a model on the combination of original and new commands, I achieved 40.5% accuracy on the new commands with only 10 examples of each; this score goes up to 81.5% accuracy with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation, achieved even better scores, 68.8% and 88.4% accuracy with 10 and 100 examples respectively, while being faster to train. This higher performance comes at the expense of the original categories, though, on which accuracy deteriorated. These results are very promising, as the methods make it easy to extend an existing S2S model with minimal resources.

    Finally, a full spoken command recognition system (named iCubrec) has been developed for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to provide a fully hands-free experience. By segmenting only regions that are likely to contain commands, the VAD module also greatly reduces the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to the system to reject non-command speech in a live demonstration of the system.
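
    The tied loss function mentioned above can be sketched as follows in PyTorch; the linear stand-in networks, the MSE distance and the weight `lam` are assumptions for illustration, not the thesis's exact architecture or recipe.

```python
import torch
import torch.nn.functional as F

canonical = torch.nn.Linear(40, 10)   # stand-in for the task-independent model
task_net = torch.nn.Linear(40, 10)    # stand-in for one speaker/domain model
opt = torch.optim.SGD(
    list(canonical.parameters()) + list(task_net.parameters()), lr=0.1)

x = torch.randn(8, 40)                # a batch of acoustic feature frames
y = torch.randint(0, 10, (8,))        # frame-level phone targets
lam = 0.5                             # weight of the tying penalty (assumed)

opt.zero_grad()
logits_c, logits_t = canonical(x), task_net(x)
loss = (F.cross_entropy(logits_c, y)             # canonical network's loss
        + F.cross_entropy(logits_t, y)           # task-dependent loss
        + lam * F.mse_loss(logits_c, logits_t))  # tie the two outputs together
loss.backward()
opt.step()
```

    Both networks receive a gradient from the tying term, which is what lets the canonical and task-dependent models learn from one another.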

    Speech-driven animation using multi-modal hidden Markov models

    The main objective of this thesis was the synthesis of speech-synchronised motion, in particular head motion. The hypothesis that head motion can be estimated from the speech signal was confirmed. In order to achieve satisfactory results, a motion capture database was recorded, a definition of head motion in terms of articulation was formulated, a continuous stream mapping procedure was developed, and finally the synthesis was evaluated. Based on previous research into non-verbal behaviour, basic types of head motion were devised that could function as modelling units. The stream mapping method investigated in this thesis is based on Hidden Markov Models (HMMs), which employ the modelling units to map between continuous signals. The objective evaluation of the modelling parameters confirmed that head motion types could be predicted from the speech signal with an accuracy above chance, close to 70%. Furthermore, a special type of HMM called a trajectory HMM was used because it enables the synthesis of continuous output. However, head motion is a stochastic process; the trajectory HMM was therefore further extended to allow for non-deterministic output. Finally, the resulting head motion synthesis was perceptually evaluated. The effects of the "uncanny valley" were also considered in the evaluation, confirming that rendering quality influences our judgement of the movement of virtual characters. In conclusion, a general method for synthesising speech-synchronised behaviour was developed that can be applied to a whole range of behaviours.
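
    A minimal sketch of HMM-based classification of head-motion units from speech features, using the third-party hmmlearn library (an assumption, not the thesis's toolkit); the two classes, 13-dimensional features and state counts are invented for illustration, and the trajectory-HMM extension is not reproduced here.

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Stand-in speech feature frames for two hypothetical head-motion units.
train = {"nod": rng.normal(0.0, 1.0, (200, 13)),
         "tilt": rng.normal(1.5, 1.0, (200, 13))}

models = {}
for unit, X in train.items():
    # One Gaussian HMM per head-motion modelling unit.
    m = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
    m.fit(X)
    models[unit] = m

test = rng.normal(0.0, 1.0, (50, 13))     # an unseen feature sequence
scores = {unit: m.score(test) for unit, m in models.items()}
print(max(scores, key=scores.get))        # unit with highest log-likelihood
```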

    Acoustic event detection and localization using distributed microphone arrays

    Automatic acoustic scene analysis is a complex task that involves several functionalities: detection (time), localization (space), separation, recognition, etc. This thesis focuses on both acoustic event detection (AED) and acoustic source localization (ASL) when several sources may be simultaneously present in a room. In particular, the experimental work is carried out in a meeting-room scenario. Unlike previous works, which either employed models of all possible sound combinations or additionally used video signals, in this thesis the time-overlapping sound problem is tackled by exploiting the signal diversity that results from using multiple microphone-array beamformers.

    The core of this thesis is a rather computationally efficient approach that consists of three processing stages. In the first, a set of null-steering beamformers is used to carry out diverse partial signal separations, using multiple arbitrarily located linear microphone arrays, each composed of a small number of microphones. In the second stage, each beamformer output goes through a classification step, which uses models for all the targeted sound classes (HMM-GMM in the experiments). Then, in a third stage, the classifier scores, whether intra- or inter-array, are combined using a probabilistic criterion (such as MAP) or a machine-learning fusion technique (the fuzzy integral (FI) in the experiments).

    This processing scheme is applied to a set of problems of increasing complexity, defined by the assumptions made regarding the identities (plus time endpoints) and/or positions of the sounds. The thesis starts with the problem of unambiguously mapping identities to positions, continues with AED (positions assumed) and ASL (identities assumed), and ends with the integration of AED and ASL in a single system, which does not need any assumption about identities or positions.

    The evaluation experiments are carried out in a meeting-room scenario where two sources are temporally overlapped; one of them is always speech, and the other is an acoustic event from a pre-defined set. Two different databases are used: one produced by merging signals actually recorded in the UPC department's smart room, and another consisting of overlapping sound signals directly recorded in the same room in a rather spontaneous way. The experimental results with a single array show that the proposed detection system performs better than either the model-based system or a blind-source-separation-based system. Moreover, the product-rule-based combination and the FI-based fusion of the scores resulting from the multiple arrays improve the accuracies further. On the other hand, the posterior position assignment is performed with a very small error rate. Regarding ASL, and assuming an accurate AED system output, the 1-source localization performance of the proposed system is slightly better than that of the widely used SRP-PHAT system working in an event-based mode, and it performs significantly better than the latter in the more complex 2-source scenario. Finally, though the joint system suffers a slight degradation in classification accuracy with respect to the case where the source positions are known, it has the advantage of carrying out the two tasks, recognition and localization, with a single system, and it allows the inclusion of information about the prior probabilities of the source positions. It is also worth noting that, although the acoustic scenario used for experimentation is rather limited, the approach and its formalism were developed for a general case in which the number and identities of the sources are not constrained.
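
    The third-stage score combination can be illustrated with the product rule applied in the log domain, as in the minimal sketch below; the class names and log-likelihood values are invented for illustration.

```python
import numpy as np

classes = ["speech", "door_knock", "phone_ring"]

# Per-class log-likelihoods, one row per microphone-array beamformer output.
log_scores = np.array([[-10.2, -12.5, -11.9],   # array 1
                       [-11.0, -12.1, -13.4],   # array 2
                       [-9.8,  -13.0, -12.2]])  # array 3

# Summing log-likelihoods across arrays is the product rule in log domain.
fused = log_scores.sum(axis=0)
print(classes[int(fused.argmax())])             # fused decision: "speech"
```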

    Creative Support Musical Composition System: a study on Multiple Viewpoints Representations in Variable Markov Oracle

    The mid-20th century witnessed the emergence of an area of study focused on the automatic generation of musical content by computational means. Early examples concentrate on offline processing of musical data; more recently, the community has moved towards interactive, real-time musical systems. Furthermore, a recent trend stresses the importance of assistive technology, which promotes a user-in-the-loop approach by offering multiple suggestions for a given creative problem. In this context, my research aims to foster new software tools for creative support systems, where algorithms can participate collaboratively in the composition flow. In greater detail, I seek a tool that learns from variable-length musical data to provide real-time feedback during the composition process. In light of the multidimensional and hierarchical structure of music, I aim to study the representations which abstract its temporal patterns, to foster the generation of multiple ranked solutions for a given musical context. Ultimately, the subjective nature of the choice is left to the user, to whom a limited number of 'optimal' solutions are provided. A symbolic music representation manifested as Multiple Viewpoint Models, combined with the Variable Markov Oracle (VMO) automaton, is used to test the optimal interaction between the multi-dimensionality of the representation and the optimality of the VMO model in providing stylistically coherent, novel, and diverse solutions. To evaluate the system, an experiment was conducted to validate the tool in an expert-based scenario with composition students, using the Creativity Support Index test.
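
    A toy sketch of the multiple-viewpoints idea: independent first-order models over two viewpoints (pitch and duration) are combined to rank candidate continuations. The corpus, the choice of viewpoints, the add-one smoothing and the uniform weighting are illustrative assumptions; the VMO automaton itself is not reproduced here.

```python
from collections import Counter, defaultdict

# A tiny corpus of (MIDI pitch, duration in beats) events.
corpus = [(60, 1.0), (62, 0.5), (64, 0.5), (62, 1.0), (60, 0.5), (64, 1.0)]

# One first-order transition model per viewpoint: 0 = pitch, 1 = duration.
models = {0: defaultdict(Counter), 1: defaultdict(Counter)}
for a, b in zip(corpus, corpus[1:]):
    for v in models:
        models[v][a[v]][b[v]] += 1

def rank(prev, candidates):
    """Rank candidate (pitch, duration) events given the previous event."""
    def p(v, c):
        ctx = models[v][prev[v]]
        # Add-one smoothing so unseen continuations keep nonzero probability.
        return (ctx[c[v]] + 1) / (sum(ctx.values()) + 1)
    # Combine viewpoints by multiplying their (independent) probabilities.
    return sorted(candidates, key=lambda c: p(0, c) * p(1, c), reverse=True)

print(rank((62, 0.5), [(60, 1.0), (64, 0.5), (65, 0.25)]))
```

    A real multiple-viewpoints system would use richer viewpoints, higher-order contexts and weighted combination, but the ranking-by-combined-probability step works as shown.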