
    Aspects of Application of Neural Recognition to Digital Editions

    Artificial neural networks (ANN) are widely used in software systems which require solutions to problems without a traditional algorithmic approach, as in character recognition: ANN learn by example, so they require a consistent and well-chosen set of samples to be trained to recognize their patterns. The network is taught to react with high activity in some of its output neurons whenever an input sample belonging to a specified class (e.g. a letter shape) is presented, and can then assess how similar previously unseen samples are to any of these models. Typical OCR applications thus require a significant amount of preprocessing of such samples, like resizing images and removing all the "noise" data, letting the letter contours emerge clearly from the background. Furthermore, a huge number of samples is usually required to effectively train a network to recognize a character against all the others. This may represent an issue for palaeographical applications because of the relatively low quantity and high complexity of the digital samples available, and poses even more problems when our aim is to detect subtle differences (e.g. the special shape of a specific letter from a well-defined period and scriptorium). It would probably be wiser for scholars to define guidelines for extracting from the samples the features they consider most relevant for their purposes, and let the network deal with just a subset of the overwhelming amount of detailed nuances available. ANN are no magic, and it is always the careful judgement of scholars that provides a theoretical foundation for any computer-based tool they might want to use to help them solve their problems: we can easily illustrate this point with examples drawn from any other application of IT to the humanities. Just as we can expect no magic in detecting alliterations in a text if we simply feed a system with a collection of letters, we cannot claim that a neural recognition system will perform well with a relatively small sample where each shape is fed as it is, without instructing the system about the features scholars define as relevant. Even before any ANN implementation, it is exactly this theoretical background which must be put to the test when planning such systems.
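
    To make the point about scholar-defined features concrete, the following is a minimal sketch (not drawn from the paper): a handful of hand-chosen descriptors are computed for each glyph image and a small network is trained on them instead of on raw pixels. The descriptors, the tiny synthetic dataset and the two-class labels are all hypothetical.

```python
# Minimal sketch: scholar-defined features instead of raw pixel data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def letter_features(img: np.ndarray) -> np.ndarray:
    """Reduce a binarised letter image (0 = background, 1 = ink) to a few
    hand-chosen descriptors: ink density, bounding-box aspect ratio, and
    relative vertical centre of mass."""
    ys, xs = np.nonzero(img)
    h = ys.ptp() + 1 if ys.size else 1
    w = xs.ptp() + 1 if xs.size else 1
    return np.array([
        img.mean(),                 # ink density
        w / h,                      # aspect ratio of the ink bounding box
        ys.mean() / img.shape[0],   # relative vertical centre of mass
    ])

# Hypothetical samples: 40 random 32x32 binarised glyphs with two labels
# (e.g. "hand A" vs "hand B"); a real study would use curated scans.
rng = np.random.default_rng(0)
images = rng.integers(0, 2, size=(40, 32, 32))
labels = rng.integers(0, 2, size=40)

X = StandardScaler().fit_transform(np.array([letter_features(im) for im in images]))
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, labels)
print(clf.predict(X[:5]))
```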

    Classification of Arabic fricative consonants according to their places of articulation

    Many technology systems have used voice recognition applications to transcribe a speaker’s speech into text that can be used by these systems. One of the most complex tasks in speech identification is knowing which acoustic cues should be used to classify sounds. This study presents an approach for characterizing Arabic fricative consonants in two groups (sibilant and non-sibilant). From an acoustic point of view, our approach is based on the analysis of the energy distribution across frequency bands in a consonant-vowel syllable. From a practical point of view, our technique has been implemented in MATLAB and tested on a corpus built in our laboratory. The results obtained show that the percentage energy distribution in a speech signal is a very powerful parameter for the classification of Arabic fricatives. We obtained an accuracy of 92% for the non-sibilant consonants /f, χ, ɣ, ʕ, ћ, h/, 84% for the sibilants /s, sҁ, z, Ӡ, ∫/, and an overall classification rate of 89%. In comparison to other algorithms based on neural networks and support vector machines (SVM), our classification system provides a higher classification rate.
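
    The band-energy idea can be sketched as follows; the band edges, the toy consonant-vowel syllable and the 50% decision threshold are illustrative assumptions, not the values used in the study.

```python
# Illustrative sketch: percentage of spectral energy per frequency band,
# followed by a simple sibilant / non-sibilant decision.
import numpy as np

def band_energy_percentages(signal, fs, bands):
    """Return the percentage of spectral energy falling in each (lo, hi) band."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    total = spectrum.sum()
    return [100.0 * spectrum[(freqs >= lo) & (freqs < hi)].sum() / total
            for lo, hi in bands]

fs = 16000
t = np.arange(0, 0.2, 1.0 / fs)
# Hypothetical syllable: a fricative-like noise burst followed by a vowel-like tone.
syllable = np.concatenate([np.random.randn(len(t)), 0.8 * np.sin(2 * np.pi * 200 * t)])

bands = [(0, 2000), (2000, 5000), (5000, 8000)]   # assumed band layout
percents = band_energy_percentages(syllable, fs, bands)
# Sibilants concentrate energy in the high band; the 50% cut-off is illustrative.
label = "sibilant" if percents[-1] > 50 else "non-sibilant"
print(percents, label)
```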

    An HCI Speech-Based Architecture for Man-To-Machine and Machine-To-Man Communication in Yorùbá Language

    Man communicates with man by natural language, sign language, and/or gesture, but communicates with machines via electromechanical devices such as the mouse and keyboard. These media for effecting Man-To-Machine (M2M) communication are electromechanical in nature. Recent research, however, has achieved a high level of success in M2M communication using natural language, sign language, and/or gesture under constrained conditions. However, machine communication with man in the reverse direction using natural language is still in its infancy. Machines usually communicate with man in textual form. In order to achieve an acceptable quality of end-to-end M2M communication, there is a need for a robust architecture on which to develop a novel speech-to-text and text-to-speech system. In this paper, an HCI speech-based architecture for Man-To-Machine and Machine-To-Man communication in the Yorùbá language is proposed to carry the Yorùbá people along in the advancement taking place in the world of Information Technology. Dynamic Time Warping is specified in the model to measure the similarity between voice utterances in the sound library. In addition, Vector Quantization, Gaussian Mixture Models and Hidden Markov Models are incorporated in the proposed architecture for compression and observation. This approach will yield a robust Speech-To-Text and Text-To-Speech system. Keywords: Yorùbá Language, Speech Recognition, Text-To-Speech, Man-To-Machine, Machine-To-Man
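
    As a concrete illustration of the similarity measure named in the architecture, here is a minimal Dynamic Time Warping sketch over two toy feature sequences; a real system would compare per-frame acoustic features (e.g. MFCCs) of utterances from the sound library.

```python
# Minimal Dynamic Time Warping (DTW) distance between two feature sequences.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between two sequences of feature vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# Toy 1-D "utterances": the same contour spoken at two different rates.
u1 = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
u2 = np.array([[0.0], [0.5], [1.0], [2.0], [2.0], [1.0], [0.0]])
print(dtw_distance(u1, u2))
```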

    Open-set Speaker Identification

    This study is motivated by the growing need for effective extraction of intelligence and evidence from audio recordings in the fight against crime, a need made ever more apparent with the recent expansion of criminal and terrorist organisations. The main focus is to enhance the open-set speaker identification process within speaker identification systems, which are affected by noisy audio data obtained under uncontrolled environments such as in the street, in restaurants or other places of business. Consequently, two investigations are initially carried out: the effects of environmental noise on the accuracy of open-set speaker recognition, thoroughly covering conditions relevant to the considered application areas such as variable training data length, background noise and real-world noise; and the effects of short and varied-duration reference data in open-set speaker recognition. The investigations led to a novel method termed “vowel boosting” to enhance the reliability of speaker identification when operating with speech data of varied duration under uncontrolled conditions. Vowels naturally contain more speaker-specific information, so emphasising this natural phenomenon in the speech data enables better identification performance. The traditional state-of-the-art GMM-UBMs and i-vectors are used to evaluate “vowel boosting”. The proposed approach boosts the impact of the vowels on the speaker scores, which improves the recognition accuracy for the specific case of open-set identification with short and varied-duration speech material.
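
    The gist of vowel boosting can be sketched as a weighted accumulation of per-frame scores in which vowel frames count more heavily; the weighting scheme and the numbers below are assumptions made for illustration, not the exact procedure of the thesis.

```python
# Sketch of the "vowel boosting" idea: frames labelled as vowels are weighted
# more heavily when accumulating the utterance-level speaker score.
import numpy as np

def boosted_score(frame_llrs: np.ndarray, is_vowel: np.ndarray, boost: float = 2.0) -> float:
    """Weighted mean of per-frame log-likelihood ratios, emphasising vowel frames."""
    weights = np.where(is_vowel, boost, 1.0)
    return float(np.average(frame_llrs, weights=weights))

# Hypothetical per-frame LLRs (target vs. background model) and vowel labels.
frame_llrs = np.array([0.2, 1.5, 1.8, -0.3, 0.1, 2.0, 1.7, -0.5])
is_vowel   = np.array([0,   1,   1,    0,   0,   1,   1,    0], dtype=bool)

print("plain score :", frame_llrs.mean())
print("boosted     :", boosted_score(frame_llrs, is_vowel))
```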

    Embedded Speech Technology

    End-to-End models in Automatic Speech Recognition simplify the speech recognition process. They convert audio data directly into a text representation without exploiting multiple stages and systems. This direct approach is efficient and reduces potential points of error. In contrast, Sequence-to-Sequence models adopt a more integrative approach in which they use distinct models for retrieving the acoustic and language-specific features, respectively known as acoustic and language models. This integration allows for better coordination between different speech aspects, potentially leading to more accurate transcriptions. In this thesis, we explore various Speech-to-Text (STT) models, focusing mainly on End-to-End and Sequence-to-Sequence techniques. We also look into using offline STT tools such as Wav2Vec2.0, Kaldi and Vosk. These tools face challenges when handling new voice data or various accents of the same language. To address this challenge, we fine-tune the models to make them better at handling new, unseen data. In our comparison, Wav2Vec2.0 emerged as the top performer, though with a larger model size. Our approach also shows that using Kaldi and Vosk together creates a robust STT system that can identify new words using phonemes.
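
    As an example of the offline-tool path, the following is a minimal sketch of decoding one utterance with a pre-trained Wav2Vec2.0 checkpoint through the Hugging Face transformers API; the checkpoint name and audio file are placeholders, and the fine-tuning step discussed in the thesis is not shown.

```python
# Minimal offline decoding sketch with a pre-trained Wav2Vec2.0 model.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("utterance.wav")   # placeholder: 16 kHz mono audio
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # frame-level CTC logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])  # greedy CTC transcription
```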

    Some Advances in Nonlinear Speech Modeling Using Modulations, Fractals, and Chaos

    In this paper we briefly summarize our on-going work on modeling nonlinear structures in speech signals, caused by modulation and turbulence phenomena, using the theories of modulation, fractals, and chaos as well as suitable nonlinear signal analysis methods. Further, we focus on two advances: i) AM-FM modeling of fricative sounds with random modulation signals of the 1/f-noise type and ii) improved methods for speech analysis and prediction on reconstructed multidimensional attractors.
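
    A minimal sketch of the AM-FM signal model with 1/f-type modulation signals is given below; the spectral shaping used to approximate 1/f noise and all parameter values are illustrative assumptions, not those of the paper.

```python
# Sketch of an AM-FM signal with 1/f-type amplitude and frequency modulation:
# x(t) = a(t) * cos(2*pi*fc*t + 2*pi * cumulative integral of q(t)).
import numpy as np

def one_over_f_noise(n: int, fs: float, rng) -> np.ndarray:
    """Approximate 1/f noise by spectrally shaping white noise."""
    white = rng.standard_normal(n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    freqs[0] = freqs[1]                      # avoid division by zero at DC
    noise = np.fft.irfft(spectrum / np.sqrt(freqs), n)
    return noise / np.abs(noise).max()

fs, dur, fc = 16000, 0.05, 4000.0            # assumed sample rate, duration, carrier
rng = np.random.default_rng(0)
n = int(fs * dur)
t = np.arange(n) / fs

a = 1.0 + 0.3 * one_over_f_noise(n, fs, rng)           # amplitude modulation a(t)
q = 200.0 * one_over_f_noise(n, fs, rng)               # frequency deviation q(t) in Hz
phase = 2 * np.pi * fc * t + 2 * np.pi * np.cumsum(q) / fs
x = a * np.cos(phase)
print(x.shape, x[:5])
```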

    Frame-level features conveying phonetic information for language and speaker recognition

    This Thesis, developed in the Software Technologies Working Group of the Department of Electricity and Electronics of the University of the Basque Country, focuses on the research field of spoken language and speaker recognition technologies. More specifically, the research carried out studies the design of a set of features conveying spectral acoustic and phonotactic information, searches for the optimal feature extraction parameters, and analyses the integration and usage of the features in language recognition systems, and the complementarity of these approaches with regard to state-of-the-art systems. The study reveals that systems trained on the proposed set of features, denoted as Phone Log-Likelihood Ratios (PLLRs), are highly competitive, outperforming other state-of-the-art systems in several benchmarks. Moreover, PLLR-based systems also provide complementary information with regard to other phonotactic and acoustic approaches, which makes them suitable in fusions to improve the overall performance of spoken language recognition systems. The usage of these features is also studied in speaker recognition tasks. In this context, the results attained by the approaches based on PLLR features are not as remarkable as those of systems based on standard acoustic features, but they still provide complementary information that can be used to enhance the overall performance of the speaker recognition systems.
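
    The following sketch shows one common log-likelihood-ratio form of PLLR features, mapping per-frame phone posteriors to log-odds against the average of the competing phones; the toy posterior matrix is hypothetical, a real system would obtain it from a phonetic decoder, and the exact normalisation used in the thesis may differ.

```python
# Sketch of frame-level Phone Log-Likelihood Ratio (PLLR) features.
import numpy as np

def pllr(posteriors: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Map phone posteriors (frames x phones) to log-likelihood ratios of
    each phone against the average of the competing phones:
    log((N-1) * p_i / (1 - p_i))."""
    p = np.clip(posteriors, eps, 1.0 - eps)
    n_phones = p.shape[1]
    return np.log((n_phones - 1) * p / (1.0 - p))

# Hypothetical posteriors for 3 frames over 4 phone classes (rows sum to 1).
post = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])
print(pllr(post))
```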

    A survey on mouth modeling and analysis for Sign Language recognition

    © 2015 IEEE. Around 70 million Deaf people worldwide use Sign Languages (SLs) as their native languages. At the same time, they have limited reading/writing skills in the spoken language. This puts them at a severe disadvantage in many contexts, including education, work, and usage of computers and the Internet. Automatic Sign Language Recognition (ASLR) can support the Deaf in many ways, e.g. by enabling the development of systems for Human-Computer Interaction in SL and translation between sign and spoken language. Research in ASLR usually revolves around automatic understanding of manual signs. Recently, the ASLR research community has started to appreciate the importance of non-manuals, since they are related to the lexical meaning of a sign, the syntax and the prosody. Non-manuals include body and head pose, movement of the eyebrows and the eyes, as well as blinks and squints. Arguably, the mouth is one of the most involved parts of the face in non-manuals. Mouth actions related to ASLR can be either mouthings, i.e. visual syllables produced with the mouth while signing, or non-verbal mouth gestures. Both are very important in ASLR. In this paper, we present the first survey on mouth non-manuals in ASLR. We start by showing why mouth motion is important in SL and which relevant techniques exist within ASLR. Since limited research has been conducted on the automatic analysis of mouth motion in the context of ASLR, we proceed by surveying relevant techniques from the areas of automatic mouth expression and visual speech recognition which can be applied to the task. Finally, we conclude by presenting the challenges and potentials of automatic analysis of mouth motion in the context of ASLR.

    Personalizing Human-Robot Dialogue Interactions using Face and Name Recognition

    Task-oriented dialogue systems are computer systems that aim to provide an interaction indistinguishable from ordinary human conversation, with the goal of completing user-defined tasks. They achieve this by analyzing the intents of users and choosing the respective responses. Recent studies show that personalizing conversations with these systems can positively affect their perception and long-term acceptance. Personalised social robots have been widely applied in different fields to provide assistance. In this thesis we work on the development of a scientific conference assistant. The goal of this assistant is to provide conference participants with conference information and to inform them about activities for their spare time during the conference. Moreover, to increase engagement with the robot, our team has worked on personalizing the human-robot interaction by means of face and name recognition. To achieve this personalisation, the name recognition ability of the available physical robot was first improved; next, with the consent of the participants, their pictures were taken and used to recognise returning users. As acquiring consent for personal data storage is not an optimal solution, an alternative method for participant recognition using QR codes on their badges was developed and compared to the pre-trained model in terms of speed. Lastly, the personal details of each participant, such as university and country of origin, were acquired prior to the conference or during the conversation and used in the dialogues. The developed robot, called DAGFINN, was displayed at two conferences held this year in Stavanger, where the first installment did not involve the personalization feature. Hence, we conclude this thesis by discussing the influence of personalisation on dialogues with the robot and the participants' satisfaction with the developed social robot.
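
    One way the QR-badge identification path could look in code is sketched below, using OpenCV's QR detector and a local participant registry; the registry contents, file name and fallback behaviour are hypothetical and not taken from the thesis.

```python
# Sketch: decode a QR code from a camera frame and greet a known participant.
import cv2

# Hypothetical local registry mapping badge IDs to participant details.
registry = {"P-001": {"name": "Jane Doe", "university": "Example University"}}

frame = cv2.imread("badge_frame.jpg")            # placeholder camera frame
data, points, _ = cv2.QRCodeDetector().detectAndDecode(frame)

if data and data in registry:
    participant = registry[data]
    print(f"Welcome back, {participant['name']} from {participant['university']}!")
else:
    print("Unknown badge, falling back to name recognition.")
```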