375 research outputs found

    Syllable classification using static matrices and prosodic features

    In this paper we explore the usefulness of prosodic features for syllable classification. To this end, we represent the syllable as a static analysis unit so that its acoustic-temporal dynamics can be merged into a set of features that the SVM classifier considers as a whole. In the first part of our experiment we used MFCCs as classification features, obtaining a maximum accuracy of 86.66%. The second part of our study tests whether prosodic information is complementary to cepstral information for syllable classification. The results show that combining the two types of information does improve classification, but further analysis is necessary for a more successful combination of the two types of features.
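    The "static analysis unit" idea above can be sketched as follows: a variable-length sequence of per-frame feature vectors (e.g. MFCCs) is resampled to a fixed number of frames and flattened, so an SVM sees the whole syllable as one vector. This is a minimal illustration on synthetic data, not the paper's actual pipeline; the frame count and class shapes are made up.

```python
import numpy as np
from sklearn.svm import SVC

def syllable_to_static_vector(frames, n_target=10):
    """Linearly resample a variable-length sequence of per-frame
    feature vectors to n_target frames, then flatten, so the SVM
    treats the syllable's temporal dynamics as a single vector."""
    frames = np.asarray(frames, dtype=float)
    idx = np.linspace(0, len(frames) - 1, n_target)
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    w = idx - lo
    resampled = (1 - w)[:, None] * frames[lo] + w[:, None] * frames[hi]
    return resampled.ravel()

# Toy demo: two 'syllable classes' with opposite feature trajectories.
rng = np.random.default_rng(0)
X, y = [], []
for label, slope in [(0, 1.0), (1, -1.0)]:
    for _ in range(40):
        T = rng.integers(8, 20)                 # variable syllable length
        t = np.linspace(0, 1, T)[:, None]
        feats = slope * t + 0.1 * rng.standard_normal((T, 13))
        X.append(syllable_to_static_vector(feats))
        y.append(label)
X, y = np.array(X), np.array(y)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
clf = SVC(kernel="rbf").fit(X[:60], y[:60])
acc = clf.score(X[60:], y[60:])
```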

    Aspects of Application of Neural Recognition to Digital Editions

    Artificial neural networks (ANN) are widely used in software systems that require solutions to problems lacking a traditional algorithmic approach, such as character recognition: ANNs learn by example, so they require a consistent and well-chosen set of samples to be trained to recognize patterns. The network is taught to respond with high activity in some of its output neurons whenever an input sample belonging to a specified class (e.g. a letter shape) is presented, and it can then assess the similarity of samples it has never encountered before. Typical OCR applications therefore require a significant amount of preprocessing, such as resizing images and removing "noise" so that the letter contours emerge clearly from the background. Furthermore, a huge number of samples is usually required to train a network effectively to recognize a character against all the others. This can be an issue for palaeographical applications because of the relatively low quantity and high complexity of available digital samples, and it poses even more problems when the aim is detecting subtle differences (e.g. the special shape of a specific letter from a well-defined period and scriptorium). It would probably be wiser for scholars to define guidelines for extracting from the samples the features they consider most relevant to their purposes, and to let the network deal with just a subset of the overwhelming amount of detailed nuances available. ANNs are no magic, and it is always the careful judgement of scholars that provides the theoretical foundation for any computer-based tool they might use; we can easily illustrate this point with examples from any other application of IT to the humanities.
Just as we can expect no magic in detecting alliterations if we simply feed a system a collection of letters, we cannot claim that a neural recognition system will perform well on a relatively small sample where each shape is fed in as-is, without instructing the system about the features scholars define as relevant. Even before any ANN implementation, it is exactly this theoretical background that must be put to the test when planning such systems.
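    The preprocessing step described above (resizing and stripping background noise so comparable letter shapes reach the network) can be sketched minimally. This is an illustrative normalization routine, not the one any particular OCR system uses; the grid size and threshold are arbitrary choices.

```python
import numpy as np

def normalize_glyph(img, size=16, threshold=0.5):
    """Minimal preprocessing sketch: binarize, crop to the ink's
    bounding box, and resample to a fixed grid, so every letter
    sample reaches the network in a comparable form."""
    ink = np.asarray(img, dtype=float) > threshold
    rows = np.any(ink, axis=1)
    cols = np.any(ink, axis=0)
    ink = ink[rows][:, cols]                  # crop background away
    r = np.linspace(0, ink.shape[0] - 1, size).astype(int)
    c = np.linspace(0, ink.shape[1] - 1, size).astype(int)
    return ink[np.ix_(r, c)].astype(float)    # nearest-neighbour resize

# A crude 'L' shape drawn at an arbitrary position and scale:
img = np.zeros((40, 40))
img[5:30, 10:13] = 1.0     # vertical stroke
img[27:30, 10:25] = 1.0    # horizontal stroke
feat = normalize_glyph(img)
```

However the glyph is shifted or scaled within the page, the output is always a fixed-size grid, which is what lets a network compare samples at all.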

    Vowel Recognition from Articulatory Position Time-Series Data

    A new approach to recognizing vowels from articulatory position time-series data is proposed and tested in this paper. This approach directly maps articulatory position time-series data to vowels without extracting articulatory features such as mouth opening. The input time-series data were time-normalized and sampled to fixed-width vectors of articulatory positions. Three commonly used classifiers (Neural Network, Support Vector Machine, and Decision Tree) were applied and their performances compared on these vectors. A single-speaker dataset of eight major English vowels, acquired using an Electromagnetic Articulograph (EMA) AG500, was used. Recognition rates under cross-validation ranged from 76.07% to 91.32% across the three classifiers. In addition, the trained decision trees were consistent with the articulatory features commonly used to distinguish vowels descriptively in classical phonetics. The findings are intended to improve the accuracy and response time of a real-time articulatory-to-acoustics synthesizer.
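    The time-normalization step above (variable-length trajectories sampled down to fixed-width vectors, then fed to a standard classifier) can be sketched like this. The data here are synthetic stand-ins for articulator tracks, and the "tongue height" channel is an invented toy feature, not the EMA data from the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def time_normalize(track, n_samples=8):
    """Sample a variable-length articulator trajectory at n_samples
    evenly spaced time points (nearest frame), yielding the
    fixed-width vector the classifiers operate on."""
    track = np.asarray(track, dtype=float)
    idx = np.linspace(0, len(track) - 1, n_samples).astype(int)
    return track[idx]

# Toy data: two vowel classes differing in a 'tongue height' channel.
rng = np.random.default_rng(1)
X, y = [], []
for label, height in [(0, -1.0), (1, 1.0)]:
    for _ in range(50):
        T = rng.integers(20, 60)               # variable utterance length
        track = height + 0.2 * rng.standard_normal(T)
        X.append(time_normalize(track))
        y.append(label)
X, y = np.array(X), np.array(y)
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
tree = DecisionTreeClassifier(max_depth=2).fit(X[:70], y[:70])
acc = tree.score(X[70:], y[70:])
```

A shallow tree like this is also easy to inspect, which mirrors the paper's observation that the learned trees lined up with classical articulatory distinctions.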

    Π-Avida -A Personalized Interactive Audio and Video Portal

    We describe a system for recording, storing, and distributing multimedia data streams. For each modality (audio, speech, video), characteristic features are extracted and used to classify the content into a range of topic categories. Using data-mining techniques, classifier models are learned from training data. These models can assign existing and new multimedia documents to one or several topic categories. We describe the features used as inputs for these classifiers. We demonstrate that the classification of audio material can be improved by using phonemes and syllables instead of words. Finally, we show that categorization performance depends mainly on the quality of speech recognition, and that the simple video features we tested are of only marginal utility.
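    Topic classification over sub-word units, as described above, amounts to treating recognizer output (phoneme or syllable strings rather than words) as text for a standard bag-of-tokens classifier. The sketch below uses invented syllable-like tokens and labels purely for illustration; the portal's actual categories and recognizer are not reproduced here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical recognizer output as syllable strings (made-up tokens):
docs = [
    "fu bal to re li ga spiel tor",        # sports-like document
    "tor spiel li ga fu bal re",
    "bo er se ak ti en kurs di vi",        # finance-like document
    "kurs ak ti en bo er se di vi",
]
labels = ["sport", "sport", "finance", "finance"]

# Bag-of-syllables counts feeding a naive Bayes topic classifier.
clf = make_pipeline(CountVectorizer(analyzer="word"), MultinomialNB())
clf.fit(docs, labels)
pred = clf.predict(["li ga tor spiel"])[0]
```

Because syllables are robust to recognizer word errors, a bag-of-syllables representation can retain topical signal even when the word-level transcript is poor.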

    High Level Speaker Specific Features as an Efficiency Enhancing Parameters in Speaker Recognition System

    In this paper, I present high-level speaker-specific feature extraction, considering intonation, linguistic rhythm, linguistic stress, and prosodic features directly from speech signals. I assume that rhythm is related to language units such as syllables and appears as changes in measurable parameters such as fundamental frequency (F0), duration, and energy. In this work, syllable-type features are selected as the basic unit for expressing the prosodic features. The approximate segmentation of continuous speech into syllable units is achieved by automatically locating the vowel starting point. Knowledge of high-level speaker-specific characteristics is used as a reference for extracting the prosodic features of the speech signal. High-level speaker-specific features extracted with this method may be useful in applications such as speaker recognition, where explicit phoneme/syllable boundaries are not readily available. The efficiency of the specific features used for automatic speaker recognition was evaluated on the TIMIT and HTIMIT corpora, with TIMIT initially sampled at 16 kHz and down-sampled to 8 kHz. In summary, the baseline discriminating system and the HMM system are built on the TIMIT corpus with a set of 48 phonemes. The proposed system shows efficiency improvements of 1.99%, 2.10%, 2.16%, and 2.19% over the traditional system for 16 kHz TIMIT utterances.
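    Locating vowel starting points, as used above for approximate syllable segmentation, is commonly done by thresholding short-time energy: vowels carry most of the energy in a syllable, so rising edges of the energy contour approximate vowel onsets. The sketch below uses a simple fixed threshold on synthetic bursts; the paper's actual onset detector and its parameters are not reproduced here.

```python
import numpy as np

def vowel_onsets(signal, frame=160, hop=80, ratio=0.5):
    """Rough segmentation sketch: mark frames whose short-time
    energy exceeds a fraction of the peak energy, and report the
    first sample of each rising edge as a candidate vowel onset."""
    energy = np.array([np.sum(signal[i:i + frame] ** 2)
                       for i in range(0, len(signal) - frame, hop)])
    voiced = energy > ratio * energy.max()
    edges = np.flatnonzero(voiced[1:] & ~voiced[:-1]) + 1
    if voiced[0]:
        edges = np.concatenate(([0], edges))
    return edges * hop                        # frame index -> sample index

# Synthetic signal: two 'vowels' (sine bursts) separated by silence.
sr = 8000
t = np.arange(sr) / sr
sig = np.zeros(sr)
sig[1000:3000] = np.sin(2 * np.pi * 120 * t[1000:3000])
sig[5000:7000] = np.sin(2 * np.pi * 120 * t[5000:7000])
onsets = vowel_onsets(sig)
```

On real speech the threshold would need to adapt to the noise floor, but the rising-edge idea is the same.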

    Articulatory features for robust visual speech recognition


    Classification of Arabic fricative consonants according to their places of articulation

    Many technology systems use voice recognition applications to transcribe a speaker’s speech into text that these systems can use. One of the most complex tasks in speech identification is knowing which acoustic cues to use to classify sounds. This study presents an approach for characterizing Arabic fricative consonants in two groups (sibilant and non-sibilant). From an acoustic point of view, our approach is based on analyzing the distribution of energy across frequency bands in a syllable of the consonant-vowel type. From a practical point of view, our technique was implemented in MATLAB and tested on a corpus built in our laboratory. The results show that the percentage energy distribution in a speech signal is a very powerful parameter for the classification of Arabic fricatives. We obtained an accuracy of 92% for the non-sibilant consonants /f, χ, ɣ, ʕ, ħ, h/, 84% for the sibilants /s, sˤ, z, ʒ, ʃ/, and an overall classification rate of 89%. Compared to other algorithms based on neural networks and support vector machines (SVM), our classification system provided a higher classification rate.
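    The band-energy cue above (sibilants concentrate energy at high frequencies, non-sibilants do not) reduces to computing the percentage of total spectral energy in each frequency band. The sketch below demonstrates this on synthetic high-pass noise; the 4 kHz band edge is an illustrative choice, not the study's actual band layout.

```python
import numpy as np

def band_energy_percent(signal, sr, bands):
    """Percentage of total spectral energy in each frequency band --
    the kind of cue used to separate sibilant from non-sibilant
    fricatives (band edges here are illustrative)."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    total = spec.sum()
    return [100.0 * spec[(freqs >= lo) & (freqs < hi)].sum() / total
            for lo, hi in bands]

# Synthetic sibilant-like noise: energy concentrated above 4 kHz.
rng = np.random.default_rng(2)
sr, n = 16000, 4096
noise = rng.standard_normal(n)
spec = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(n, 1.0 / sr)
spec[freqs < 4000] *= 0.1                 # damp the low frequencies
sibilant = np.fft.irfft(spec, n)
low, high = band_energy_percent(sibilant, sr, [(0, 4000), (4000, 8000)])
```

A simple threshold on these percentages already separates the two synthetic classes; the study's classifier operates on the same kind of per-band percentages.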