
    Integrated speech and morphological processing in a connectionist continuous speech understanding for Korean

    A new tightly coupled speech and natural language integration model is presented for a TDNN-based continuous, possibly large-vocabulary speech recognition system for Korean. Unlike popular n-best techniques developed for integrating mainly HMM-based speech recognition and natural language processing at the word level, which are clearly inadequate for morphologically complex agglutinative languages, our model constructs a spoken language system based on morpheme-level speech and language integration. With this integration scheme, the spoken Korean processing engine (SKOPE) is designed and implemented using a TDNN-based diphone recognition module integrated with Viterbi-based lexical decoding and symbolic phonological/morphological co-analysis. Our experimental results show that speaker-dependent continuous eojeol (Korean word) recognition and integrated morphological analysis can be achieved with an over 80.6% success rate directly from speech inputs for middle-level vocabularies. Comment: LaTeX source with a4 style, 15 pages, to be published in the Computer Processing of Oriental Languages journal
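    The contrast with word-level n-best integration can be made concrete with a toy decoder. The sketch below is a minimal illustration, not the SKOPE implementation: frame-wise phone scores from an acoustic front end are matched against a small, invented morpheme lexicon, and dynamic programming over morpheme boundaries recovers the best-scoring morpheme sequence. The phone set, lexicon entries, and scores are all hypothetical.

    import numpy as np

    PHONES = ["h", "a", "k", "kyo", "e", "s", "eo"]          # toy phone inventory
    P = {p: i for i, p in enumerate(PHONES)}

    # Hypothetical morpheme lexicon: surface form -> phone sequence.
    LEXICON = {
        "hakkyo": ["h", "a", "k", "kyo"],   # noun stem ("school")
        "e": ["e"],                         # locative particle
        "seo": ["s", "eo"],                 # another particle
    }

    def align_score(frame_scores, phones, start):
        """Score a phone sequence against frames, one frame per phone
        (a toy stand-in for within-morpheme Viterbi alignment)."""
        end = start + len(phones)
        if end > len(frame_scores):
            return None, None
        score = sum(frame_scores[start + i][P[ph]] for i, ph in enumerate(phones))
        return score, end

    def decode(frame_scores):
        """Dynamic programming over morpheme boundaries: best[t] holds the
        best (score, morpheme sequence) covering frames [0, t)."""
        T = len(frame_scores)
        best = {0: (0.0, [])}
        for t in range(T):
            if t not in best:
                continue
            base, seq = best[t]
            for morph, phones in LEXICON.items():
                s, end = align_score(frame_scores, phones, t)
                if s is None:
                    continue
                cand = (base + s, seq + [morph])
                if end not in best or cand[0] > best[end][0]:
                    best[end] = cand
        return best.get(T)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        target = ["h", "a", "k", "kyo", "e"]                  # fake frames for "hakkyo" + "e"
        frames = rng.normal(0.0, 0.1, size=(len(target), len(PHONES)))
        for i, ph in enumerate(target):
            frames[i][P[ph]] += 1.0                           # peak the true phone per frame
        print(decode(frames))                                 # expect (score, ['hakkyo', 'e'])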

    Vowel synthesis using feed-forward neural networks


    On Using Backpropagation for Speech Texture Generation and Voice Conversion

    Inspired by recent work on neural network image generation that relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching of neuron activation statistics between source and target utterances. Similar to image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice. Comment: Accepted to ICASSP 2018
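    The core mechanism, optimizing the waveform itself so that activation statistics inside a recognition network match those of a target utterance, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's system: the "recognizer" is an untrained Conv1d stack standing in for a trained CTC model, the feature extractor is a plain log-magnitude spectrogram rather than a mel filterbank, and all hyperparameters are invented.

    import torch

    def log_spectrogram(wave, n_fft=400, hop=160):
        """Differentiable log-magnitude spectrogram (mel filtering omitted for brevity)."""
        window = torch.hann_window(n_fft, device=wave.device)
        spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                          return_complex=True)
        return torch.log(spec.abs() + 1e-6)

    # Stand-in for a trained convolutional CTC acoustic model (untrained here).
    recognizer = torch.nn.Sequential(
        torch.nn.Conv1d(201, 64, kernel_size=5, padding=2),
        torch.nn.ReLU(),
        torch.nn.Conv1d(64, 64, kernel_size=5, padding=2),
        torch.nn.ReLU(),
    )
    for p in recognizer.parameters():
        p.requires_grad_(False)                     # network stays fixed; only the waveform changes

    def activation_stats(wave):
        feats = log_spectrogram(wave).unsqueeze(0)  # (1, freq, time)
        acts = recognizer(feats)
        return acts.mean(dim=-1), acts.std(dim=-1)  # per-channel statistics over time

    target = torch.randn(16000)                     # 1 s of "target speaker" audio (random stand-in)
    with torch.no_grad():
        t_mean, t_std = activation_stats(target)

    wave = torch.randn(16000, requires_grad=True)   # waveform being synthesized
    opt = torch.optim.Adam([wave], lr=1e-2)
    for step in range(200):
        opt.zero_grad()
        mean, std = activation_stats(wave)
        loss = (mean - t_mean).pow(2).mean() + (std - t_std).pow(2).mean()
        loss.backward()                             # gradients flow back to the waveform samples
        opt.step()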

    Aspects of Application of Neural Recognition to Digital Editions

    Artificial neural networks (ANN) are widely used in software systems that require solutions to problems without a traditional algorithmic approach, as in character recognition: ANN learn by example, so they require a consistent and well-chosen set of samples in order to be trained to recognize their patterns. The network is taught to react with high activity in some of its output neurons whenever an input sample belonging to a specified class (e.g. a letter shape) is presented, and it acquires the ability to assess the similarity of samples it has never encountered before. Typical OCR applications thus require a significant amount of preprocessing of such samples, such as resizing images and removing all the "noise" data so that the letter contours emerge clearly from the background. Furthermore, a huge number of samples is usually required to train a network effectively to recognize a character against all the others. This may represent an issue for palaeographical applications because of the relatively low quantity and high complexity of the digital samples available, and it poses even more problems when our aim is detecting subtle differences (e.g. the special shape of a specific letter from a well-defined period and scriptorium). It would probably be wiser for scholars to define guidelines for extracting from the samples the features they consider most relevant to their purposes, and to let the network deal with just a subset of the overwhelming amount of detailed nuances available.
    ANN are not magic, and it always falls to the careful judgement of scholars to provide a theoretical foundation for any computer-based tool they might want to use to help them solve their problems: we can easily illustrate this point with examples drawn from any other application of IT to the humanities. Just as we can expect no magic in detecting alliterations in a text if we simply feed a system with a collection of letters, we cannot claim that a neural recognition system will perform well on a relatively small sample where each shape is fed as it is, without instructing the system about the features scholars define as relevant. Even before any ANN implementation, it is exactly this theoretical background that must be put to the test when planning such systems
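    The workflow the abstract argues for, training on a handful of scholar-defined features per glyph rather than on raw images, can be sketched as follows. Everything here is hypothetical: the feature names (ascender-height ratio, bowl width, stroke slant, pen-lift count), the values, and the tiny network are illustrative stand-ins, not a palaeographical tool.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Hypothetical scholar-defined features per letter sample:
    # ascender-height ratio, bowl width, stroke slant (degrees), pen-lift count.
    X = np.array([
        [0.82, 0.35, 12.0, 2],   # samples of letter shape 'a'
        [0.80, 0.37, 14.0, 2],
        [0.55, 0.60,  3.0, 3],   # samples of letter shape 'd'
        [0.53, 0.58,  2.0, 3],
    ])
    y = ["a", "a", "d", "d"]

    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(clf.predict([[0.81, 0.36, 13.0, 2]]))    # expected: ['a']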

    Speech intelligibility estimation via neural networks

    Current methods of speech intelligibility estimation rely on the subjective judgements of trained listeners. Accurate and unbiased intelligibility estimates have a number of procedural and/or methodological constraints, including the necessity for large pools of listeners and a wide variety of stimulus materials. Recent research findings, however, have shown a strong relationship between speech intelligibility estimates and selected acoustic speech parameters which appear to determine the intelligibility of speech. These findings suggest that such acoustic speech parameters could be used for computer-based speech intelligibility estimation, obviating the procedural and methodological constraints typically associated with such estimates. The relationship between speech intelligibility estimates and acoustic speech parameters is complex and nonlinear in nature. Artificial neural networks have proven capable, in general speech recognition, of dealing with complex and unspecified nonlinear relationships. The purpose of this study was to explore the possibility of using artificial neural networks to make speech intelligibility estimates. Sixty hearing-impaired speakers, whose measured speech intelligibility ranged from 0 to 99%, were used as subjects in this study. In addition to measuring speech intelligibility, the speech of these subjects was digitally analyzed to obtain 6 acoustic speech parameters that have been found to critically differentiate English phonemes. The subjects were divided into two subgroups. One of the subgroups was used to train a variety of back-propagation neural networks and the other was used to test the ability of the neural networks to make accurate speaker-independent speech intelligibility estimates. The artificial neural network that seemed to be the most efficacious for making speaker-independent speech intelligibility estimates employed a bipolar squash function and scaled values of the speech parameters. Compared to listener judgements, the overall accuracy of the network's speech intelligibility estimates was a respectable 83%. These findings suggest that with expanded subject populations and more acoustic speech parameters it might be possible to create a practical computer-based tool capable of objectively determining speech intelligibility
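    The modelling setup described above, a back-propagation network with a bipolar (tanh-like) squashing function mapping scaled acoustic parameters to an intelligibility score, trained on one speaker subgroup and tested on the other, can be roughly sketched as below. The data are synthetic stand-ins for the six acoustic parameters and measured intelligibility scores; nothing here reproduces the study's actual network or results.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(60, 6))                              # 60 speakers x 6 acoustic parameters (synthetic)
    y = np.clip(100 * X.mean(axis=1) + rng.normal(0, 5, 60), 0, 99)   # synthetic intelligibility scores (%)

    scaler = StandardScaler().fit(X[:30])                      # scaled parameter values, as in the study
    net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",   # tanh as a bipolar squash function
                       max_iter=5000, random_state=0)
    net.fit(scaler.transform(X[:30]), y[:30])                  # train on one speaker subgroup
    print(net.score(scaler.transform(X[30:]), y[30:]))         # evaluate on the held-out subgroup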

    The Microsoft 2017 Conversational Speech Recognition System

    We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog-session-aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.
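    The two-stage combination can be illustrated schematically: first, senone posteriors from a subset of acoustic models are averaged frame by frame; second, word hypotheses from several systems are combined by voting. The sketch below is a rough stand-in under simplifying assumptions, not the Microsoft system: real confusion-network combination aligns lattices or N-best lists rather than assuming equal-length word sequences, and all scores and hypotheses here are invented.

    import numpy as np
    from collections import Counter

    def combine_frame_level(posteriors, weights=None):
        """Stage 1: weighted average of senone posteriors from several acoustic models.
        posteriors: list of (T, n_senones) arrays, one per model."""
        stack = np.stack(posteriors)                           # (n_models, T, n_senones)
        if weights is None:
            weights = [1.0 / len(posteriors)] * len(posteriors)
        return np.tensordot(weights, stack, axes=1)            # (T, n_senones)

    def vote_word_level(hypotheses):
        """Stage 2: position-wise voting over aligned word hypotheses (a stand-in
        for confusion-network combination, which aligns lattices in practice)."""
        return [Counter(words).most_common(1)[0][0] for words in zip(*hypotheses)]

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        frame_combined = combine_frame_level([rng.dirichlet(np.ones(3), 4) for _ in range(2)])
        print(frame_combined.shape)                            # (4, 3)
        print(vote_word_level([["so", "what", "do", "you", "think"],
                               ["so", "what", "to", "you", "think"],
                               ["so", "what", "do", "you", "thing"]]))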

    Morphological Analysis as Classification: an Inductive-Learning Approach

    Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find a most probable analysis. In contrast, we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm IB1-IG performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has the strong advantages over the traditional approach of avoiding the knowledge-acquisition bottleneck, being fast and deterministic in learning and processing, and being language-independent. Comment: 11 pages, 5 encapsulated postscript figures, uses non-standard NeMLaP proceedings style nemlap.sty; inputs ipamacs (international phonetic alphabet) and epsf macros
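    Recasting morphological analysis as classification can be made concrete with a toy segmenter: each character, together with a fixed window of left and right context, is labelled as "boundary" or "no boundary", and a memory-based (k-nearest-neighbour) learner in the spirit of IB1 predicts labels for unseen words. The sketch below is illustrative only; the training words are invented and the information-gain feature weighting of IB1-IG is omitted.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import OneHotEncoder

    def windows(word, size=2, pad="_"):
        """Fixed-width character windows: one instance per character position."""
        padded = pad * size + word + pad * size
        return [list(padded[i:i + 2 * size + 1]) for i in range(len(word))]

    def boundaries(marked):
        """Return the unmarked word and the positions followed by a morpheme boundary."""
        word, cuts = "", set()
        for ch in marked:
            if ch == "|":
                cuts.add(len(word) - 1)
            else:
                word += ch
        return word, cuts

    train = ["walk|ing", "talk|ed", "play|ing", "play|er", "jump|ed"]   # toy training data
    X, y = [], []
    for item in train:
        word, cuts = boundaries(item)
        X += windows(word)
        y += [i in cuts for i in range(len(word))]

    enc = OneHotEncoder(handle_unknown="ignore").fit(X)
    knn = KNeighborsClassifier(n_neighbors=1).fit(enc.transform(X), y)  # IB1-style memory-based learner

    test = "jumping"
    pred = knn.predict(enc.transform(windows(test)))
    print([test[:i + 1] + "|" for i, p in enumerate(pred) if p])        # e.g. ['jump|']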

    Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks

    Abstract. In this paper we show that neural networks for speech recognition can be constructed in a modular fashion by exploiting the hidden structure of previously trained phonetic subcategory networks. The performance of the resulting larger phonetic nets was found to be as good as the performance of the subcomponent nets by themselves. This approach avoids the excessive learning times that would be necessary to train larger networks and allows for incremental learning. Large time-delay neural networks constructed incrementally by applying these modular training techniques achieved a recognition performance of 96.0% for all consonants and 94.7% for all phonemes.

    Introduction. Recently we have demonstrated that connectionist architectures capable of capturing some critical aspects of the dynamic nature of speech can achieve superior recognition performance for difficult but small phonemic discrimination tasks, such as discrimination of the voiced consonants B, D and G [1, 2]. Encouraged by these results, we wanted to explore the question of how we might expand on these models to make them useful for the design of speech recognition systems. A problem that emerges as we attempt to apply neural network models to the full speech recognition problem is the problem of scaling. Simply extending neural networks to ever larger structures and retraining them as one monolithic net quickly exceeds the capabilities of the fastest and largest supercomputers. The search complexity of finding good solutions in a huge space of possible network configurations also soon assumes unmanageable proportions. Moreover, having to decide on all possible classes for recognition ahead of time, as well as collecting sufficient data to train such a large monolithic network, is impractical to say the least. In an effort to extend our models from small recognition tasks to large-scale speech recognition systems, we must therefore explore modularity and incremental learning as design strategies to break up a large learning task into smaller subtasks. Breaking up a large task into subtasks to be tackled by individual black boxes interconnected in ad hoc arrangements, on the other hand, would mean abandoning one of the most attractive aspects of connectionism: the ability to perform complex constraint satisfaction in a massively parallel and interconnected fashion, in view of an overall optimal performance goal. In this paper we demonstrate, based on a set of experiments aimed at phoneme recognition, that it is indeed possible to construct large neural networks incrementally by exploiting the hidden structure of smaller pretrained subcomponent networks.

    Small Phonemic Classes by Time-Delay Neural Networks. In our previous work we proposed a Time-Delay Neural Network architecture (as shown on the left of the figure). Its multilayer architecture, its shift-invariance and the time-delayed connections of its units all contributed to its performance by allowing the net to develop complex, non-linear decision surfaces and insensitivity to misalignments, and by incorporating contextual information into decision making (see [1, 2] for detailed analysis and discussion). It is trained by the back-propagation procedure [3] using shared weights for different time-shifted positions of the net [1, 2]. In spirit it has similarities to other models recently proposed [4, 5]. This network, however, had only been trained for the voiced stops B, D and G, and we began our extensions by training similar networks for the other phonemic classes in our database. All phoneme tokens in our experiments were extracted using phonetic hand-labels from a large-vocabulary database of 5240 common Japanese words. Each word in the database was spoken in isolation by one male native Japanese speaker. All utterances were recorded in a sound-proof booth and digitized at a 12 kHz sampling rate. The database was then split into a training set and a testing set of 2620 utterances each. A 150 msec range around a phoneme boundary was excised for each phoneme token and 1
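    The modular construction idea, reusing the hidden structure of pretrained subcategory nets and training only a small combination stage on top, can be sketched with time-delay layers (1-D convolutions over time with weights shared across shifted positions) as below. This is a minimal illustration, not the paper's architecture: the two sub-networks are merely assumed pre-trained (here they are random), and the layer sizes, glue layer, and six-way B/D/G/P/T/K output are invented for the example.

    import torch
    import torch.nn as nn

    def tdnn_block(n_mel=16, hidden=8):
        """TDNN-style sub-network: 1-D convolutions over time, i.e. time-delayed,
        weight-shared connections across shifted positions."""
        return nn.Sequential(
            nn.Conv1d(n_mel, hidden, kernel_size=3),   # 3-frame time-delay window
            nn.Tanh(),
            nn.Conv1d(hidden, hidden, kernel_size=5),  # wider temporal context
            nn.Tanh(),
        )

    bdg_net = tdnn_block()       # assumed pre-trained on B/D/G (random here)
    ptk_net = tdnn_block()       # assumed pre-trained on P/T/K (random here)
    for p in list(bdg_net.parameters()) + list(ptk_net.parameters()):
        p.requires_grad_(False)  # keep the sub-networks' hidden structure fixed

    glue = nn.Sequential(        # only this combination stage is trained
        nn.Conv1d(16, 6, kernel_size=1),
        nn.AdaptiveAvgPool1d(1), # integrate evidence over time
        nn.Flatten(),
    )

    x = torch.randn(4, 16, 15)   # batch of 4 phoneme tokens, 16 mel bands, 15 frames
    combined = torch.cat([bdg_net(x), ptk_net(x)], dim=1)   # (4, 16, 9) hidden activations
    logits = glue(combined)                                  # (4, 6) scores over B, D, G, P, T, K
    loss = nn.functional.cross_entropy(logits, torch.tensor([0, 3, 2, 5]))
    loss.backward()              # gradients reach only the glue layer; sub-networks stay intact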