
    Voicing classification of visual speech using convolutional neural networks

    The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker-dependent and speaker-independent scenarios. A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker-dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using the system are further improved to 88% and 98% respectively. The speaker-independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
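    The image-based 2D-DCT features used in the baseline can be sketched as follows: take the 2D discrete cosine transform of a grayscale mouth region-of-interest and keep a small block of low-frequency coefficients as the per-frame feature vector. This is a minimal illustration, not the paper's exact pipeline; the function name, ROI size, and coefficient count are assumptions.

```python
import numpy as np
from scipy.fft import dctn

def dct2_features(roi, n_coeffs=8):
    """2D-DCT visual speech features (illustrative sketch).

    Takes a grayscale mouth ROI, computes the orthonormal 2D-DCT,
    and keeps the top-left n_coeffs x n_coeffs low-frequency block,
    flattened into a feature vector.
    """
    coeffs = dctn(roi.astype(float), norm="ortho")
    return coeffs[:n_coeffs, :n_coeffs].ravel()

# Example on a synthetic 32x32 "frame"
rng = np.random.default_rng(0)
frame = rng.random((32, 32))
feat = dct2_features(frame)
print(feat.shape)  # (64,)
```

    Vectors like this would then be fed per frame to the GMM or single-layer network classifiers the abstract compares against the end-to-end CNN.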

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady-State Vowel Categorization

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
    National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)
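    The strip-map/ART model itself is beyond a short snippet, but the task it is evaluated on, categorizing steady-state vowels from their formant frequencies, can be illustrated with a toy nearest-centroid classifier. The centroid values below are only loosely in the range of published male-speaker formant averages, and the log-frequency distance is a crude stand-in for speaker normalization; everything here is an illustrative assumption, not the paper's model or the Peterson and Barney data.

```python
import numpy as np

# Illustrative (F1, F2) centroids in Hz for three vowels; not real data.
CENTROIDS = {
    "i": (270, 2290),  # as in "heed"
    "a": (730, 1090),  # as in "hod"
    "u": (300, 870),   # as in "who'd"
}

def classify_vowel(f1, f2):
    """Nearest-centroid vowel categorization on log formant frequencies.

    Working in log frequency crudely mimics the scale normalization a
    speaker-normalization stage would provide before categorization.
    """
    p = np.log([f1, f2])
    dists = {v: np.linalg.norm(p - np.log(c)) for v, c in CENTROIDS.items()}
    return min(dists, key=dists.get)

print(classify_vowel(280, 2250))  # "i"
```

    In the paper, the categorization stage is handled by Adaptive Resonance Theory circuits rather than fixed centroids, which lets categories be learned and stably remembered.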

    Gone in Sixty Milliseconds: Trademark Law and Cognitive Science

    Trademark dilution is a cause of action for interfering with the uniqueness of a trademark. For example, consumers would probably not think that Kodak soap was produced by the makers of Kodak cameras, but its presence in the market would diminish the uniqueness of the original Kodak mark. Trademark owners think dilution is harmful but have had difficulty explaining why. Many courts have therefore been reluctant to enforce dilution laws, even while legislatures have enacted more of them over the past half century. Courts and commentators have now begun to use psychological theories, drawing on associationist models of cognition, to explain how a trademark can be harmed by the existence of similar marks even when consumers can readily distinguish the marks from one another and thus are not confused. Though the cognitive theory of dilution is internally consistent and appeals to the authority of science, it does not rest on sufficient empirical evidence to justify its adoption. Moreover, the harms it identifies do not generally come from commercial competitors but from free speech about trademarked products. As a result, even a limited dilution law should be held unconstitutional under current First Amendment commercial-speech doctrine. In the absence of constitutional invalidation, the cognitive explanation of dilution is likely to change the law for the worse. Rather than working like fingerprint evidence--which ideally produces more evidence about already-defined crimes--psychological explanations of dilution are more like economic theories in antitrust, which changed the definition of actionable restraints of trade. Given the empirical and normative flaws in the cognitive theory, using it to fill dilution's theoretical vacuum would be a mistake.

    Automatic Quality Estimation for ASR System Combination

    Recognizer Output Voting Error Reduction (ROVER) has been widely used for system combination in automatic speech recognition (ASR). In order to select the most appropriate words to insert at each position in the output transcriptions, some ROVER extensions rely on critical information such as confidence scores and other ASR decoder features. This information, which is not always available, highly depends on the decoding process and sometimes tends to overestimate the real quality of the recognized words. In this paper we propose a novel variant of ROVER that takes advantage of ASR quality estimation (QE) for ranking the transcriptions at "segment level" instead of: i) relying on confidence scores, or ii) feeding ROVER with randomly ordered hypotheses. We first introduce an effective set of features to compensate for the absence of ASR decoder information. Then, we apply QE techniques to perform accurate hypothesis ranking at segment level before starting the fusion process. The evaluation is carried out on two different tasks, in which we respectively combine hypotheses coming from independent ASR systems and multi-microphone recordings. In both tasks, it is assumed that the ASR decoder information is not available. The proposed approach significantly outperforms standard ROVER and is competitive with two strong oracles that exploit prior knowledge about the real quality of the hypotheses to be combined. Compared to standard ROVER, the absolute WER improvements in the two evaluation scenarios range from 0.5% to 7.3%.
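    The core ROVER idea, voting over aligned hypotheses, can be sketched in a few lines. Real ROVER first builds a word transition network by dynamic-programming alignment; the sketch below assumes hypotheses are already position-aligned (with "@" marking deletions), and uses an optional ranking, the role the paper's QE component plays, to break voting ties in favour of the better-ranked system. Function names and the "@" convention are illustrative assumptions.

```python
from collections import Counter

def rover_vote(aligned_hyps, ranking=None):
    """Majority vote over position-aligned ASR hypotheses.

    aligned_hyps: list of equal-length token lists ("@" = deletion).
    ranking: optional list of hypothesis indices, best first; ties
    are broken in favour of the higher-ranked hypothesis.
    """
    order = list(ranking) if ranking else list(range(len(aligned_hyps)))
    out = []
    for pos in range(len(aligned_hyps[0])):
        votes = Counter(h[pos] for h in aligned_hyps)
        best = max(votes.values())
        # Among tied words, prefer the one from the best-ranked system.
        winner = next(aligned_hyps[i][pos] for i in order
                      if votes[aligned_hyps[i][pos]] == best)
        if winner != "@":
            out.append(winner)
    return out

hyps = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "the bat sat on a mat".split(),
]
print(" ".join(rover_vote(hyps)))  # the cat sat on a mat
```

    In the paper's setting, the QE-predicted segment-level quality would supply the `ranking`, replacing the decoder confidence scores that standard ROVER extensions depend on.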

    Acceptance of mobile services - insights from the Swedish market for mobile telephony

    The main purpose of the paper is to investigate young people's perspectives on mobile services in order to shed light on the acceptance of mobile services. The knowledge of and interest in mobile services of individuals using such services is analyzed. A second objective is to investigate the reasons for using or not using mobile services. In-depth focus group interviews and secondary empirical data provide the main data. Concerning the youth's general knowledge of and interest in mobile services, the results point to six things: young people show a low demand for many mobile services, there is a demand for extended, established mobile services like SMS, interest in the new services varies, there is low interest in active information search, there is little knowledge of the enabling technology, and the understanding of the pricing is generally low. As concerns reasons for and against usage of mobile services, results point to four central aspects: many individuals could present clearly defined needs for certain services, many indicated an interest in "community usage" of mobile services, they perceived the prices of mobile services as a hindrance to usage, and technology placed limitations on the usage. The paper discusses practical implications for the acceptance of mobile services.
    Keywords: Mobile services; mobility; focus groups; telecommunications; wireless; knowledge

    DNN adaptation by automatic quality estimation of ASR hypotheses

    In this paper we propose to exploit the automatic Quality Estimation (QE) of ASR hypotheses to perform the unsupervised adaptation of a deep neural network modeling acoustic probabilities. Our hypothesis is that significant improvements can be achieved by: i) automatically transcribing the evaluation data we are currently trying to recognise, and ii) selecting from it a subset of "good quality" instances based on the word error rate (WER) scores predicted by a QE component. To validate this hypothesis, we run several experiments on the evaluation data sets released for the CHiME-3 challenge. First, we operate in oracle conditions in which manual transcriptions of the evaluation data are available, thus allowing us to compute the "true" sentence WER. In this scenario, we perform the adaptation with variable amounts of data, which are characterised by different levels of quality. Then, we move to realistic conditions in which the manual transcriptions of the evaluation data are not available. In this case, the adaptation is performed on data selected according to the WER scores "predicted" by a QE component. Our results indicate that: i) QE predictions allow us to closely approximate the adaptation results obtained in oracle conditions, and ii) the overall ASR performance based on the proposed QE-driven adaptation method is significantly better than the strong, most recent, CHiME-3 baseline.
    Comment: Computer Speech & Language, December 201
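    The data-selection step the abstract describes, keeping only automatically transcribed segments whose QE-predicted sentence WER is good enough to adapt on, reduces to a simple filter. The function name, data layout, and the 0.15 threshold below are illustrative assumptions, not values from the paper.

```python
def select_adaptation_data(hypotheses, predicted_wer, threshold=0.15):
    """Select 'good quality' automatic transcriptions for adaptation.

    hypotheses: list of (segment_id, transcript) pairs from the ASR system.
    predicted_wer: dict mapping segment_id -> QE-predicted sentence WER
    in [0, 1] (no manual references needed).
    Returns the subset whose predicted WER is below the threshold.
    """
    return [(sid, text) for sid, text in hypotheses
            if predicted_wer[sid] < threshold]

hyps = [("s1", "turn the lights on"), ("s2", "tern the light song")]
pred = {"s1": 0.05, "s2": 0.42}
print(select_adaptation_data(hyps, pred))  # [('s1', 'turn the lights on')]
```

    In oracle conditions the same filter would use the true sentence WER computed from manual transcriptions; the paper's finding is that filtering on the QE-predicted scores closely approximates that oracle selection.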