
    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks whose purpose is to extract one (enhancement) or several (separation) target speech signals from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled with signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used in speech enhancement and separation systems. To fuse acoustic and visual information efficiently, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The steady stream of newly proposed techniques for feature extraction and multimodal fusion has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets; and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be applied more or less directly to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, since they are generally used to compare different systems and determine their performance.
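    To make the surveyed pipeline concrete, here is a minimal sketch of a mask-based audio-visual enhancement model of the general kind such systems use: per-frame acoustic features and visual (lip) embeddings are fused by simple concatenation, and a recurrent network predicts a time-frequency mask. This is not a model from the paper; the feature dimensions, layer sizes and concatenation fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AVMaskEstimator(nn.Module):
    """Minimal audio-visual fusion sketch: concatenate per-frame acoustic
    features (e.g. a log-magnitude spectrogram) with per-frame visual
    embeddings (e.g. lip-region encodings) and predict a time-frequency
    mask for the target speaker. All dimensions are illustrative assumptions."""

    def __init__(self, n_freq=257, visual_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + visual_dim, hidden,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, frames, n_freq); visual_feats: (batch, frames, visual_dim)
        fused = torch.cat([audio_feats, visual_feats], dim=-1)  # early (concatenation) fusion
        out, _ = self.rnn(fused)
        return self.mask_head(out)                              # mask in [0, 1] per T-F bin

# Typical training target/objective: apply the mask to the noisy spectrogram and
# minimise a regression loss against the clean target spectrogram (placeholder below).
model = AVMaskEstimator()
noisy = torch.randn(4, 100, 257)   # 4 utterances, 100 frames, 257 frequency bins
lips = torch.randn(4, 100, 512)    # matching visual embeddings
mask = model(noisy, lips)
enhanced = mask * noisy
loss = nn.functional.mse_loss(enhanced, torch.randn(4, 100, 257))  # placeholder clean target
```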

    COMMUNICATING IN SOCIAL NETWORKS: EFFECTS OF REVERBERATION ON ACOUSTIC INFORMATION TRANSFER IN THREE SPECIES OF BIRDS

    In socially and acoustically complex environments, the auditory system processes sounds that are distorted, attenuated and additionally masked by biotic and abiotic noise. As a result, spectral and temporal alterations of the sounds may affect the transfer of information between signalers and receivers in networks of conspecifics, increasing detection thresholds and interfering with the discrimination and recognition of sound sources. To date, much concern has been directed toward anthropogenic noise sources and whether they affect animals' natural territorial and reproductive behavior and ultimately harm the survival of the species. Not much is known, however, about the potentially synergistic effects of environmentally induced sound degradation, masking from noise and competing sound signals, and what implications these interactions have for vocally mediated exchanges in animals. This dissertation describes a series of comparative psychophysical experiments under controlled laboratory conditions that investigate the impact of reverberation on the perception of a range of artificial sounds and natural vocalizations in the budgerigar, canary, and zebra finch. Results suggest that even small reverberation effects could be used to gauge different acoustic environments and to locate a sound source, but that reverberation limits the vocally mediated transfer of important information in social settings, especially when it is paired with noise. Discrimination of similar vocalizations from different individuals is significantly impaired when both reverberation and abiotic noise levels are high, whereas this ability is hardly affected by either factor alone. Similarly, high levels of reverberation combined with biotic noise from signaling conspecifics limit the auditory system's ability to parse a complex acoustic scene by segregating signals from multiple individuals. Important interaction effects like these, caused by the characteristics of the habitat and by species differences in auditory sensitivity, can therefore predict whether a given acoustic environment limits communication range or interferes with the detection, discrimination, and recognition of biologically important sounds.
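    For readers unfamiliar with the degradations at issue, reverberation and noise masking can be approximated in a few lines of signal processing. The sketch below is not the stimulus-generation procedure used in the dissertation; the decay time (RT60) and signal-to-noise ratio values are arbitrary assumptions chosen only to illustrate the two factors.

```python
import numpy as np

def add_reverb_and_noise(clean, fs=44100, rt60=1.0, snr_db=10.0, seed=0):
    """Crude simulation of the degradations studied here: convolve a clean
    vocalization with an exponentially decaying noise tail (reverberation)
    and add broadband noise at a chosen SNR. Parameter values are arbitrary."""
    rng = np.random.default_rng(seed)
    # Exponentially decaying impulse response; the decay rate is set so the
    # envelope drops by 60 dB after rt60 seconds (the usual RT60 definition).
    t = np.arange(int(rt60 * fs)) / fs
    ir = rng.standard_normal(t.size) * 10 ** (-3.0 * t / rt60)
    ir /= np.max(np.abs(ir))
    reverberant = np.convolve(clean, ir)[: clean.size]
    # Scale white noise to the requested signal-to-noise ratio.
    sig_power = np.mean(reverberant ** 2)
    noise = rng.standard_normal(reverberant.size)
    noise *= np.sqrt(sig_power / (10 ** (snr_db / 10.0)) / np.mean(noise ** 2))
    return reverberant + noise

# Example: a 1 s, 2 kHz tone degraded by strong reverberation and moderate noise.
fs = 44100
tone = np.sin(2 * np.pi * 2000 * np.arange(fs) / fs)
degraded = add_reverb_and_noise(tone, fs=fs, rt60=1.5, snr_db=5.0)
```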

    The selective use of gaze in automatic speech recognition

    The performance of automatic speech recognition (ASR) degrades significantly in natural environments compared to laboratory conditions. As a major source of interference, acoustic noise affects speech intelligibility during the ASR process. The noise causes two main problems: first, contamination of the speech signal; second, changes in the speakers' vocal and non-vocal behaviour. These phenomena create a mismatch between the ASR training and recognition conditions, which leads to considerable performance degradation. To improve noise robustness, popular approaches exploit prior knowledge of the acoustic noise in speech enhancement, feature extraction and recognition models. An alternative approach, presented in this thesis, is to introduce eye gaze as an extra modality. Eye gaze behaviours play a role in interaction and carry information about cognition and visual attention, but not all of these behaviours are relevant to speech. Therefore, gaze behaviours are used selectively to improve ASR performance. This is achieved by inference procedures that use noise-dependent models of gaze behaviours and their temporal and semantic relationship with speech. “Selective gaze-contingent ASR” systems are proposed and evaluated on a corpus of eye movements and related speech recorded in a range of clean and noisy environments. The best performing systems utilise both acoustic and language model adaptation.
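    The “selective” idea, weighting gaze evidence more heavily when acoustic conditions are poor and only when the gaze behaviour is judged relevant to speech, can be caricatured as a gated combination of hypothesis scores. The sketch below illustrates that idea only; the weighting function, parameter values and score representation are assumptions, not the thesis's inference procedure.

```python
import numpy as np

def combine_scores(acoustic_logp, gaze_logp, noise_snr_db, relevance):
    """Illustrative 'selective' fusion: interpolate between acoustic-only and
    acoustic+gaze hypothesis scores. The weight on gaze grows as the acoustic
    SNR drops, but is suppressed when the gaze behaviour is judged irrelevant
    to speech (relevance in [0, 1]). All weightings here are assumptions."""
    # Map SNR (dB) to a gaze weight in [0, 1]: rely more on gaze in noise.
    noise_weight = 1.0 / (1.0 + np.exp((noise_snr_db - 5.0) / 3.0))
    w = noise_weight * relevance
    return (1.0 - w) * acoustic_logp + w * (acoustic_logp + gaze_logp)

# Example: re-score two competing word hypotheses in a low-SNR condition.
acoustic = np.array([-12.0, -11.5])   # log scores from the acoustic/language models
gaze = np.array([-1.0, -4.0])         # log scores of gazed-at referents per hypothesis
scores = combine_scores(acoustic, gaze, noise_snr_db=0.0, relevance=0.8)
best_hypothesis = int(np.argmax(scores))
```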

    Physical mechanisms may be as important as brain mechanisms in the evolution of speech [Commentary on Ackermann, Hage, & Ziegler, “Brain mechanisms of acoustic communication in humans and nonhuman primates: An evolutionary perspective”]

    We present two arguments for why physical adaptations for vocalization may be as important as neural adaptations. First, fine control over vocalization is not easy for physical reasons, and modern humans may be exceptional in this respect. Second, we present an example of a gorilla that shows rudimentary voluntary control over vocalization, indicating that some neural control is already shared with great apes.

    Brain mechanisms of acoustic communication in humans and nonhuman primates: An evolutionary perspective

    Any account of “what is special about the human brain” (Passingham 2008) must specify the neural basis of our unique ability to produce speech and delineate how these remarkable motor capabilities could have emerged in our hominin ancestors. Clinical data suggest that the basal ganglia provide a platform for the integration of primate-general mechanisms of acoustic communication with the faculty of articulate speech in humans. Furthermore, neurobiological and paleoanthropological data point to a two-stage model of the phylogenetic evolution of this crucial prerequisite of spoken language: (i) monosynaptic refinement of the projections of motor cortex to the brainstem nuclei that steer laryngeal muscles, presumably as part of a “phylogenetic trend” associated with increasing brain size during hominin evolution; (ii) subsequent vocal-laryngeal elaboration of cortico-basal ganglia circuitries, driven by human-specific FOXP2 mutations. This concept implies vocal continuity of spoken language evolution at the motor level, elucidating the deep entrenchment of articulate speech into a “nonverbal matrix” (Ingold 1994), which is not accounted for by gestural-origin theories. Moreover, it provides a solution to the question of the adaptive value of the “first word” (Bickerton 2009), since even the earliest and most simple verbal utterances must have increased the versatility of vocal displays afforded by the preceding elaboration of monosynaptic corticobulbar tracts, giving rise to enhanced social cooperation and prestige. At the ontogenetic level, the proposed model assumes age-dependent interactions between the basal ganglia and their cortical targets, similar to vocal learning in some songbirds. In this view, the emergence of articulate speech builds on the “renaissance” of an ancient organizational principle and, hence, may represent an example of “evolutionary tinkering” (Jacob 1977).

    Neural representation of speech segmentation and syntactic structure discrimination


    Control of Vocal Production in Budgerigars (Melopsittacus undulatus)

    Budgerigars engage in dynamic vocal interactions with conspecifics, learn their vocalizations in a rich social environment, and rely to some extent on auditory feedback to acquire and maintain normal vocal output. However, little is known about the exact role of sensory input and sensory feedback in the control of vocal production in these birds. For example, we know that these birds learn best in a social environment that contains both auditory and visual information, yet we know very little about how this information guides and influences vocal production. Although we suspect that budgerigars rely on auditory feedback for the learning and maintenance of vocal behavior, we do not know whether there are refined, compensatory feedback mechanisms similar to those of humans. Finally, we do not know whether, or to what extent, calls can be modified in structure during learning. This dissertation describes a series of experiments that use more highly controlled and regimented conditions than previous studies with songbirds to investigate the control of vocal production in budgerigars and to provide a more detailed description of some of the mechanisms underlying vocal learning in this species.

    Processing of Degraded Speech in Brain Disorders

    The speech we hear every day is typically “degraded” by competing sounds and by the idiosyncratic vocal characteristics of individual speakers. While the comprehension of degraded speech is normally automatic, it depends on dynamic and adaptive processing across distributed neural networks. This presents the brain with an immense computational challenge, making degraded speech processing vulnerable to a range of brain disorders. It is therefore likely to be a sensitive marker of neural circuit dysfunction and an index of retained neural plasticity. After considering experimental methods for studying degraded speech and the factors that affect its processing in healthy individuals, we review the evidence for altered degraded speech processing in major neurodegenerative diseases, traumatic brain injury and stroke. We develop a predictive coding framework for understanding deficits of degraded speech processing in these disorders, focussing on the “language-led dementias” (the primary progressive aphasias). We conclude by considering prospects for using degraded speech as a probe of language network pathophysiology, a diagnostic tool and a target for therapeutic intervention.
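    The predictive coding framework invoked here is commonly formalised as iterative minimisation of precision-weighted prediction error, with lower sensory precision (as for degraded input) down-weighting the corrective updates. The toy update below is a generic illustration of that formalism, not a model from the review; the dimensions, learning rate and precision value are assumptions.

```python
import numpy as np

def predictive_coding_step(x, mu, W, precision, lr=0.02):
    """One generic predictive-coding update: the sensory input x (e.g. a frame of
    degraded speech features) is compared with the top-down prediction W @ mu, and
    the latent estimate mu is nudged to reduce the precision-weighted error.
    Lower precision (noisier, more degraded input) yields smaller corrections."""
    error = x - W @ mu                           # bottom-up prediction error
    mu = mu + lr * precision * (W.T @ error)     # gradient step on the latent cause
    return mu, error

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))                 # assumed generative (top-down) weights
true_cause = np.array([1.0, -0.5, 0.2, 0.0])
x = W @ true_cause + 0.3 * rng.standard_normal(16)  # "degraded" sensory input
mu = np.zeros(4)                                 # initial latent estimate
for _ in range(200):
    mu, err = predictive_coding_step(x, mu, W, precision=0.5)
print(mu, np.sum(err ** 2))                      # mu approaches the least-squares cause
```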

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The MAVEBA Workshop proceedings, issued every two years, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images as a support to the clinical diagnosis and classification of vocal pathologies.

    Functional neuroimaging of human vocalizations and affective speech

    Neuroimaging studies have verified the important integrative role of the basal ganglia during affective vocalizations. However, they also point to additional regions supporting vocal monitoring, auditory-motor feedback processing, and online adjustments of vocal motor responses. For the case of affective vocalizations, we suggest partly extending the model to fully consider the link between primate-general and human-specific neural components.