
    Voice activity detection in eco-acoustic data enables privacy protection and is a proxy for human disturbance

    1. Eco-acoustic monitoring is increasingly being used to map biodiversity across large scales, yet little thought is given to the privacy concerns and potential scientific value of inadvertently recorded human speech. Automated speech detection is possible using voice activity detection (VAD) models, but it is not clear how well these perform in diverse natural soundscapes. In this study we present the first evaluation of VAD models for anonymization of eco-acoustic data and demonstrate how speech detection frequency can be used as one potential measure of human disturbance.
    2. We first generated multiple synthetic datasets using different data preprocessing techniques to train and validate deep neural network models. We evaluated the performance of our custom models against existing state-of-the-art VAD models using playback experiments with speech samples from a man, a woman and a child. Finally, we collected long-term data from a Norwegian forest heavily used for hiking to evaluate the ability of the models to detect human speech and quantify a proxy for human disturbance in a real monitoring scenario.
    3. In playback experiments, all models could detect human speech with high accuracy at distances where the speech was intelligible (up to 10 m). We showed that training models using location-specific soundscapes in the data preprocessing step resulted in a slight improvement in model performance. Additionally, we found that the number of speech detections correlated with peak traffic hours (using bus timings), demonstrating how VAD can be used to derive a proxy for human disturbance with fine temporal resolution.
    4. Anonymizing audio data effectively using VAD models will allow eco-acoustic monitoring to continue to deliver invaluable ecological insight at scale, while minimizing the risk of data misuse. Furthermore, using speech detections as a proxy for human disturbance opens new opportunities for eco-acoustic monitoring to shed light on nuanced human–wildlife interactions.
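    As a rough illustration of the anonymization step described above, the sketch below mutes detected speech segments in a recording using the open-source Silero VAD model rather than the authors' custom networks; the file name and 16 kHz sampling rate are assumptions for the example.

```python
# Hypothetical sketch: anonymize an eco-acoustic recording by muting
# detected speech, using the open-source Silero VAD model (not the
# paper's custom networks).
import torch

model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, *_ = utils

SR = 16000  # assumed sampling rate; Silero VAD supports 8 and 16 kHz
wav = read_audio('forest_recording.wav', sampling_rate=SR)  # hypothetical file

# Detect speech segments (as start/end sample indices), then zero them out.
segments = get_speech_timestamps(wav, model, sampling_rate=SR)
for seg in segments:
    wav[seg['start']:seg['end']] = 0.0

# Counting detections per time window gives the kind of fine-grained
# human-disturbance proxy the abstract describes.
print(f"{len(segments)} speech segments muted")
```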

    Visual Voice Activity Detection in the Wild


    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.
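    Since the review anchors its history in the MFCC/HMM era, a minimal sketch of that classical front end may be useful; it uses librosa, and the file name is illustrative.

```python
# Sketch: the classical MFCC front end the review cites as a starting point.
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)     # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

# Delta (first-derivative) features were the usual companion to MFCCs
# in HMM-era recognizers.
delta = librosa.feature.delta(mfcc)
print(mfcc.shape, delta.shape)
```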

    Artificial generation of partial discharge sources through an algorithm based on deep convolutional generative adversarial networks.

    The measurement of partial discharges (PD) in electrical equipment or machines subjected to high voltage is one of the most important indicators when assessing the state of an insulation system. One of the main challenges in monitoring these degradation phenomena is to measure a statistically significant number of signals from each of the sources acting on the asset under test. However, in industrial environments the presence of large-amplitude noise sources, or of multiple simultaneous PD sources, may limit the acquisition of the signals, and the final diagnosis of the equipment status may therefore not be accurate. Although different separation and identification techniques have been implemented with very good results, the lack of a significant number of PD pulses associated with each source can limit the effectiveness of these procedures. Accordingly, this research proposes a new algorithm for the artificial generation of PD, based on a deep convolutional generative adversarial network (DCGAN) architecture, which generates PD from different sources using a small set of real PD pulses, in order to augment those sources that were poorly represented during the measurement. Across different experiments, the temporal and spectral behavior of the artificially generated PD sources proved to be similar to that of real, experimentally obtained sources.
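    To make the architecture concrete, here is a minimal 1-D DCGAN-style generator in PyTorch for pulse-like waveforms; the layer sizes, latent dimension and 1024-sample output length are illustrative assumptions, not the authors' configuration.

```python
# Sketch: a 1-D DCGAN-style generator for synthetic PD pulse waveforms.
# All layer sizes and the 1024-sample output length are illustrative.
import torch
import torch.nn as nn

class PDGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # latent vector (length 1) -> 128 channels x 64 samples
            nn.ConvTranspose1d(latent_dim, 128, kernel_size=64),
            nn.BatchNorm1d(128),
            nn.ReLU(inplace=True),
            # upsample 64 -> 256 samples
            nn.ConvTranspose1d(128, 64, kernel_size=4, stride=4),
            nn.BatchNorm1d(64),
            nn.ReLU(inplace=True),
            # upsample 256 -> 1024 samples, single output channel
            nn.ConvTranspose1d(64, 1, kernel_size=4, stride=4),
            nn.Tanh(),  # pulses normalized to [-1, 1]
        )

    def forward(self, z):  # z: (batch, latent_dim, 1)
        return self.net(z)

gen = PDGenerator()
fake_pulses = gen(torch.randn(8, 100, 1))  # -> (8, 1, 1024)
```

    A discriminator mirroring this stack with strided Conv1d layers, trained adversarially against real PD pulses, would complete the DCGAN setup.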

    Oesophageal speech: enrichment and evaluations

    After a laryngectomy (i.e. removal of the larynx), a patient can no longer speak with a healthy laryngeal voice. They therefore need to adopt alternative methods of speaking, such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech more difficult to process than healthy speech, both auditorily and in terms of signal processing. The aim of this thesis is to find solutions that make oesophageal speech signals easier to process, and to evaluate these solutions using a wide range of evaluation metrics.

    First, preliminary studies were performed to compare oesophageal speech and healthy speech. These revealed significantly lower intelligibility and higher listening effort for oesophageal speech. Intelligibility scores were comparable for listeners familiar and unfamiliar with oesophageal speech; however, familiar listeners reported less effort. In another experiment, oesophageal speech was reported to require more listening effort than healthy speech even though its intelligibility was comparable. Investigating a neural correlate of listening effort (alpha power) using electroencephalography showed higher alpha power for oesophageal speech than for healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. lower working memory capacity) showed higher alpha power.

    Next, using several algorithms (pre-existing as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful. The novel approach was a deep neural network based voice conversion system in which the source was oesophageal speech and the target was synthetic speech matched in duration with the source. This eliminated the source-target alignment process, which is particularly error-prone for disordered speech such as oesophageal speech. Both speaker-dependent and speaker-independent versions of this system were implemented. The outputs of the speaker-dependent system achieved better short-term objective intelligibility scores, automatic speech recognition performance and listener preference scores than unprocessed oesophageal speech. The speaker-independent system improved short-term objective intelligibility scores but not automatic speech recognition performance. Other signal transformations were also explored, including removal of undesired artefacts and methods to improve the fundamental frequency. Of these, only the removal of undesired silences was somewhat successful (a 1.44 percentage point improvement in automatic speech recognition performance), and only for low-intelligibility oesophageal speech.

    Lastly, the outputs of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics: short-term objective intelligibility, automatic speech recognition, subjective listening tests and neural measures obtained using electroencephalography. The results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the subjective evaluations, the results were mixed: some improvement in preference scores, but no improvement in speech intelligibility or listening effort scores. Overall, the results demonstrate several possibilities and new paths for enriching oesophageal speech using modern machine learning algorithms. The outcomes should benefit the disordered speech community.
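    One of the metrics named above, short-term objective intelligibility (STOI), is straightforward to compute with the pystoi package; this is a minimal sketch, and the file names and pairing with a duration-matched reference are assumptions for the example.

```python
# Sketch: scoring a converted utterance with short-term objective
# intelligibility (STOI). File names are hypothetical.
import soundfile as sf
from pystoi import stoi

ref, fs = sf.read('target_synthetic.wav')        # duration-matched reference
conv, fs2 = sf.read('converted_oesophageal.wav') # voice-conversion output
assert fs == fs2, 'sampling rates must match'

n = min(len(ref), len(conv))                     # STOI needs equal lengths
score = stoi(ref[:n], conv[:n], fs, extended=False)  # roughly 0 (poor) to 1 (good)
print(f'STOI: {score:.3f}')
```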

    Acoustic Echo Estimation using the model-based approach with Application to Spatial Map Construction in Robotics


    Physiologically-Motivated Feature Extraction Methods for Speaker Recognition

    Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, and thus the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and the complexity of the models required to cover the phonetic space, especially in tasks such as identification where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics, intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent unique characteristics of speech production not captured by current state-of-the-art speaker identification systems. The proposed features are evaluated in three experimental paradigms: cross-lingual speaker identification, cross-song-type avian speaker identification, and mono-lingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically focused information present in traditional spectral features. The incorporation of the proposed glottal source features offers significant overall improvement to the robustness and accuracy of speaker identification tasks.
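    The abstract does not define RPCC, GLFCC or TPCC in detail, but a generic source-feature pipeline conveys the idea: inverse-filter the signal with an LPC model to estimate the glottal excitation, then take cepstral coefficients of the residual. The sketch below is a rough analogue under those assumptions, not the dissertation's exact features.

```python
# Sketch of a generic vocal-source feature pipeline: LPC inverse filtering
# to estimate the excitation, then a real cepstrum of the residual.
# A rough analogue of residual-based features such as RPCC, not the
# dissertation's exact definitions.
import numpy as np
import librosa
from scipy.signal import lfilter

y, sr = librosa.load('speaker.wav', sr=16000)  # hypothetical file
a = librosa.lpc(y, order=16)                   # all-pole vocal-tract model
residual = lfilter(a, [1.0], y)                # inverse filter -> excitation estimate

# Frame-level real cepstrum of the residual spectrum.
S = np.abs(librosa.stft(residual, n_fft=512))
cep = np.fft.irfft(np.log(S + 1e-8), axis=0)
features = cep[:13]                            # keep low-quefrency coefficients
print(features.shape)                          # (13, n_frames)
```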

    Papers for Task Force Meeting on Future and Impacts of Artificial Intelligence, 15-17 August 1983

    IIASA's Clearinghouse activity is oriented towards issues of interest among our National Member Organizations. At the forefront are issues concerning the promise and impact of science and technology on society and the economy in general, and on selected branches in particular. Artificial Intelligence (AI) is one of the most promising research areas, and there are many indications that the long-predicted upswing of this discipline is finally in the making. In a recent survey, Nobel laureates predicted that computers, AI, and robotics will exert the greatest influence in the next century. Already, "expert" systems are emerging and being applied; natural language understanding systems are being developed; and AI principles are used in robots, flexible automation, computer-aided design, and more. All this will have an as-yet unspecified social and economic impact on the activities of human beings, both at work and at leisure. It will take interdisciplinary and cross-culturally based studies to enhance our understanding of this complex phenomenon, and this is the aim of our endeavors in the field, which go beyond our duty to pass useful knowledge to our constituency. We think that IIASA, cooperating in this respect with the Austrian Society for Cybernetic Studies (ASCS), can develop a comparative advantage here. This publication contains papers written by leading personalities in the field of artificial intelligence, from both East and West, on the future and impact of this emerging discipline. We hope that the meeting at which the papers will be discussed will not only identify the areas where the impact of artificial intelligence will be felt most directly, but also find the most rewarding issues for further research.