139 research outputs found

    Deep learning techniques for biological signal processing: Automatic detection of dolphin sounds

    Considering the heterogeneous underwater acoustic transmission context, detecting and distinguishing cetacean vocalizations has been a challenging area of recent interest. A promising avenue for improving current detection systems is machine learning. In particular, Convolutional Neural Networks (CNNs) are considered one of the most promising deep learning techniques, since they have already excelled in problems involving the automatic processing of biological sounds. Human-annotated spectrograms can be used to teach CNNs to distinguish information in the time-frequency domain, thus enabling the detection and classification of marine mammal sounds. However, despite these promising capabilities, machine learning suffers from a lack of labeled data, which calls for the adoption of transfer learning to create accurate models even when the availability of human taggers is limited. In this thesis, we developed a dolphin whistle detection framework based on deep learning models. In particular, we investigated the performance of a large-scale pre-trained model (VGG16) and compared it with that of a vanilla Convolutional Neural Network and several baselines (logistic regression and Support Vector Machines). The pre-trained VGG16 model achieved the best detection performance, with an accuracy of 98.9% on a left-out test dataset.
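
    A minimal sketch of the transfer-learning setup described above, using Keras: VGG16 pre-trained on ImageNet serves as a frozen feature extractor, topped by a small binary head that classifies a spectrogram as containing a whistle or not. The input size, head layout, and optimizer settings are illustrative assumptions, not the thesis's exact configuration.

    import tensorflow as tf
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # Pre-trained convolutional features, without the ImageNet classifier head.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # freeze features: only the new head is trained

    # Binary head: whistle present vs. absent in the input spectrogram.
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

    Spectrograms are fed as three-channel images (e.g., a channel-replicated magnitude spectrogram) so they match the input VGG16 expects.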

    Privacy-preserving and Privacy-attacking Approaches for Speech and Audio -- A Survey

    In contemporary society, voice-controlled devices such as smartphones and home assistants have become pervasive due to their advanced capabilities and functionality. The always-on nature of their microphones offers users the convenience of readily accessing these devices. However, recent research and events have revealed that such voice-controlled devices are prone to various forms of malicious attack, making safeguarding against such attacks a growing concern for both users and researchers. Despite the numerous studies that have investigated adversarial attacks and privacy preservation for images, no comparably conclusive study has been conducted for the audio domain. Therefore, this paper examines existing privacy-preserving and privacy-attacking strategies for audio and speech. To achieve this goal, we classify the attack and defense scenarios into several categories and provide a detailed analysis of each approach. We also interpret the dissimilarities between the various approaches, highlight their contributions, and examine their limitations. Our investigation reveals that voice-controlled devices based on neural networks are inherently susceptible to specific types of attacks. Although it is possible to enhance the robustness of such models against certain forms of attack, more sophisticated approaches are required to comprehensively safeguard user privacy.
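
    As one concrete illustration of the attack side surveyed above, a fast-gradient-sign (FGSM) perturbation of a waveform against a differentiable audio classifier can be sketched as follows; `model` is a placeholder PyTorch classifier, not an API from the paper.

    import torch
    import torch.nn.functional as F

    def fgsm_audio(model, waveform, label, epsilon=1e-3):
        """Return a waveform perturbed to push `model` away from `label`."""
        waveform = waveform.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(waveform), label)
        loss.backward()
        # Step in the direction that increases the loss, bounded by epsilon.
        return (waveform + epsilon * waveform.grad.sign()).detach()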

    Measuring context dependency in birdsong using artificial neural networks

    Context dependency is a key feature of the sequential structure of human language, which requires reference between words far apart in the produced sequence. Assessing how far back the past context affects the current behavior provides crucial information for understanding the mechanisms behind complex sequential behaviors. Birdsong serves as a representative model for studying context dependency in sequential signals produced by non-human animals, although previous estimates were upper-bounded by methodological limitations. Here, we estimated the context dependency in birdsong in a more scalable way, using a modern neural-network-based language model whose accessible context length is sufficiently long. The detected context dependency was beyond the order of traditional Markovian models of birdsong, but was consistent with previous experimental investigations. We also studied the relation between the assumed/auto-detected vocabulary size of birdsong (i.e., fine- vs. coarse-grained syllable classifications) and the context dependency. It turned out that the larger the assumed vocabulary (i.e., the more fine-grained the classification), the shorter the detected context dependency.
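
    A sketch of the measurement idea, under assumptions: score each next syllable with a trained language model while truncating the visible context, and take the context dependency to be the shortest context length whose loss matches that of the full context. The `model.negative_log_likelihood` call is a hypothetical stand-in for whatever scoring interface the language model exposes.

    import numpy as np

    def mean_nll(model, sequences, context_len):
        """Average next-syllable NLL when only `context_len` syllables are visible."""
        losses = []
        for seq in sequences:
            for t in range(1, len(seq)):
                ctx = seq[max(0, t - context_len):t]
                losses.append(model.negative_log_likelihood(ctx, seq[t]))
        return float(np.mean(losses))

    # Context dependency: the smallest L at which mean_nll(model, data, L)
    # stops improving relative to the full-context loss.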

    Implementation of a distributed real-time video panorama pipeline for creating high quality virtual views

    Today, we are continuously looking for more immersive video systems. Such systems, however, require more content, which can be costly to produce. A full panorama, covering regions of interest, can contain all the information required, but can be difficult to view in its entirety. In this thesis, we discuss a method for creating virtual views from a cylindrical panorama, allowing multiple users to create individual virtual cameras from the same panorama video. We discuss how this method can be used for video delivery, but emphasize the creation of the initial panorama. The panorama must be created in real-time and with very high quality. We design and implement a prototype recording pipeline, installed at a soccer stadium as part of the Bagadus project. We describe a pipeline capable of producing 4K panorama videos from five HD cameras, in real-time, with possibilities for further upscaling. We explain how the cylindrical panorama can be created with minimal computational cost and without visible seams. The cameras of our prototype system record video in the Bayer format, in which each pixel captures only one color component, and we also investigate which debayering algorithms are best suited for recording multiple high-resolution video streams in real-time.
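
    To illustrate the debayering step: each sensor pixel captures a single color sample, and demosaicing interpolates the two missing channels per pixel. OpenCV exposes several demosaicing algorithms with different speed/quality trade-offs, which is the kind of comparison described above; the RGGB pattern, resolution, and file name below are assumptions, not the thesis's setup.

    import cv2
    import numpy as np

    # One HD frame of raw Bayer mosaic data (single channel, 8-bit).
    raw = np.fromfile("frame.raw", dtype=np.uint8).reshape(1080, 1920)

    fast = cv2.cvtColor(raw, cv2.COLOR_BayerRG2BGR)      # bilinear: cheapest
    vng = cv2.cvtColor(raw, cv2.COLOR_BayerRG2BGR_VNG)   # variable number of gradients
    ea = cv2.cvtColor(raw, cv2.COLOR_BayerRG2BGR_EA)     # edge-aware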

    An Investigation Into the Feasibility of Streamlining Language Sample Analysis Through Computer-Automated Transcription and Scoring

    The purpose of the study was to investigate the feasibility of streamlining the transcription and scoring portion of language sample analysis (LSA) through computer automation. LSA is a gold-standard procedure for examining children's language abilities that is underutilized by speech-language pathologists due to its time-consuming nature. To decrease the time associated with the process, the accuracy of transcripts produced automatically with Google Cloud Speech and the accuracy of scores generated by a hard-coded scoring function called the Literate Language Use in Narrative Analysis (LLUNA) were evaluated. A collection of narrative transcripts and audio recordings of narrative samples was selected to evaluate the accuracy of these automated systems. Samples were previously elicited from school-age children between the ages of 6;0 and 11;11 who were either typically developing (TD), at risk for language-related learning disabilities (AR), or had developmental language disorder (DLD). Transcription error of Google Cloud Speech transcripts was evaluated with a weighted word-error rate (WERw). Score accuracy was evaluated with a quadratic weighted kappa (Kqw). Results indicated an average WERw of 48% across all language sample recordings, with a median WERw of 40%. Several recording characteristics were associated with transcription error, including the codec used to record the audio sample and the presence of background noise; transcription error was lower, on average, for samples recorded with a lossless codec and containing no background noise. Scoring accuracy of LLUNA was high across all six measures of literate language when scores were generated from traditionally produced transcripts, regardless of age or language ability (TD, DLD, AR). Adverbs were the most variable in their score accuracy. Scoring accuracy dropped when LLUNA generated scores from transcripts produced by Google Cloud Speech; however, LLUNA was more likely to generate accurate scores when transcripts had low to moderate levels of transcription error. This work provides additional support for the use of automated transcription under the right recording conditions and for automated scoring of literate language indices. It also provides preliminary support for streamlining the entire LSA process by automating both transcription and scoring when high-quality recordings of language samples are utilized.
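
    A sketch of the transcript evaluation metric, under assumptions: plain word error rate is the edit distance between reference and hypothesis word sequences divided by the reference length, and the study's weighted variant (WERw) adjusts per-error costs, represented here by illustrative `sub`, `ins`, and `dele` parameters rather than the study's actual weights.

    import numpy as np

    def wer(ref, hyp, sub=1.0, ins=1.0, dele=1.0):
        """Weighted word error rate between reference and hypothesis strings."""
        r, h = ref.split(), hyp.split()
        d = np.zeros((len(r) + 1, len(h) + 1))
        d[:, 0] = np.arange(len(r) + 1) * dele
        d[0, :] = np.arange(len(h) + 1) * ins
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0.0 if r[i - 1] == h[j - 1] else sub
                d[i, j] = min(d[i - 1, j - 1] + cost,  # substitution or match
                              d[i - 1, j] + dele,      # deletion
                              d[i, j - 1] + ins)       # insertion
        return d[len(r), len(h)] / len(r)

    Agreement scores of the kind reported as Kqw can be computed with sklearn.metrics.cohen_kappa_score(rater_a, rater_b, weights="quadratic").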