Two-Dimensional Convolutional Recurrent Neural Networks for Speech Activity Detection

Author: Chen Liming
Fanioudakis Eleftherios
Giakoumis Dimitrios
Hamzaoui Raouf
Potamitis Ilyas
Tzovaras Dimitrios
Vafeiadis Anastasios
Votis Konstantinos
Publication venue: 'International Speech Communication Association'
Publication date: 17/06/2019

Speech Activity Detection (SAD) plays an important role in mobile communications and automatic speech recognition (ASR). Developing efficient SAD systems for real-world applications is a challenging task due to the presence of noise. We propose a new approach to SAD where we treat it as a two-dimensional multilabel image classification problem. To classify the audio segments, we compute their Short-time Fourier Transform spectrograms and classify them with a Convolutional Recurrent Neural Network (CRNN), traditionally used in image recognition. Our CRNN uses a sigmoid activation function, max-pooling in the frequency domain, and a convolutional operation as a moving average filter to remove misclassified spikes. On the development set of Task 1 of the 2019 Fearless Steps Challenge, our system achieved a decision cost function (DCF) of 2.89%, a 66.4% improvement over the baseline. Moreover, it achieved a DCF score of 3.318% on the evaluation dataset of the challenge, ranking first among all submissions.
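The spike-removal step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows how a moving-average (box) filter over per-frame speech scores suppresses isolated misclassifications before thresholding. The filter width, the 0.5 threshold, the edge handling, and the example scores are all assumptions chosen for illustration.

```python
def moving_average(scores, width):
    """Smooth per-frame scores with a centred box filter.

    Equivalent to convolving with a uniform kernel; at the edges the
    average is taken over the frames that are available (an assumption).
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def to_labels(scores, threshold=0.5):
    """Binarise scores: 1 = speech, 0 = non-speech."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical frame scores: a spurious spike at frame 2 and a dip at frame 7.
scores = [0.1, 0.1, 0.9, 0.1, 0.1, 0.9, 0.9, 0.2, 0.9, 0.9]

raw_labels = to_labels(scores)
smooth_labels = to_labels(moving_average(scores, width=3))
```

With these made-up scores, the isolated spike and the isolated dip both disappear after smoothing, at the cost of slightly blurring segment boundaries; a wider filter removes longer spikes but blurs more.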

Crossref

De Montfort University Open Research Archive

Denoising Deep Neural Networks Based Voice Activity Detection

Author: Wu Ji
Zhang Xiao-Lei
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 04/03/2013

Recently, deep-belief-network (DBN) based voice activity detection (VAD) has been proposed. It is powerful in fusing the advantages of multiple features, and achieves state-of-the-art performance. However, the deep layers of the DBN-based VAD do not show an apparent superiority to the shallower layers. In this paper, we propose a denoising-deep-neural-network (DDNN) based VAD to address the aforementioned problem. Specifically, we pre-train a deep neural network in a special unsupervised denoising greedy layer-wise mode, and then fine-tune the whole network in a supervised way by the common back-propagation algorithm. In the pre-training phase, we take the noisy speech signals as the visible layer and try to extract a new feature that minimizes the reconstruction cross-entropy loss between the noisy speech signals and their corresponding clean speech signals. Experimental results show that the proposed DDNN-based VAD not only outperforms the DBN-based VAD but also shows an apparent performance improvement of the deep layers over shallower layers.
Comment: This paper has been accepted by IEEE ICASSP-2013, and will be published online after May, 201
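As a rough illustration of the pre-training objective described in this abstract, the sketch below computes a reconstruction cross-entropy between a layer's output (computed from noisy input) and the clean target. It is only a sketch under stated assumptions: features are assumed to be scaled to [0, 1], the function name and example values are hypothetical, and the network that would produce the reconstruction is omitted.

```python
import math


def reconstruction_cross_entropy(reconstruction, clean, eps=1e-12):
    """Cross-entropy between a reconstruction and its clean target.

    Both sequences are assumed to hold values in [0, 1]; eps guards
    against log(0). Lower is better.
    """
    return -sum(
        c * math.log(r + eps) + (1.0 - c) * math.log(1.0 - r + eps)
        for r, c in zip(reconstruction, clean)
    )


# Hypothetical clean feature vector and two candidate reconstructions
# that a layer might produce from the noisy version of the same frame.
clean = [1.0, 0.0]
good_reconstruction = [0.9, 0.1]
poor_reconstruction = [0.5, 0.5]

good_loss = reconstruction_cross_entropy(good_reconstruction, clean)
poor_loss = reconstruction_cross_entropy(poor_reconstruction, clean)
```

Pre-training in this style would adjust the layer's weights to reduce this loss, so that the learned feature carries information about the clean signal instead of the noise.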

arXiv.org e-Print Archive

Crossref

Deep Learning for Audio Signal Processing

Author: Chang Shuo-yiin
Li Bo
Purwins Hendrik
Sainath Tara
Schlüter Jan
Virtanen Tuomas
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2019

Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
Comment: 15 pages, 2 pdf figure

arXiv.org e-Print Archive

VBN

Predicting continuous conflict perception with Bayesian Gaussian processes

Author: Filippone Maurizio
Kim Samuel
Valente Fabio
Vinciarelli Alessandro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2014

Conflict is one of the most important phenomena of social life, but it is still largely neglected by the computing community. This work proposes an approach that detects common conversational social signals (loudness, overlapping speech, etc.) and predicts the conflict level perceived by human observers in continuous, non-categorical terms. The proposed regression approach is fully Bayesian and it adopts Automatic Relevance Determination to identify the social signals that most influence the outcome of the prediction. The experiments are performed over the SSPNet Conflict Corpus, a publicly available collection of 1430 clips extracted from televised political debates (roughly 12 hours of material for 138 subjects in total). The results show that it is possible to achieve a correlation close to 0.8 between actual and predicted conflict perception.
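Automatic Relevance Determination is commonly realised in Gaussian processes by giving each input feature its own length-scale in the covariance function. The abstract does not say which kernel the authors use, so the squared-exponential form below, the function name, and the example values are assumptions, offered only to show why a large length-scale marks a feature as irrelevant.

```python
import math


def ard_rbf_kernel(x, y, lengthscales, variance=1.0):
    """Squared-exponential kernel with one length-scale per feature.

    A large length-scale shrinks that feature's contribution to the
    distance, so the feature barely affects the covariance.
    """
    sq_dist = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * math.exp(-0.5 * sq_dist)


# Hypothetical two-feature inputs: feature 0 has a short length-scale
# (relevant), feature 1 a very long one (effectively switched off).
lengthscales = [1.0, 100.0]
base = [1.0, 0.0]

# Changing only the "irrelevant" feature leaves the covariance near 1.
k_irrelevant_change = ard_rbf_kernel(base, [1.0, 5.0], lengthscales)

# Changing only the "relevant" feature reduces the covariance noticeably.
k_relevant_change = ard_rbf_kernel(base, [2.0, 0.0], lengthscales)
```

In a fitted model the length-scales are learned (or, in a fully Bayesian treatment, inferred) from data, and inspecting them indicates which signals drive the prediction.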

Crossref

Enlighten

Anomaly Detection in Network Streams Through a Distributional Lens

Author: Arackaparambil Chrisil
Publication venue: Dartmouth Digital Commons
Publication date: 01/09/2011

Anomaly detection in computer networks yields valuable information on events relating to the components of a network, their states, the users in a network and their activities. This thesis provides a unified distribution-based methodology for online detection of anomalies in network traffic streams. The methodology is distribution-based in that it regards the traffic stream as a time series of distributions (histograms), and monitors metrics of distributions in the time series. The effectiveness of the methodology is demonstrated in three application scenarios. First, in 802.11 wireless traffic, we show the ability to detect certain classes of attacks using the methodology. Second, in information network update streams (specifically in Wikipedia) we show the ability to detect the activity of bots, flash events, and outages, as they occur. Third, in Voice over IP traffic streams, we show the ability to detect covert channels that exfiltrate confidential information out of the network. Our experiments show the high detection rate of the methodology when compared to other existing methods, while maintaining a low rate of false positives. Furthermore, we provide algorithmic results that enable efficient and scalable implementation of the above methodology, to accommodate the massive data rates observed in modern information streams on the Internet. Through these applications, we present an extensive study of several aspects of the methodology. We analyze the behavior of metrics we consider, providing justification of our choice of those metrics, and how they can be used to diagnose anomalies. We provide insight into the choice of parameters, like window length and threshold, used in anomaly detection.
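The idea of treating a stream as a time series of histograms and monitoring a metric of each one can be sketched in a few lines. The abstract does not name the metrics used in the thesis, so Shannon entropy, the change-between-windows rule, and the example stream below are all assumptions; only the roles of the window length and threshold parameters come from the text.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of the empirical histogram of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def detect_changes(stream, window, threshold):
    """Split the stream into non-overlapping windows, compute the entropy
    of each window's histogram, and flag windows whose entropy differs
    from the previous window's by more than `threshold`."""
    entropies = [
        entropy(stream[i:i + window])
        for i in range(0, len(stream) - window + 1, window)
    ]
    flagged = [
        k for k in range(1, len(entropies))
        if abs(entropies[k] - entropies[k - 1]) > threshold
    ]
    return entropies, flagged


# Hypothetical stream of destination ports; the third window is dominated
# by a single value, as a scan or flood against one service might be.
stream = [80, 443, 80, 443,
          80, 443, 443, 80,
          22, 22, 22, 22,
          80, 443, 80, 443]

entropies, flagged = detect_changes(stream, window=4, threshold=0.5)
```

On this made-up stream the rule flags window 2 (the collapse to a single port) and window 3 (the return to normal traffic). A longer window gives smoother histograms but slower detection, and a higher threshold trades missed events for fewer false positives.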

Dartmouth Digital Commons (Dartmouth College)