5,957 research outputs found

A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition

Author: Bellet Aurelien
Collins Michael
Fan Linxi
Garakani Alireza Bagheri
Guo Dong
Kingsbury Brian
Liu Kuan
Lu Zhiyun
May Avner
Picheny Michael
Sha Fei
Publication date: 18/03/2016

We study large-scale kernel methods for acoustic modeling and compare them to DNNs on performance metrics related to both acoustic modeling and recognition. Measured by perplexity and frame-level classification accuracy, kernel-based acoustic models are as effective as their DNN counterparts. However, on token error rates, DNN models can be significantly better. We have discovered that this might be attributed to the DNNs' unique strength in reducing both the perplexity and the entropy of the predicted posterior probabilities. Motivated by our findings, we propose a new technique, entropy regularized perplexity, for model selection. This technique can noticeably improve the recognition performance of both types of models, and reduces the gap between them. While effective on Broadcast News, this technique could also be applicable to other tasks. Comment: arXiv admin note: text overlap with arXiv:1411.400
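The two quantities named in this abstract, perplexity and the entropy of the predicted posteriors, can be computed from frame-level outputs as in the minimal sketch below. The function `selection_score` and its weight `lam` are hypothetical illustrations of how the two might be combined for model selection. The abstract does not give the exact form of entropy regularized perplexity, so this should not be read as the authors' formula.

```python
import math

def frame_perplexity(posteriors, labels):
    """exp of the mean negative log-probability given to the reference labels."""
    nll = [-math.log(p[y]) for p, y in zip(posteriors, labels)]
    return math.exp(sum(nll) / len(nll))

def mean_entropy(posteriors):
    """Average Shannon entropy (in nats) of the predicted posteriors."""
    ent = [-sum(q * math.log(q) for q in p if q > 0.0) for p in posteriors]
    return sum(ent) / len(ent)

def selection_score(posteriors, labels, lam=1.0):
    # Hypothetical criterion (an assumption, not taken from the paper):
    # log-perplexity plus lam times the mean posterior entropy; lower is better.
    return math.log(frame_perplexity(posteriors, labels)) + lam * mean_entropy(posteriors)

# Toy posteriors over three classes for three frames.
posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
print(frame_perplexity(posteriors, labels), mean_entropy(posteriors))
```

Under this sketch, two models with equal perplexity would be ranked by how peaked their posteriors are, which mirrors the observation in the abstract that DNNs reduce both quantities.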

arXiv.org e-Print Archive

HAL - Lille 3

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Kernel Approximation Methods for Speech Recognition

Author: May Avner
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2018

Over the past five years or so, deep learning methods have dramatically improved the state of the art performance in a variety of domains, including speech recognition, computer vision, and natural language processing. Importantly, however, they suffer from a number of drawbacks: 1. Training these models is a non-convex optimization problem, and thus it is difficult to guarantee that a trained model minimizes the desired loss function. 2. These models are difficult to interpret. In particular, it is difficult to explain, for a given model, why the computations it performs make accurate predictions. In contrast, kernel methods are straightforward to interpret, and training them is a convex optimization problem. Unfortunately, solving these optimization problems exactly is typically prohibitively expensive, though one can use approximation methods to circumvent this problem. In this thesis, we explore to what extent kernel approximation methods can compete with deep learning, in the context of large-scale prediction tasks. Our contributions are as follows: 1. We perform the most extensive set of experiments to date using kernel approximation methods in the context of large-scale speech recognition tasks, and compare performance with deep neural networks. 2. We propose a feature selection algorithm which significantly improves the performance of the kernel models, making their performance competitive with fully-connected feedforward neural networks. 3. We perform an in-depth comparison between two leading kernel approximation strategies — random Fourier features [Rahimi and Recht, 2007] and the Nyström method [Williams and Seeger, 2001] — showing that although the Nyström method is better at approximating the kernel, it performs worse than random Fourier features when used for learning. We believe this work opens the door for future research to continue to push the boundary of what is possible with kernel methods. 
This research direction will also shed light on the question of when, if ever, deep models are needed for attaining strong performance.
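As an illustration of one of the two approximation strategies compared in this thesis, the sketch below builds random Fourier features for a Gaussian kernel using only the Python standard library. The names `make_rff` and `gaussian_kernel`, the bandwidth `sigma`, and the feature count are assumptions chosen for the example and are not taken from the thesis.

```python
import math
import random

def make_rff(dim, num_features, sigma, seed=0):
    """Return a map z(x) whose inner products approximate a Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from the kernel's spectral density N(0, sigma^-2 I).
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
         for _ in range(num_features)]
    # Random phases drawn uniformly from [0, 2*pi].
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def featurize(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bj)
                for w, bj in zip(W, b)]

    return featurize

def gaussian_kernel(x, y, sigma):
    """Exact kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

z = make_rff(dim=3, num_features=4000, sigma=1.0)
x, y = [0.1, 0.2, 0.3], [0.4, -0.1, 0.0]
approx = sum(a * c for a, c in zip(z(x), z(y)))
print(approx, gaussian_kernel(x, y, 1.0))
```

A linear model trained on `z(x)` then stands in for the exact kernel machine, which is what makes training tractable at large scale. The approximation error shrinks as the number of random features grows.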

Columbia University Academic Commons

Predicting Audio Advertisement Quality

Author: Bandiera Giuseppe
Böck Sebastian
Glorot Xavier
Kim Youngmoo E
Kingma Diederik
Nieto Oriol
Prockup Matthew
Schörkhuber Christian
Seyerlehner Klaus
Witten Ian H
Publication venue: Association for Computing Machinery (ACM)
Publication date: 09/02/2018

Online audio advertising is a particular form of advertising used abundantly in online music streaming services. On these platforms, which tend to host tens of thousands of unique audio advertisements (ads), providing high-quality ads ensures a better user experience and results in longer user engagement. Therefore, the automatic assessment of these ads is an important step toward audio ad ranking and better audio ad creation. In this paper we propose one way to measure the quality of audio ads using a proxy metric called Long Click Rate (LCR), which is defined as the amount of time a user engages with the follow-up display ad (shown while the audio ad is playing) divided by the number of impressions. We then focus on predicting audio ad quality using only acoustic features such as harmony, rhythm, and timbre of the audio, extracted from the raw waveform. We discuss how the characteristics of the sound can be connected to concepts such as the clarity of the audio ad message, its trustworthiness, etc. Finally, we propose a new deep learning model for audio ad quality prediction, which outperforms the other discussed models trained on hand-crafted features. To the best of our knowledge, this is the first large-scale audio ad quality prediction study. Comment: WSDM '18 Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 9 pages
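Read literally, the LCR definition in this abstract is a ratio of total engagement time to impressions. The sketch below shows that arithmetic. The function name, the use of seconds, and the toy numbers are assumptions, and the paper itself may define the metric more precisely than the abstract does.

```python
def long_click_rate(engagement_seconds, impressions):
    """Total engagement time with the follow-up display ad per impression."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return sum(engagement_seconds) / impressions

# Hypothetical ad served 4 times; one listener never engaged.
lcr = long_click_rate([12.0, 0.0, 30.0, 18.0], impressions=4)
print(lcr)
```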

arXiv.org e-Print Archive

Crossref

Feature Learning from Spectrograms for Assessment of Personality Traits

Author: Attabi Yazid
Carbonneau Marc-André
Gagnon Ghyslain
Granger Eric
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 04/10/2016

Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand-crafted speech processing features extracted with off-the-shelf toolboxes. To achieve high accuracy, numerous features are typically extracted using complex and highly parameterized algorithms. In this paper, a new method based on feature learning and spectrogram analysis is proposed to simplify the feature extraction process while maintaining a high level of accuracy. The proposed method learns a dictionary of discriminant features from patches extracted from the spectrogram representations of training speech segments. Each speech segment is then encoded using the dictionary, and the resulting feature set is used to classify personality traits. Experiments indicate that the proposed method achieves state-of-the-art results with a significant reduction in complexity when compared to the most recent reference methods. The number of features and the difficulties linked to the feature extraction process are greatly reduced, as only one type of descriptor is used, for which the 6 parameters can be tuned automatically. In contrast, the simplest reference method uses 4 types of descriptors to which 6 functionals are applied, resulting in over 20 parameters to be tuned. Comment: 12 pages, 3 figures
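The pipeline described in this abstract (extract spectrogram patches, learn a dictionary, encode each segment with it) can be sketched as below. Plain k-means is swapped in for the dictionary learning step and a normalized assignment histogram for the encoding. Both are stand-ins chosen for brevity, not the discriminant method of the paper, and all function names and sizes are hypothetical.

```python
import random

def extract_patches(spec, size):
    """Flatten every size x size patch of a 2-D spectrogram (list of rows)."""
    rows, cols = len(spec), len(spec[0])
    return [[spec[r + i][c + j] for i in range(size) for j in range(size)]
            for r in range(rows - size + 1) for c in range(cols - size + 1)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(patch, atoms):
    """Index of the dictionary atom closest to the patch."""
    return min(range(len(atoms)), key=lambda k: sq_dist(patch, atoms[k]))

def learn_dictionary(patches, k, iters=10, seed=0):
    """k-means stand-in for dictionary learning: atoms are cluster means."""
    rng = random.Random(seed)
    atoms = [list(p) for p in rng.sample(patches, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in patches:
            groups[nearest(p, atoms)].append(p)
        for idx, g in enumerate(groups):
            if g:  # keep the old atom if no patch was assigned to it
                atoms[idx] = [sum(col) / len(g) for col in zip(*g)]
    return atoms

def encode(spec, atoms, size):
    """Represent a segment as the normalized histogram of nearest atoms."""
    patches = extract_patches(spec, size)
    hist = [0.0] * len(atoms)
    for p in patches:
        hist[nearest(p, atoms)] += 1.0
    return [h / len(patches) for h in hist]

# Toy 6 x 8 "spectrogram" of random magnitudes.
rng = random.Random(1)
spec = [[rng.random() for _ in range(8)] for _ in range(6)]
atoms = learn_dictionary(extract_patches(spec, 3), k=4)
code = encode(spec, atoms, 3)
print(code)
```

The fixed-length vector `code` is what a downstream classifier would consume, regardless of the duration of the speech segment.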

arXiv.org e-Print Archive

A Subband-Based SVM Front-End for Robust ASR

Author: Ager Matthew
Cvetkovic Zoran
Sollich Peter
Yousafzai Jibran
Publication date: 24/12/2013

This work proposes a novel support vector machine (SVM) based robust automatic speech recognition (ASR) front-end that operates on an ensemble of the subband components of high-dimensional acoustic waveforms. The key issues of selecting the appropriate SVM kernels for classification in frequency subbands and of combining the individual subband classifiers using ensemble methods are addressed. The proposed front-end is compared with state-of-the-art ASR front-ends in terms of robustness to additive noise and linear filtering. Experiments performed on the TIMIT phoneme classification task demonstrate the benefits of the proposed subband-based SVM front-end: it outperforms the standard cepstral front-end in the presence of noise and linear filtering for signal-to-noise ratios (SNR) below 12 dB. A combination of the proposed front-end with a conventional front-end such as MFCC yields further improvements over the individual front-ends across the full range of noise levels.

arXiv.org e-Print Archive

King's Research Portal

Proceedings of the second "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'14)

Author: Absil P. -A.
Anthoine S.
Bertin N.
Bilen C.
Boumal N.
Boursier Y.
Bundervoet S.
Cambareri V.
Chabiron O.
Chainais P.
Cornelis B.
Dankova M.
Daubechies I.
Daudet L.
Davies M.
De Mol C.
De Vleeschouwer C.
Degraux K.
Determe J. -F.
Dobigeon N.
Dooms A.
Drémeau A.
Dunson D.
Duval V.
Fadili J.
Fawzi A.
Frossard P.
Geelen B.
Gigan S.
Gillis N.
Golbabaee M.
Gribonval R.
Heas P.
Herzet C.
Horlin F.
Jacques L.
Kitic S.
Lafruit G.
Liang J.
Liutkus A.
Loris I.
Louveaux J.
Maggioni M.
Magoarou L. Le
Malgouyres F.
Martina D.
Minsker S.
Mishra B.
Mory C.
Ngole F.
Peyré G.
Pizurica A.
Rajmic P.
Richard C.
Schelkens P.
Schretter C.
Sepulchre R.
Setti G.
Soussen C.
Starck J. -L.
Strawn N.
Sudhakar P.
Tourneret J. -Y.
Vaiter S.
Vandergheynst P.
Vavasis S. A.
Vukobratovic D.
Publication date: 01/10/2014

The implicit objective of the biennial "international - Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST) is to foster collaboration between international scientific teams by disseminating ideas through both specific oral/poster presentations and free discussions. For its second edition, the iTWIST workshop took place in the medieval and picturesque town of Namur in Belgium, from Wednesday, August 27th to Friday, August 29th, 2014. The workshop was conveniently located in "The Arsenal" building, within walking distance of both hotels and the town center. iTWIST'14 gathered about 70 international participants and featured 9 invited talks, 10 oral presentations, and 14 posters on the following themes, all related to the theory, application and generalization of the "sparsity paradigm": Sparsity-driven data sensing and processing; Union of low dimensional subspaces; Beyond linear and convex inverse problem; Matrix/manifold/graph sensing/processing; Blind inverse problems and dictionary learning; Sparsity and computational neuroscience; Information theory, geometry and randomness; Complexity/accuracy tradeoffs in numerical methods; Sparsity? What's next?; Sparse machine learning and inference. Comment: 69 pages, 24 extended abstracts, iTWIST'14 website: http://sites.google.com/site/itwist1

arXiv.org e-Print Archive

Edinburgh Research Explorer