413 research outputs found
Learning sound representations using trainable COPE feature extractors
Sound analysis research has mainly been focused on speech and music
processing. The deployed methodologies are not suitable for analysis of sounds
with varying background noise, in many cases with very low signal-to-noise
ratio (SNR). In this paper, we present a method for the detection of patterns
of interest in audio signals. We propose novel trainable feature extractors,
which we call COPE (Combination of Peaks of Energy). The structure of a COPE
feature extractor is determined using a single prototype sound pattern in an
automatic configuration process, which is a type of representation learning. We
construct a set of COPE feature extractors, configured on a number of training
patterns. Then we take their responses to build feature vectors that we use in
combination with a classifier to detect and classify patterns of interest in
audio signals. We carried out experiments on four public data sets: MIVIA audio
events, MIVIA road events, ESC-10 and TU Dortmund data sets. The results that
we achieved (recognition rate equal to 91.71% on the MIVIA audio events, 94% on
the MIVIA road events, 81.25% on the ESC-10 and 94.27% on the TU Dortmund)
demonstrate the effectiveness of the proposed method and are higher than the
ones obtained by other existing approaches. The COPE feature extractors have
high robustness to variations of SNR. Real-time performance is achieved even
when the value of a large number of features is computed.Comment: Accepted for publication in Pattern Recognitio
How Tiny Can Analog Filterbank Features Be Made for Ultra-low-power On-device Keyword Spotting?
Analog feature extraction is a power-efficient and re-emerging signal
processing paradigm for implementing the front-end feature extractor in on
device keyword-spotting systems. Despite its power efficiency and re-emergence,
there is little consensus on what values the architectural parameters of its
critical block, the analog filterbank, should be set to, even though they
strongly influence power consumption. Towards building consensus and
approaching fundamental power consumption limits, we find via simulation that
through careful selection of its architectural parameters, the power of a
typical state-of-the-art analog filterbank could be reduced by 33.6x, while
sacrificing only 1.8% in downstream 10-word keyword spotting accuracy through a
back-end neural network.Comment: Accepted as a full paper by the TinyML Research Symposium 202
Embedded Knowledge-based Speech Detectors for Real-Time Recognition Tasks
Speech recognition has become common in many application domains, from dictation systems for professional practices to vocal user interfaces for people with disabilities or hands-free system control. However, so far the performance of automatic speech recognition (ASR) systems are comparable to human speech recognition (HSR) only under very strict working conditions, and in general much lower. Incorporating acoustic-phonetic knowledge into ASR design has been proven a viable approach to raise ASR accuracy. Manner of articulation attributes such as vowel, stop, fricative, approximant, nasal, and silence are examples of such knowledge. Neural networks have already been used successfully as detectors for manner of articulation attributes starting from representations of speech signal frames. In this paper, the full system implementation is described. The system has a first stage for MFCC extraction followed by a second stage implementing a sinusoidal based multi-layer perceptron for speech event classification. Implementation details over a Celoxica RC203 board are give
A 23μW Solar-Powered Keyword-Spotting ASIC with Ring-Oscillator-Based Time-Domain Feature Extraction
Voice-controlled interfaces on acoustic Internet-of-Things (IoT) sensor nodes and mobile devices require integrated low-power always-on wake-up functions such as Voice Activity Detection (VAD) and Keyword Spotting (KWS) to ensure longer battery life. Most VAD and KWS ICs focused on reducing the power of the feature extractor (FEx) as it is the most power-hungry building block. A serial Fast Fourier Transform (FFT)-based KWS chip [1] achieved 510nW; however, it suffered from a high 64ms latency and was limited to detection of only 1-to-4 keywords (2-to-5 classes). Although the analog FEx [2]–[3] for VAD/KWS reported 0.2μW-to-1 μW and 10ms-to-100ms latency, neither demonstrated >5 classes in keyword detection. In addition, their voltage-domain implementations cannot benefit from process scaling because the low supply voltage reduces signal swing; and the degradation of intrinsic gain forces transistors to have larger lengths and poor linearity
- …