146,325 research outputs found
Acoustic analysis of Sindhi speech - a pre-curser for an ASR system
The functional and formative properties of speech sounds are usually referred to as acoustic-phonetics in linguistics. This research aims to demonstrate acoustic-phonetic features of the elemental sounds of Sindhi, which is a branch of the Indo-European family of languages mainly spoken in the Sindh province of Pakistan and in some parts of India. In addition to the available articulatory-phonetic knowledge; acoustic-phonetic knowledge has been classified for the identification and classification of Sindhi language sounds. Determining the acoustic features of the language sounds helps to bring together the sounds with similar acoustic characteristics under the name of one natural class of meaningful phonemes. The obtained acoustic features and corresponding statistical results for a particular natural class of phonemes provides a clear understanding of the meaningful phonemes of Sindhi and it also helps to eliminate redundant sounds present in the inventory. At present Sindhi includes nine redundant, three interchanging, three substituting, and three confused pairs of consonant sounds. Some of the unique acoustic-phonetic features of Sindhi highlighted in this study are determining the acoustic features of the large number of the contrastive voiced implosives of Sindhi and the acoustic impact of the language flexibility in terms of the insertion and digestion of the short vowels in the utterance. In addition to this the issue of the presence of the affricate class of sounds and the diphthongs in Sindhi is addressed. The compilation of the meaningful language phoneme set by learning their acoustic-phonetic features serves one of the major goals of this study; because twelve such sounds of Sindhi are studied that are not yet part of the language alphabet. The main acoustic features learned for the phonological structures of Sindhi are the fundamental frequency, formants, and the duration — along with the analysis of the obtained acoustic waveforms, the formant tracks and the computer generated spectrograms. The impetus for doing such research comes from the fact that detailed knowledge of the sound characteristics of the language-elements has a broad variety of applications — from developing accurate synthetic speech production systems to modeling robust speaker-independent speech recognizers. The major research achievements and contributions this study provides in the field include the compilation and classification of the elemental sounds of Sindhi. Comprehensive measurement of the acoustic features of the language sounds; suitable to be incorporated into the design of a Sindhi ASR system. Understanding of the dialect specific acoustic variation of the elemental sounds of Sindhi. A speech database comprising the voice samples of the native Sindhi speakers. Identification of the language‘s redundant, substituting and interchanging pairs of sounds. Identification of the language‘s sounds that can potentially lead to the segmentation and recognition errors for a Sindhi ASR system design. The research achievements of this study create the fundamental building blocks for future work to design a state-of-the-art prototype, which is: gender and environment independent, continuous and conversational ASR system for Sindhi
Chimpanzee vowel-like sounds and voice quality suggest formant space expansion through the hominoid lineage
The origins of human speech are obscure; it is still unclear what aspects are unique to our species or shared with our evolutionary cousins, in part due to a lack of common framework for comparison. We asked what chimpanzee and human vocal production acoustics have in common. We examined visible supra-laryngeal articulators of four major chimpanzee vocalizations (hoos, grunts, barks, screams) and their associated acoustic structures, using techniques from human phonetic and animal communication analysis. Data were collected from wild adult chimpanzees, Taï National Park, Ivory Coast. Both discriminant and principal component classification procedures revealed classification of call types. Discriminating acoustic features include voice quality and formant structure, mirroring phonetic features in human speech. Chimpanzee lip and jaw articulation variables also offered similar discrimination of call types. Formant maps distinguished call types with different vowel-like sounds. Comparing our results with published primate data, humans show less F1–F2 correlation and further expansion of the vowel space, particularly for [i] sounds. Unlike recent studies suggesting monkeys achieve human vowel space, we conclude from our results that supra-laryngeal articulatory capacities show moderate evolutionary change, with vowel space expansion continuing through hominoid evolution. Studies on more primate species will be required to substantiate this.This article is part of the theme issue ‘Voice modulation: from origin and mechanism to social impact (Part II)’
Deep Room Recognition Using Inaudible Echos
Recent years have seen the increasing need of location awareness by mobile
applications. This paper presents a room-level indoor localization approach
based on the measured room's echos in response to a two-millisecond single-tone
inaudible chirp emitted by a smartphone's loudspeaker. Different from other
acoustics-based room recognition systems that record full-spectrum audio for up
to ten seconds, our approach records audio in a narrow inaudible band for 0.1
seconds only to preserve the user's privacy. However, the short-time and
narrowband audio signal carries limited information about the room's
characteristics, presenting challenges to accurate room recognition. This paper
applies deep learning to effectively capture the subtle fingerprints in the
rooms' acoustic responses. Our extensive experiments show that a two-layer
convolutional neural network fed with the spectrogram of the inaudible echos
achieve the best performance, compared with alternative designs using other raw
data formats and deep models. Based on this result, we design a RoomRecognize
cloud service and its mobile client library that enable the mobile application
developers to readily implement the room recognition functionality without
resorting to any existing infrastructures and add-on hardware.
Extensive evaluation shows that RoomRecognize achieves 99.7%, 97.7%, 99%, and
89% accuracy in differentiating 22 and 50 residential/office rooms, 19 spots in
a quiet museum, and 15 spots in a crowded museum, respectively. Compared with
the state-of-the-art approaches based on support vector machine, RoomRecognize
significantly improves the Pareto frontier of recognition accuracy versus
robustness against interfering sounds (e.g., ambient music).Comment: 29 page
A Robust Interpretable Deep Learning Classifier for Heart Anomaly Detection Without Segmentation
Traditionally, abnormal heart sound classification is framed as a three-stage
process. The first stage involves segmenting the phonocardiogram to detect
fundamental heart sounds; after which features are extracted and classification
is performed. Some researchers in the field argue the segmentation step is an
unwanted computational burden, whereas others embrace it as a prior step to
feature extraction. When comparing accuracies achieved by studies that have
segmented heart sounds before analysis with those who have overlooked that
step, the question of whether to segment heart sounds before feature extraction
is still open. In this study, we explicitly examine the importance of heart
sound segmentation as a prior step for heart sound classification, and then
seek to apply the obtained insights to propose a robust classifier for abnormal
heart sound detection. Furthermore, recognizing the pressing need for
explainable Artificial Intelligence (AI) models in the medical domain, we also
unveil hidden representations learned by the classifier using model
interpretation techniques. Experimental results demonstrate that the
segmentation plays an essential role in abnormal heart sound classification.
Our new classifier is also shown to be robust, stable and most importantly,
explainable, with an accuracy of almost 100% on the widely used PhysioNet
dataset
Optimal Representation of Anuran Call Spectrum in Environmental Monitoring Systems Using Wireless Sensor Networks
The analysis and classification of the sounds produced by certain animal species, notably anurans, have revealed these amphibians to be a potentially strong indicator of temperature fluctuations and therefore of the existence of climate change. Environmental monitoring systems using Wireless Sensor Networks are therefore of interest to obtain indicators of global warming. For the automatic classification of the sounds recorded on such systems, the proper representation of the sound spectrum is essential since it contains the information required for cataloguing anuran calls. The present paper focuses on this process of feature extraction by exploring three alternatives: the standardized MPEG-7, the Filter Bank Energy (FBE), and the Mel Frequency Cepstral Coefficients (MFCC). Moreover, various values for every option in the extraction of spectrum features have been considered. Throughout the paper, it is shown that representing the frame spectrum with pure FBE offers slightly worse results than using the MPEG-7 features. This performance can easily be increased, however, by rescaling the FBE in a double dimension: vertically, by taking the logarithm of the energies; and, horizontally, by applying mel scaling in the filter banks. On the other hand, representing the spectrum in the cepstral domain, as in MFCC, has shown additional marginal improvements in classification performance.University of Seville: Telefónica Chair "Intelligence Networks
Factors in human recognition of timbre lexicons generated by data clustering
Since the development of sound recording technologies, the palette of sound timbres available for music creation was extended way beyond traditional musical instruments. The organization and categorization of timbre has been a common endeavor. The availability of large databases of sound clips provides an opportunity for obtaining datadriven timbre categorizations via content-based clustering. In this article we describe an experiment aimed at understanding what factors influence the process of learning a given clustering of sound samples. We clustered a large database of short sound clips, and analyzed the success of participants in assigning sounds to the “correct” clusters after listening to a few examples of each. The results of the experiment suggest a number of relevant factors related both to the strategies followed by users and to the quality measures of the clustering solution, which can guide the design of creative applications based on audio clip clustering
Temporally-aware algorithms for the classification of anuran sounds
Several authors have shown that the sounds of anurans can be used as an indicator of
climate change. Hence, the recording, storage and further processing of a huge
number of anuran sounds, distributed over time and space, are required in order to
obtain this indicator. Furthermore, it is desirable to have algorithms and tools for
the automatic classification of the different classes of sounds. In this paper, six
classification methods are proposed, all based on the data-mining domain, which
strive to take advantage of the temporal character of the sounds. The definition and
comparison of these classification methods is undertaken using several approaches.
The main conclusions of this paper are that: (i) the sliding window method attained
the best results in the experiments presented, and even outperformed the hidden
Markov models usually employed in similar applications; (ii) noteworthy overall
classification performance has been obtained, which is an especially striking result
considering that the sounds analysed were affected by a highly noisy background;
(iii) the instance selection for the determination of the sounds in the training dataset
offers better results than cross-validation techniques; and (iv) the temporally-aware
classifiers have revealed that they can obtain better performance than their nontemporally-aware
counterparts.Consejería de Innovación, Ciencia y Empresa (Junta de Andalucía, Spain): excellence eSAPIENS number TIC 570
- …