5 research outputs found

    Age Estimation Based on Children’s Voice: A Fuzzy-Based Decision Fusion Strategy

    Automatic estimation of a speaker’s age is a challenging research topic in the area of speech analysis. In this paper, a novel approach to estimating a speaker’s age is presented. The method features a “divide and conquer” strategy wherein the speech data are divided into six groups based on vowel classes. There are two reasons for this strategy. First, reducing the complexity of the distribution of the data being processed improves the classifier’s learning performance. Second, different vowel classes contain complementary information for age estimation. Mel-frequency cepstral coefficients are computed for each group, and single-layer feed-forward neural networks based on a self-adaptive extreme learning machine are applied to the features to make a primary decision. Subsequently, fuzzy data fusion is employed to provide an overall decision by aggregating the classifiers’ outputs. The results are then compared with a number of state-of-the-art age estimation methods. Experiments conducted on six age groups of children aged between 7 and 12 years revealed that fuzzy fusion of the classifiers’ outputs yielded a considerable improvement of up to 53.33% in age estimation accuracy. Moreover, the fuzzy fusion of decisions aggregated the complementary information about a speaker’s age from the various speech sources.
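The abstract does not specify the fuzzy fusion operator used; as a minimal sketch, the final decision can be obtained by a normalised weighted aggregation of the six per-vowel-class classifier posteriors (all function names, weights and numbers below are illustrative, not the paper's):

```python
import numpy as np

def fuzzy_fuse(posteriors, weights):
    """Fuse per-vowel-class classifier outputs into one decision.

    posteriors: (n_classifiers, n_age_groups) soft outputs in [0, 1]
    weights:    per-classifier reliabilities, treated as a fuzzy measure
    Returns (winning age-group index, fused posterior vector).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()        # normalise the measure
    fused = weights @ posteriors             # weighted aggregation
    return int(np.argmax(fused)), fused

# Six vowel-class classifiers, three age groups (illustrative numbers)
p = [[0.6, 0.3, 0.1],
     [0.5, 0.4, 0.1],
     [0.2, 0.7, 0.1],
     [0.6, 0.2, 0.2],
     [0.7, 0.2, 0.1],
     [0.4, 0.4, 0.2]]
w = [1, 1, 1, 1, 1, 1]                       # equally trusted classifiers
label, fused = fuzzy_fuse(p, w)
```

With equal weights this reduces to averaging; unequal weights let more reliable vowel classes dominate the decision, which is the intuition behind fusing complementary sources.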

    Age Classification based on Machine Learning

    Master’s thesis, Seoul National University Graduate School, Department of Linguistics, College of Humanities, August 2018. Advisor: 정민화. This study classifies speaker age with machine-learning models using a large-scale Korean speech corpus collected by the Supreme Prosecutors’ Office. The model classifies speakers into three groups: 20s, 30s–40s, and 50s and over. For the experiments, the corpus was segmented at silences into 378,684 data items, which were further divided by utterance type and gender. Mel Frequency Cepstral Coefficients (MFCCs), fundamental frequency (F0), i-vectors, jitter, shimmer, and speech rate were extracted from the speech and used to classify age with a Long Short-Term Memory (LSTM) model. In addition, feature selection algorithms were applied to assess the influence of each acoustic feature, and experiments with different per-feature weightings were also conducted. The experiments compared the performance of individual acoustic features with that of feature combinations. Among individual features, training on MFCCs gave the highest accuracy at 76.01%; among combinations, training on all features gave 80.01%. Applying feature selection algorithms such as Recursive Feature Elimination (RFE) or the Extra Trees Classifier (ETC) yielded the best result in this study, 80.87%.
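The feature-selection step can be illustrated with a toy elimination loop; the thesis uses standard RFE/Extra-Trees implementations, so the correlation-based importance below is only a stand-in for a real model-driven ranking:

```python
import numpy as np

def toy_rfe(X, y, n_keep):
    """Toy stand-in for Recursive Feature Elimination: repeatedly drop
    the feature whose absolute correlation with the age label is lowest,
    until n_keep features remain. Returns surviving column indices."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in remaining]
        remaining.pop(int(np.argmin(scores)))  # eliminate weakest feature
    return remaining

# Synthetic data: 120 utterances, 4 features, 3 age groups.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 40)            # age-group labels
X = rng.normal(size=(120, 4))           # baseline noise features
X[:, 0] += 2.0 * y                      # strongly age-informative column
X[:, 1] += 1.0 * y                      # weakly age-informative column
kept = toy_rfe(X, y, n_keep=2)
```

The loop structure is the point here: each pass re-scores only the surviving features, which is what distinguishes recursive elimination from one-shot filtering.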

    A novel lip geometry approach for audio-visual speech recognition

    By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world in recent years to incorporate lip movements into speech recognition; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of relationships between visual and speech information, specifically using lip geometry information because of its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate the visual and speech modalities. This thesis makes several contributions. First, it presents a new method to extract lip geometry features using a combination of a skin colour filter, a border-following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared with existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed, which determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual or the speech modality is chosen by measuring the quality of the audio through kurtosis and skewness analysis under white-noise contamination.
    Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise-ratio conditions using the NOISEX-92 dataset.
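The audio-quality test behind the decision fusion can be sketched as follows, assuming (as the abstract suggests) that clean speech is strongly non-Gaussian while added white noise drives kurtosis and skewness toward their Gaussian values of zero; the function name and scoring rule are illustrative, not the thesis's exact measure:

```python
import numpy as np

def audio_reliability(frame):
    """Score an audio frame by how far its skewness and excess kurtosis
    depart from Gaussian statistics. Clean (speech-like, heavy-tailed)
    signals score high; white-noise-dominated frames score near zero,
    signalling that the visual modality should be preferred."""
    x = np.asarray(frame, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-12)   # standardise the frame
    skewness = np.mean(x ** 3)               # third standardised moment
    excess_kurtosis = np.mean(x ** 4) - 3.0  # fourth moment minus Gaussian 3
    return abs(skewness) + abs(excess_kurtosis)

rng = np.random.default_rng(1)
speech_like = rng.laplace(size=16000)   # heavy-tailed, speech-like samples
noise_like = rng.normal(size=16000)     # Gaussian white noise
```

A fusion rule would then compare `audio_reliability` against a threshold calibrated on clean data and fall back to the visual classifier's decision when the score collapses.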

    Phonetic and dialectal variation in phonologically contrastive laryngealisation : A case study of the Danish stød

    A growing body of research is documenting the variability of the phonetic manifestation of phonological contrasts in voice quality. This thesis is an empirical addition to the typology of cross-linguistic phonetic variation in the production of phonological laryngealisation. This variation is investigated through a comparative lens of dialectal differences within a language, with the Laryngeal Articulator Model (Esling et al. 2019) as a guiding framework. Dialectal differences are analysed using the Danish stød as a case study, comparing speakers from Copenhagen and Aarhus. The stød has previously been reported to be highly variable in its production, making it an excellent source of phonetic variation. Further, the dialect of Aarhus has been speculated to contain a tonal stød variant under certain conditions (Kyst 2004) which has not yet been studied phonetically, making the inclusion of this dialect novel both from a typological perspective and in the context of the phonetics of the stød specifically. Data from 10 speakers of Modern Standard Copenhagen and 11 speakers from Aarhus are analysed using both acoustic and articulatory measurements. Acoustic measures include fundamental frequency, intensity, the amplitude difference between the first and second harmonic (H1-H2), Cepstral Peak Prominence, Harmonics-to-Noise Ratio and Subharmonic-to-Harmonic Ratio. The articulatory measures are obtained via electroglottography, a novel method in the context of investigating the stød, which enables the vocal fold contact patterns to be modelled for analysis. Three main research questions are explored: (i) how much gradient phonetic variation in voice quality occurs during the Danish stød, (ii) which acoustic and articulatory measurements correlate with the subtypes of the stød, and (iii) how the voice quality changes are timed.
    The different phonetic types of laryngealisation are categorised with reference to five subtypes of creaky voice described in Keating et al. (2015). To accommodate the expected variation in the Aarhus dialect, the stød is divided into two types for analysis, regular and tonal, both elicited in a contrastive minimal pair. The differences are analysed dynamically using two different statistical methods, Generalised Additive Mixed Models and random forest models. The findings generally confirm that investigating dialectal differences when exploring types of contrastive non-modal phonation is a rich resource to draw upon in widening our empirical understanding of phonetic variation in phonological voice quality across different languages. They demonstrate that stød in Danish is not just one type of stød, and that even the standard Copenhagen variant exhibits differences in acoustics and articulation based on its stød basis. The study of timing finds high variability but uncovers some general patterns according to stød type and dialect, solidifying timing as a rich source of phonetic differences. Lastly, the findings from this study support the notion that the larynx is an active articulator and that various structures within it can affect voice quality independently of activity in the glottis. This encourages more research into how these different laryngeal structures interact and in what ways this affects commonly used acoustic and articulatory correlates of these mechanisms.
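Several of the acoustic measures listed in the abstract are spectral amplitude comparisons. As a minimal illustration, H1-H2 can be read off the magnitude spectrum at f0 and 2·f0; this is a sketch on a synthetic two-harmonic signal, whereas real stød analysis would use pitch-tracked and typically formant-corrected estimates:

```python
import numpy as np

def h1_h2(signal, fs, f0):
    """H1-H2 in dB: the amplitude difference between the first two
    harmonics, taken from the magnitude spectrum at f0 and 2*f0."""
    spec = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    h1 = spec[np.argmin(np.abs(freqs - f0))]        # bin nearest f0
    h2 = spec[np.argmin(np.abs(freqs - 2 * f0))]    # bin nearest 2*f0
    return 20.0 * np.log10(h1 / h2)

# Synthetic one-second signal: H1 amplitude 1.0, H2 amplitude 0.5,
# so H1-H2 should be 20*log10(2) ~= 6.02 dB.
fs, f0 = 8000, 100
t = np.arange(fs) / fs
x = 1.0 * np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
```

Creaky or laryngealised phonation typically lowers H1-H2 relative to modal voice, which is why this measure is sensitive to the stød.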

    Über die Zusammenhänge zwischen Grundfrequenz und Vokalhöhe (On the Relationships between Fundamental Frequency and Vowel Height)

    This dissertation starts, like many previous studies, from the assumption that fundamental frequency is connected to the perception of vowels, specifically to vowel height, and discusses the consequences that follow from this; it also presents new evidence that, under certain conditions, fundamental frequency can be actively varied to produce vowel-height distinctions. In a longitudinal study, recordings spanning several decades, made by the same British speakers and controlled for equivalence of the communicative situation, were examined to determine how ageing in adult speakers affects mean fundamental frequency and the formants F1, F2 and F3 in the neutral vowel schwa, or equivalently in the mean formant values over all voiced portions of the signal. Women’s fundamental frequency is described as falling with age, while men show a fundamental frequency that first falls and later rises. The second formant changes only marginally, and F3 likewise shows no significant changes consistent across all speakers. By contrast, F1 changes clearly with increasing age, and in most speakers in the same direction as fundamental frequency. In data from one male and one female speaker who were recorded regularly at short intervals over half a century, a clear covariation of the first formant with fundamental frequency emerges, with the distance between F1 and fundamental frequency on a logarithmic scale remaining relatively invariant even across decades.
    The corresponding hypothesis is that age-related formant changes are attributable less to physiological changes in the dimensions of the vocal tract than to a compensatory adjustment of the first formant in response to a perturbation of the vowel-height percept caused by the (physiologically conditioned) changes in fundamental frequency. This hypothesis implies that the speaker/listener’s vowel-height percept is determined by the first formant judged in relation to f0. To test this last conclusion further, German speakers were exposed in two experiments to acoustically altered auditory feedback in quasi-real time, and their acoustic data were analysed. Both perturbations aimed to influence the vowel-height percept, directly or indirectly: under a perturbation of the first formant, speakers compensated by producing F1 in the direction opposite to the perturbation. Simultaneous changes in the produced fundamental frequency can partly be interpreted as automatic coupling effects; under certain conditions, however, some speakers appear to vary f0 actively and independently of F1 in order to reach the intended vowel height. Under a perturbation of fundamental frequency, some speakers vary the first formant in a way which suggests that the F1-f0 distance, still shifted relative to the unperturbed values because f0 compensation was only partial, influenced the vowel-height percept, leading to a compensatory counter-movement in the form of vowel-height variation. A perception experiment using continua between front half-close and close vowels, manipulated solely by varying fundamental frequency, in words of identical context presented embedded in carrier sentences, showed that the fundamental-frequency variation influenced the vowel percept in only about half of the German listeners.
    The vowel-intrinsic cue is nevertheless used despite the disturbing influence of extrinsic factors, and despite the intonational function of fundamental frequency. The hypothesis, motivated by results on intrinsic fundamental frequency in German, that German listeners use the F1-f0 distance as a vowel-height cue to a greater extent in a continuum between lax vowels than in a continuum between tense vowels could not be confirmed. In general, all three experimental parts of this dissertation provide further evidence that, at least in English and German, two languages comparatively rich in vowel-height contrasts, many, but not all, speakers/listeners use fundamental frequency in addition to F1 for vowel-height perception and production.
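The F1-f0 distance that remains invariant across decades can be made concrete as an interval on a logarithmic scale; the semitone scaling and the numeric values in this sketch are illustrative, not the dissertation's data:

```python
import numpy as np

def f1_f0_distance(f1_hz, f0_hz):
    """F1-f0 distance on a logarithmic scale, expressed in semitones:
    12 * log2(F1 / f0). If F1 co-varies proportionally with f0 as a
    speaker ages, this value stays constant."""
    return 12.0 * np.log2(np.asarray(f1_hz, dtype=float) /
                          np.asarray(f0_hz, dtype=float))

# Illustrative: f0 and F1 both drop by the same factor (here 10%),
# so the log-scale distance is unchanged across the two recordings.
early = f1_f0_distance(500.0, 120.0)   # hypothetical younger recording
late = f1_f0_distance(450.0, 108.0)    # hypothetical older recording
```

The same ratio-based computation underlies the compensation account: a perturbation of f0 that is not fully compensated shifts this interval, and speakers who track the interval adjust F1 to restore it.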