
    Optimizing Vocabulary Modeling for Dysarthric Voice User Interface

Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School, Interdisciplinary Program in Cognitive Science, February 2016. Advisor: ์ •๋ฏผํ™”.

For speakers with dysarthria using a voice user interface, frequent articulation errors, disfluency, frequent and irregular pauses, and a slow speaking rate cause misrecognition. Previous studies analyzed the acoustic and phonological characteristics of dysarthric speech and, on that basis, compensated for recognition errors through correction of dysarthric speech, acoustic model adaptation, pronunciation variation modeling, and grammar and vocabulary modeling. In this thesis, the acoustic model was optimized to reflect the characteristics of dysarthric speech. In addition, when constructing the vocabulary, the relationship between phoneme-class-based articulatory features and recognition errors was modeled and used as a word selection criterion, and the recognition errors that arise when dysarthric speakers use a voice interface were further reduced by lowering the similarity between words.

To build an acoustic model for dysarthric speakers, first, the feature-extraction window size and the number of states per HMM were adjusted to the slow speaking rate of dysarthric speakers, which lowered the error rate. Second, GMM, Subspace GMM, and DNN models were introduced as the HMM emission probability model and their recognition errors were compared. Third, the effectiveness of adding non-dysarthric speech as a remedy for the shortage of training data was verified through recognition experiments.

In the mixed-model (GLMM) analysis of articulatory features and recognition error rates, the consonant classes correlated with the error rate at the 0.05 significance level were fricatives and nasals, and all vowel classes were correlated with the error rate at the 0.05 significance level. The mixed model also showed a lower AIC when consonants were categorized by manner of articulation rather than by place of articulation, and when vowels were categorized by tongue position rather than by tongue height. With phonemes categorized by manner of articulation and tongue position, fricatives raised the recognition error rate the more often they appeared in a word, whereas nasals raised recognition accuracy. The effects of plosives, affricates, and liquids on the recognition result were close to zero. All vowel classes raised recognition accuracy; central vowels had the largest effect, followed in decreasing order by back vowels, front vowels, and diphthongs.

The likelihood of words triggering recognition errors for one another was modeled as inter-word similarity based on string distances such as the Levenshtein distance and the cosine distance, and a minimum-maximum similarity comparison together with N-best estimation confirmed that inter-word similarity so defined can be a factor affecting speech recognition. In constructing the vocabulary for dysarthric speech recognition, an articulation score was first computed for each word with the articulatory-feature-based mixed model, and the recognition word list was built from the words maximizing this score. The vocabulary was then revised to minimize inter-word similarity; in the resulting experiments, recognition errors decreased by 5.7% absolute and 34.6% relative compared with the conventional call-word (ํ†ตํ™”ํ‘œ) list.
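A minimal sketch of the wider analysis window mentioned above, assuming 16 kHz audio; the file name and the 25 ms / 35 ms window lengths are hypothetical placeholders rather than values reported in the thesis, and the matching HMM state-count change is a recognizer-side setting not shown here.

```python
import librosa

def mfcc_with_window(path, win_ms, hop_ms=10.0, sr=16000, n_mfcc=13):
    """Extract MFCCs with a configurable analysis window length (ms)."""
    y, sr = librosa.load(path, sr=sr)
    win = int(sr * win_ms / 1000)           # window length in samples
    hop = int(sr * hop_ms / 1000)           # frame shift in samples
    n_fft = 1 << (win - 1).bit_length()     # next power of two >= window
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, win_length=win, hop_length=hop)

# 25 ms is a common front-end default; a (hypothetical) 35 ms window gives the
# spectral analysis more temporal context for slow dysarthric speech.
baseline = mfcc_with_window("utterance.wav", win_ms=25.0)
widened  = mfcc_with_window("utterance.wav", win_ms=35.0)
```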
Articulation errors, disfluency, impulsive pauses, and a low speaking rate have been suggested as factors behind recognition errors for dysarthric speakers using a voice user interface. In related work, correction of dysarthric speech, acoustic model (AM) adaptation, pronunciation variation modeling, grammar modeling, and vocabulary modeling, all based on acoustic and phonetic analyses of dysarthric speech, were proposed to compensate for these factors.

In this thesis, the acoustic model was optimized, and the words in the vocabulary were selected with a GLMM that modeled the relationship between recognition errors and the articulatory features of phoneme classes, then further optimized by lowering the similarity between words. Three problems in training an AM for dysarthric speech recognition were addressed: first, the low speaking rate was compensated by varying the FFT window length and the number of HMM states; second, models for the HMM emission probability were compared; third, an AM trained on a large amount of non-dysarthric speech was evaluated. In the analysis of the relation between recognition errors and consonant classes, fricatives and nasals were statistically significant, and all vowel classes were significant. The AIC was lower when consonants were categorized by manner of articulation rather than by place, and when vowels were categorized by tongue position rather than by tongue height. Fricatives increased the WER, whereas nasals increased recognition accuracy; the estimates for plosives, affricates, and liquids were close to zero. All vowel classes increased accuracy, with the estimate for central vowels the largest, followed by back vowels, front vowels, and diphthongs. The triggering of recognition errors by competing words was modeled as inter-word similarity based on the Levenshtein distance and the cosine distance, and the effect of inter-word similarity on recognition results was confirmed by a minimum-maximum similarity contrast and by N-best prediction. Before modeling the vocabulary, an articulation score was calculated for each word: first, the vocabulary was composed of the words with the maximum articulation scores; second, words in the vocabulary with high similarity were replaced by words with lower similarity and large articulation scores.
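A minimal sketch of the two string-based similarity measures named above, assuming words are represented as phoneme sequences; the example sequences are hypothetical romanizations, not items from the thesis vocabulary.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two phoneme sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def cosine_distance(a, b):
    """1 - cosine similarity between the phoneme-count vectors of two words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / norm

# Hypothetical romanized phoneme sequences for two candidate command words.
w1 = list("namu")
w2 = list("nabi")
print(levenshtein(w1, w2))                 # -> 2
print(round(cosine_distance(w1, w2), 3))   # -> 0.5
```

In the vocabulary optimization step described above, a word whose similarity to the rest of the list is high would then be swapped for a candidate with lower similarity and a comparably large articulation score.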
In dysarthric speech recognition experiments, the optimized vocabulary lowered the WER by 5.72% absolute (a 34.60% relative error reduction).
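A quick arithmetic check of how the two figures relate, assuming the absolute and relative reductions refer to the same baseline test condition; the implied baseline and optimized WERs below are inferences, not numbers reported in the abstract.

```python
# Relative error reduction = absolute WER drop / baseline WER.
absolute_drop = 5.72         # percentage points
relative_drop = 0.3460       # 34.60 % relative error reduction
baseline_wer = absolute_drop / relative_drop    # implied baseline, ~16.5 %
optimized_wer = baseline_wer - absolute_drop    # implied optimized, ~10.8 %
print(round(baseline_wer, 1), round(optimized_wer, 1))
```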
Contents

Chapter 1  Introduction
Chapter 2  Related Work
    Section 1  Analysis of articulation errors in dysarthric speech
    Section 2  Architecture and models of a speech recognition system
        2.1  Feature extraction
        2.2  Acoustic model
        2.3  Pronunciation model
        2.4  Language model
    Section 3  Application areas of speech recognition systems
    Section 4  Recognition of dysarthric speech
Chapter 3  Building the feature extraction and acoustic model baseline
    Section 1  Approach
    Section 2  Development environment
        2.1  Speech data
        2.2  Recognition environment and acoustic model training procedure
    Section 3  Recognition experiments
        3.1  Speaking-rate modeling
        3.2  Optimization of emission probability model parameters
        3.3  Composition of acoustic model training data
        3.4  Analysis and summary of experimental results
    Section 4  Conclusion
Chapter 4  Word selection criteria based on articulation error characteristics
    Section 1  Definition of Korean phonemes and categorization by articulatory features
    Section 2  Research objectives
    Section 3  Analysis data
        3.1  Speech data
        3.2  Phoneme categories and statistics
        3.3  Speech recognition
        3.4  Data analysis
    Section 4  Analysis results
    Section 5  Conclusion
Chapter 5  Vocabulary optimization by minimizing inter-word similarity
    Section 1  String-distance-based inter-word similarity
    Section 2  Effect of inter-word similarity on the recognition rate
    Section 3  N-best estimation
Chapter 6  Recognition experiments
    Section 1  Construction of the word lists
        1.1  Baseline word list
        1.2  Articulation-score-maximizing word list
        1.3  Inter-word-similarity-minimizing word list
    Section 2  Preliminary experiments
    Section 3  Experimental setup
        3.1  Composition of the speech corpus
    Section 4  Recognition results
        4.1  Baseline model
        4.2  Articulation-score-maximizing model
        4.3  Inter-word-similarity-minimizing model
Chapter 7  Conclusion
    Section 1  Summary and evaluation of results
    Section 2  Summary of contributions
    Section 3  Future work
References
Appendix
Abstract (in English)