
    Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

    In this work, we investigated the teacher-student training paradigm for training a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model for use in the speech recognition system. In the student, both the multi-channel feature extraction layers and the higher classification layers were trained jointly using the logits from the teacher model. In our experiments, compared to a baseline model trained on about 600 hours of transcribed data, a relative word error rate (WER) reduction of about 27.3% was achieved when using an additional 1,800 hours of untranscribed data. We also investigated the benefit of pre-training the multi-channel front end to output the beamformed log-mel filter bank energies (LFBE) using an L2 loss. We found that pre-training improves the WER by 10.7% compared to a multi-channel model whose front end is directly initialized with beamformer and mel filter bank coefficients. Finally, combining pre-training and teacher-student training produces a WER reduction of 31% compared to our baseline.
    Comment: To appear in ICASSP 202
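    As a rough illustration of the two objectives this abstract describes, the PyTorch sketch below pairs an L2 pre-training loss against beamformed LFBE targets with a teacher-student loss on the teacher's logits, trained jointly through the front end and classifier. The front-end architecture, layer sizes, temperature, and output dimensionality are illustrative assumptions, not the paper's actual model.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiChannelFrontEnd(nn.Module):
        """Learnable stand-in for beamforming plus mel filter bank: maps framed
        multi-channel audio to LFBE-like features (shapes are assumptions)."""
        def __init__(self, num_channels=4, frame_size=400, num_mel=64):
            super().__init__()
            self.spatial = nn.Conv1d(num_channels, 1, kernel_size=1)  # channel combination
            self.fb = nn.Linear(frame_size, num_mel)                  # filter-bank-like projection

        def forward(self, x):              # x: (batch, channels, frames, frame_size)
            b, c, t, s = x.shape
            y = self.spatial(x.reshape(b, c, t * s)).reshape(b, t, s)
            return torch.log1p(self.fb(y).abs())  # non-negative, log-compressed

    front_end = MultiChannelFrontEnd()
    classifier = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 3000))
    params = list(front_end.parameters()) + list(classifier.parameters())
    optim = torch.optim.Adam(params, lr=1e-4)

    def pretrain_step(waveforms, target_lfbe):
        """Stage 1: L2 (MSE) loss pulling the learnable front end towards
        LFBE features computed from beamformed audio."""
        optim.zero_grad()
        loss = F.mse_loss(front_end(waveforms), target_lfbe)
        loss.backward()
        optim.step()
        return loss.item()

    def distill_step(waveforms, teacher_logits, T=2.0):
        """Stage 2: teacher-student training; front end and classification layers
        are trained jointly from the teacher's logits (no transcripts needed)."""
        optim.zero_grad()
        student_logits = classifier(front_end(waveforms))
        loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * T * T
        loss.backward()
        optim.step()
        return loss.item()

    # Toy usage with random stand-in tensors (4 channels, 10 frames of 400 samples);
    # teacher_logits stands in for the per-frame outputs of the offline teacher.
    waves = torch.randn(2, 4, 10, 400)
    print(pretrain_step(waves, torch.randn(2, 10, 64).abs()))
    print(distill_step(waves, torch.randn(2, 10, 3000)))
    ```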

    Multi-stream Gaussian mixture model based facial feature localization

    This paper presents a new facial feature localization system that estimates the positions of the eyes, nose, and mouth corners simultaneously. In contrast to conventional systems, we use the multi-stream Gaussian mixture model (GMM) framework to represent both the structural and the appearance information of facial features. We construct a GMM for the region of each facial feature, where principal component analysis (PCA) is used to extract the appearance features for each region. We also build a GMM that represents the structural information of a face, namely the relative positions of the facial features. These models are combined in the multi-stream framework, which reduces the computation time needed to search the region of interest (ROI). We demonstrate the effectiveness of our algorithm through experiments on the BioID Face Database.
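    The sketch below illustrates, under stated assumptions, how a multi-stream GMM score of the kind described here could combine the two streams: per-feature appearance GMMs over PCA-projected patches, plus one structural GMM over centred relative positions, mixed with stream weights. The feature set, dimensions, stream weights, and the random stand-in training data are all placeholders, not the paper's configuration.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    features = ["left_eye", "right_eye", "nose", "mouth_left", "mouth_right"]

    # Appearance stream: one GMM per facial feature over PCA-projected 11x11 patches.
    # Random stand-in data replaces real labelled training patches here.
    train_patches = {f: rng.normal(size=(500, 121)) for f in features}
    pca = {f: PCA(n_components=8).fit(train_patches[f]) for f in features}
    appearance_gmm = {f: GaussianMixture(n_components=4, random_state=0)
                      .fit(pca[f].transform(train_patches[f])) for f in features}

    # Structure stream: one GMM over the centred (x, y) positions of all features,
    # capturing their relative geometry independently of translation.
    train_pos = rng.normal(size=(500, len(features), 2))
    train_pos -= train_pos.mean(axis=1, keepdims=True)
    structure_gmm = GaussianMixture(n_components=4, random_state=0).fit(
        train_pos.reshape(500, -1))

    def score_candidate(patches, coords, stream_weights=(0.7, 0.3)):
        """Weighted sum of appearance and structure log-likelihoods for one
        candidate set of facial feature positions."""
        w_app, w_struct = stream_weights
        app = sum(appearance_gmm[f].score_samples(
            pca[f].transform(patches[f].reshape(1, -1)))[0] for f in features)
        centred = coords - coords.mean(axis=0)
        struct = structure_gmm.score_samples(centred.reshape(1, -1))[0]
        return w_app * app + w_struct * struct

    # Usage: score one candidate configuration; a localizer would keep the
    # highest-scoring configuration among the candidates inside the ROI.
    patches = {f: rng.normal(size=121) for f in features}
    coords = rng.normal(size=(len(features), 2))
    print(score_candidate(patches, coords))
    ```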

    A Study of Ri Kuk-ro's 『실험도해 조선어 음성학』 (Experimental Illustrated Phonetics of Korean; November 1949, Pyongyang)

    The Korean-language scholar Ri Kuk-ro (李克魯), who encountered state-of-the-art phonetics while studying in Europe in the 1920s, published 『실험도해 조선어 음성학』 (Experimental Illustrated Phonetics of Korean) in Seoul on 15 November 1947 (hereafter the "Seoul edition"). After moving to North Korea in April 1948, he published a revised and enlarged edition through the Korean Language and Literature Research Society (조선어문연구회) in Pyongyang on 25 November 1949 (hereafter the "Pyongyang edition"). Previous studies on Ri Kuk-ro mention only the fact that this Pyongyang edition was published, so researchers appear to know little about its actual contents. This study compares and contrasts the two texts, focusing on the passages newly added in the Pyongyang edition and the passages deleted from the Seoul edition, with the aim of clarifying the differences between the two documents.

    Regarding Ri Kuk-ro's serving as a subject (被驗者) of Korean phonetic experiments at the University of Paris, the preface of the Seoul edition says "for one month (一個月) in the spring of 1928," whereas the preface of the Pyongyang edition revises this to "for one month in March 1928." Since some other previous studies give "May 1928," the matter requires further examination.

    After going north, Ri Kuk-ro conducted phonetic research using a recorder at the National Film Studio in Pyongyang and an oscillograph at the Ministry of Communications (遞信省), and those results were added to the Pyongyang edition. The preface states that his study of Korean accent benefited from the cooperation of the composer Kim Sun-nam (金順男), who had also gone north; the description of the pitch and stress accent of Gyeonggi Province on a musical staff may reflect his influence. Moreover, since Kim Sun-nam was from Nagwon-dong in central Seoul, he may have been a subject in Ri Kuk-ro's study of the Gyeonggi accent.

    The vocabulary and expressions of the Pyongyang edition revise the Seoul edition in many places, and by making good use of native Korean morphemes (語素) the Pyongyang edition shifted to a style of pronounced national character. In addition, one month before the Pyongyang edition appeared, the Korean Language and Literature Research Society published 『조선어문법』 (Korean Grammar), of which Ri Kuk-ro was one of the principal authors. The Pyongyang edition also shows similarities to 『조선어문법』, such as the adoption of the sound-break mark (絶音符) and the use of the "light hieut" letter for the phonetic notation of the glottal plosive (聲帶破障音).

    Ri Kuk-ro is one of the indispensable figures in the history of Korean language research, and this Pyongyang edition, too, deserves renewed attention.

    Book Review: The Korean Language Society Incident under the Peace Preservation Law (治安維持法下の朝鮮語学会事件)


    Editor's Postscript (編集後記)


    Editor's Postscript (編集後記)


    Subband beamforming with higher order statistics for distant speech recognition

    This dissertation presents novel beamforming methods for distant speech recognition (DSR). Such techniques can relieve users from the necessity of wearing close-talking microphones. DSR systems are useful in many applications, such as humanoid robots, voice control systems for automobiles, and automatic meeting transcription systems. A main problem in DSR is that recognition performance degrades severely when the speaker is far from the microphones. To avoid this degradation, noise and reverberation must be removed from the signals received at the microphones. Acoustic beamforming techniques have the potential to enhance speech from the far field with little distortion, since they can maintain a distortionless constraint for a look direction.

    In beamforming, signals propagating from a position are captured with multiple microphones. Typical conventional beamformers then adjust their weights so as to minimize the variance of their outputs subject to a distortionless constraint in the look direction. Since the variance is the average of the square of the beamformer's outputs, conventional beamformers use second-order statistics (SOS) of their outputs. Such techniques can effectively place a null on any source of interference, but in reverberant environments they also cancel the desired signal, which is known as the signal cancellation problem. Many algorithms have been developed to mitigate this problem, but none of them essentially solves it in reverberant environments.

    While many efforts have been made to overcome the signal cancellation problem in the field of acoustic beamforming, researchers have also addressed another microphone array problem: blind source separation (BSS) [1]. BSS techniques aim to separate sources from a mixture of signals without information about the geometry of the microphone array or the positions of the sources. This is achieved by multiplying the input signals with an un-mixing matrix, constructed so that the outputs are stochastically independent. Measuring the stochastic independence of the signals is based on the theory of independent component analysis (ICA) [1]. The field of ICA builds on the fact that distributions of information-bearing signals are not Gaussian, whereas distributions of sums of various signals are close to Gaussian. There are two popular criteria for measuring the degree of non-Gaussianity: kurtosis and negentropy. As described in detail in this thesis, both criteria use more than the second moment; they are therefore referred to as higher-order statistics (HOS), in contrast to SOS. HOS have not been well explored in the field of acoustic beamforming, although Arai et al. showed the similarity between acoustic beamforming and BSS [2].

    This thesis investigates new beamforming algorithms that take higher-order statistics (HOS) into consideration. The new beamforming methods adjust the beamformer's weights based on one of the following criteria:
    • minimum mutual information of two beamformers' outputs,
    • maximum negentropy of the beamformer's outputs, and
    • maximum kurtosis of the beamformer's outputs.
    As shown in this thesis, these algorithms do not suffer from signal cancellation.
    Notice that, in contrast to the BSS algorithms, the new beamforming techniques can keep the distortionless constraint for the direction of interest. The effectiveness of the new techniques is finally demonstrated through a series of distant automatic speech recognition experiments on real data recorded with real sensors, unlike other work in which signals artificially convolved with measured impulse responses are considered. Significant improvements are achieved by the beamforming algorithms proposed here.
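    The abstract above names three HOS criteria; the numpy sketch below illustrates the last of them, maximum-kurtosis adaptation, on toy subband data. A generalised sidelobe canceller (GSC) parameterisation is used so that the distortionless constraint holds by construction, and a simple random-search optimiser stands in for a real gradient method; both the GSC structure and the optimiser here are illustrative assumptions rather than the thesis's actual algorithms.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    M, n = 4, 5000                            # microphones, subband snapshots

    # Toy subband scene: a super-Gaussian (Laplacian) desired source from the look
    # direction, a Gaussian interferer from elsewhere, and sensor noise.
    d = np.ones(M, dtype=complex)             # steering vector of the look direction (broadside)
    a = np.exp(2j * np.pi * rng.random(M))    # steering vector of the interferer
    s = rng.laplace(size=n) + 1j * rng.laplace(size=n)
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    X = (np.outer(d, s) + np.outer(a, v)
         + 0.1 * (rng.normal(size=(M, n)) + 1j * rng.normal(size=(M, n))))

    # GSC: the quiescent weights satisfy the distortionless constraint w^H d = 1;
    # the blocking matrix B spans the subspace orthogonal to d, so adapting the
    # active weights w_a can never distort the look direction.
    w_q = d / (d.conj() @ d)
    P = np.eye(M) - np.outer(d, d.conj()) / (d.conj() @ d)
    B = np.linalg.svd(P)[0][:, :M - 1]        # orthonormal basis of range(P)

    def kurtosis(y):
        """Scale-invariant empirical kurtosis of a (circular) complex signal."""
        p = np.abs(y) ** 2
        return (p ** 2).mean() / p.mean() ** 2 - 2.0

    def output(w_a):
        return (w_q - B @ w_a).conj() @ X     # beamformer output for active weights w_a

    w_a = np.zeros(M - 1, dtype=complex)
    for _ in range(2000):                     # simple random-search ascent on kurtosis
        trial = w_a + 0.02 * (rng.normal(size=M - 1) + 1j * rng.normal(size=M - 1))
        if kurtosis(output(trial)) > kurtosis(output(w_a)):
            w_a = trial

    w = w_q - B @ w_a
    print("distortionless response:", abs(w.conj() @ d))  # stays 1 by construction
    print("output kurtosis:", kurtosis(output(w_a)))      # should rise as the Gaussian interferer is suppressed
    ```

    Because the active weights act only in the subspace orthogonal to the look direction, maximising the non-Gaussianity of the output removes interference and reverberation without the signal cancellation that plagues variance-minimising beamformers.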