13 research outputs found

    Speech enhancement practical in diverse environments with fluctuating noise characteristics

    University of Tsukuba, 201

    Near/far sound separation by time-frequency mask estimation using near-field sound extraction based on spherical harmonic expansion

    We propose a combination of physical-model-based and deep-learning (DL)-based source separation for near- and far-field source separation. The DL-based near- and far-field separation method uses acoustic features based on spherical harmonic analysis. Deep learning is a state-of-the-art technique for source separation; in this approach, a bidirectional long short-term memory (BLSTM) network is used to predict a time-frequency (T-F) mask. To predict a T-F mask accurately, it is necessary to use acoustic features that have high mutual information with the oracle T-F mask. In this study, low-frequency-band near- and far-field sources are estimated by spherical harmonic analysis and used as acoustic features. A DNN then predicts a T-F mask to separate all frequency bands. Our experimental results show that the proposed method improves the signal-to-distortion ratio (SDR) by 8-10 dB compared with the harmonic-analysis-based method. In addition, the proposed method improves PESQ and STOI compared with the conventional DL-based T-F mask estimation method.
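To make the T-F masking idea above concrete, here is a minimal illustrative sketch (not the paper's code, and using the oracle ideal ratio mask instead of a BLSTM prediction): a per-bin mask is computed from the known near- and far-field magnitudes and applied element-wise to the mixture spectrogram.

```python
import numpy as np

def ideal_ratio_mask(near_mag, far_mag, eps=1e-8):
    """IRM: per T-F bin, ratio of target magnitude to total magnitude."""
    return near_mag / (near_mag + far_mag + eps)

def apply_mask(mixture_spec, mask):
    """Element-wise masking of the mixture spectrogram."""
    return mask * mixture_spec

# Toy magnitude spectrograms: 4 frequency bins x 3 time frames.
near = np.array([[1.0, 0.0, 2.0],
                 [0.0, 3.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [0.0, 0.0, 4.0]])
far = np.array([[0.0, 2.0, 0.0],
                [1.0, 0.0, 2.0],
                [1.0, 1.0, 1.0],
                [3.0, 0.0, 0.0]])

mix = near + far                      # simplistic mixture (phase ignored)
mask = ideal_ratio_mask(near, far)    # in the paper: predicted by a BLSTM
est_near = apply_mask(mix, mask)      # recovered near-field source
```

In the paper's pipeline, the mask is not the oracle IRM but a BLSTM output, and the spherical-harmonic estimates of the low-frequency near/far sources serve as the network's input features.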

    A study of probabilistic objective functions for deep-learning-based sound source information estimation

    This thesis studies the estimation of "sound source information", that is, information related to sound such as the source signal and the type or state of the sound source, from acoustic signals observed with microphones. As concrete tasks it focuses on "source enhancement", which estimates a source signal from an observed mixture of the source and noise, and "anomalous sound detection", which estimates the type or state of environmental sounds in the observed signal in order to predict or sense surrounding danger. If source enhancement can take latent source information such as the type and state of a source into account, the voice of a particular player or the sound of a ball being kicked could be extracted in a soccer stadium filled with cheering, offering users a way of experiencing content as if they were inside the stadium. If anomalous sound detection is realized, whether a machine is operating normally or anomalously (its state) can be inferred from its operating sound, making manufacturing and maintenance work more efficient.

    Approaches based on statistical machine learning have been studied for source information estimation, and in recent years applying deep learning has greatly improved estimation accuracy. In deep-learning-based estimation, a neural network serves as a nonlinear mapping from the observed signal to the desired source information, and the network is trained to maximize or minimize an "objective function" that evaluates estimation accuracy. Most deep learning uses deterministic objective functions such as the squared error or the cross-entropy.

    In source information estimation, designing the objective function is equivalent to defining the nature of the desired source information and its estimation accuracy. For some kinds of source information, however, a deterministic objective function cannot define these properties, or it is not appropriate for it to do so. For example, deterministic objectives cannot be adopted for estimating a source signal that maximizes human subjective sound quality, or for estimating the state of a source for which anomalous sounds (label data) cannot be collected. Solving this problem requires advancing not only the network architecture but also the objective functions used for training.

    This thesis therefore studies objective functions for deep-learning-based estimation of source information whose objectives cannot be designed as deterministic functions. The central idea is to define the nature and estimation accuracy of the desired source information as probability distributions or as sets of values that the inputs and outputs should take, according to the characteristics of the target information and the problem to be solved, and to express the statistical properties that the network's inputs and outputs should satisfy as the objective function.

    Chapter 3 proposes a method for enhancing source signals for which sufficient label data does not exist, such as the sounds of sporting plays. To train a neural network from a small amount of data, acoustic features designed or selected in advance must be extracted from the observed signal and enhancement performed with a small network. Chapter 3 studies selecting acoustic features suited to enhancing the desired source based on mutual information maximization. By applying the "kernel dimension reduction" method, which computes mutual information accurately, to feature selection over a high-dimensional candidate pool, a differentiable objective function based on sparse regularization is derived, yielding a method that selects suitable acoustic features from a large pool of candidates by gradient descent. Quantitative evaluations showed improved SDR over a conventional feature selection method, and subjective evaluations showed that features selected by the proposed method improve the clarity of the enhanced source compared with the conventional method. This makes it possible to estimate source signals that were previously considered difficult: those for which sufficient training data cannot be obtained, and those never before targeted for enhancement, for which suitable acoustic features are unknown.
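The feature-selection idea of Chapter 3 can be illustrated with a deliberately simplified sketch. The thesis derives a differentiable, sparsity-regularised objective using kernel dimension reduction; the toy version below (an assumption for illustration, not the thesis's method) instead ranks discrete candidate features by their mutual information with the target and keeps the top k.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """MI in nats between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def select_top_k(features, target, k):
    """features: dict name -> discrete sequence; keep k highest-MI names."""
    ranked = sorted(features,
                    key=lambda f: mutual_information(features[f], target),
                    reverse=True)
    return ranked[:k]

target = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "informative": [0, 0, 1, 1, 0, 1, 0, 1],   # identical to the target
    "noisy":       [0, 1, 1, 0, 0, 1, 1, 1],   # partially correlated
    "constant":    [0, 0, 0, 0, 0, 0, 0, 0],   # carries zero MI
}
best = select_top_k(features, target, 2)
```

Ranking by MI scales poorly when the candidate pool is large and features are continuous, which is exactly why the thesis replaces exhaustive scoring with a gradient-based, sparsity-regularised selection.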
    Chapter 4 proposes a method for enhancing source signals whose label data cannot be determined uniquely and for which defining estimation accuracy with objectives such as the squared error is not appropriate, with the goal of improving the subjective quality of the enhanced output. Conventional deep-learning-based source enhancement uses, for example, the amplitude spectrum of the source as label data and trains the network to minimize the squared error between the network output and the labels; this distorts the output sound and degrades its subjective quality. Chapter 4 instead proposes an objective function that, without preparing label data, maximizes a sound quality score (a perceptual score) that correlates highly with subjective ratings. Quantitative evaluations confirmed that the proposed objective trains the network to maximize the perceptual score, and subjective evaluations showed that the proposed method enhances sources with higher subjective quality than enhancement based on conventional squared-error minimization. This makes "higher-order" evaluation measures such as perceptual scores and human ratings, which could not previously be used for training, available as objective functions, broadening the range of applications of neural-network-based source enhancement.

    Chapter 5 aims at "anomalous sound detection": detecting sounds that do not normally occur, such as anomalous motor rotation noise or the impact sounds of a bearing, and judging from them whether a machine is operating normally or anomalously so that failures can be detected. The difficulty of this problem is that machines fail extremely rarely, so anomalous operating sounds (label data) cannot be collected, and the cross-entropy, the standard objective for classification networks, cannot be used. Chapter 5 defines an anomalous sound as one that differs statistically from the probability distribution of normal sounds, casting anomaly detection as hypothesis testing, and derives the "Neyman-Pearson score" from the Neyman-Pearson lemma, the optimality criterion of hypothesis testing, as an objective function for optimizing the detector. Quantitative evaluations showed an improved harmonic mean over the conventional method, indicating that the proposed method detects anomalous sounds more stably, and experiments in real environments showed that it can detect sudden anomalous sounds from a 3D printer and an air pump as well as sustained anomalies caused by bearing scratches. This makes it possible to stably solve state classification problems for which anomalous data cannot be collected, with applications to source information estimation for security, such as gunshot detection and unknown speaker detection, and to many other tasks where negative examples are hard to collect.
    The University of Electro-Communications, 201
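The hypothesis-testing view of anomalous sound detection in Chapter 5 can be sketched as follows. This is a hedged toy example, not the thesis's DNN-based detector: a Gaussian is fitted to a feature of normal machine sounds, the anomaly score is the negative log-likelihood, and the threshold is set on held-out normal data to keep the false-positive rate at a target level. Fixing the false-positive rate while maximising detections is the trade-off that the Neyman-Pearson lemma formalises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit a normal model to a (synthetic) feature of normal operating sounds.
normal_train = rng.normal(0.0, 1.0, size=2000)
mu, sigma = normal_train.mean(), normal_train.std()

def anomaly_score(x):
    """Negative log-likelihood (up to a constant) under the normal model."""
    return 0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma)

# Threshold chosen so that at most ~1% of held-out normal data is flagged.
normal_val = rng.normal(0.0, 1.0, size=2000)
threshold = np.quantile(anomaly_score(normal_val), 0.99)

# Sounds whose feature distribution clearly deviates from normal.
anomalous = rng.normal(5.0, 1.0, size=200)
detection_rate = (anomaly_score(anomalous) > threshold).mean()
```

The thesis replaces the hand-fitted Gaussian with a neural network and optimises the detector directly with the Neyman-Pearson score; the sketch only shows the fixed-false-positive-rate decision rule that motivates it.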

    Acoustic sensor network geometry calibration and applications

    In the modern world, we are increasingly surrounded by computing devices with communication links and one or more microphones, for example smartphones, tablets, laptops or hearing aids. These devices can work together as nodes in an acoustic sensor network (ASN). Such networks are a growing platform that opens the possibility for many practical applications: ASN-based speech enhancement, source localization, and event detection can be applied to teleconferencing, camera control, automation, or assisted living. For these kinds of applications, awareness of auditory objects and their spatial positioning are key properties. In order to provide these two kinds of information, novel methods have been developed in this thesis. Information on the type of auditory objects is provided by a novel real-time sound classification method. Information on the position of human speakers is provided by a novel localization and tracking method. In order to localize with respect to the ASN, the relative arrangement of the sensor nodes has to be known; therefore, several novel geometry calibration methods were developed.

    Sound classification: The first method addresses the identification of auditory objects. A novel application of the bag-of-features (BoF) paradigm to acoustic event classification and detection was introduced. It can be used for event and speech detection as well as for speaker identification. Using both mel frequency cepstral coefficient (MFCC) and Gammatone frequency cepstral coefficient (GFCC) features improves the classification accuracy. By using soft quantization and introducing supervised training for the BoF model, superior accuracy is achieved. The method generalizes well from limited training data, works online, and can be computed in a fraction of real time. A dedicated training strategy based on a hierarchy of stationarity enables the detection of speech in mixtures with noise.
This makes the method robust against severe noise levels corrupting the speech signal, so it is possible to provide control information to a beamformer in order to realize blind speech enhancement. A reliable improvement is achieved in the presence of one or more stationary noise sources.

    Speaker localization: The localization method enables each node to determine the direction of arrival (DoA) of concurrent sound sources. The author's neuro-biologically inspired speaker localization method for microphone arrays was refined for use in ASNs. By implementing a dedicated cochlear and midbrain model, it is robust against the reverberation found in indoor rooms. To better model the unknown number of concurrent speakers, an application of the EM algorithm that realizes probabilistic clustering according to auditory scene analysis (ASA) principles was introduced. Based on this approach, a system for Euclidean tracking in ASNs was designed. Each node applies the node-wise localization method and shares probabilistic DoA estimates, together with an estimate of the spectral distribution, with the network. As this information is relatively sparse, it can be transmitted with low bandwidth, and the system is robust against jitter and transmission errors. The information from all nodes is integrated according to spectral similarity to correctly associate concurrent speakers. By incorporating the intersection angle in the triangulation, the precision of the Euclidean localization is improved. Tracks of concurrent speakers are computed over time, as is shown with recordings in a reverberant room.

    Geometry calibration: The central task of geometry calibration has been solved with special focus on sensor nodes equipped with multiple microphones. Novel methods were developed for different scenarios. An audio-visual method was introduced for the calibration of ASNs in video conferencing scenarios.
The DoA estimates are fused with visual speaker tracking in order to provide sensor positions in a common coordinate system. A novel acoustic calibration method determines the relative positioning of the nodes from ambient sounds alone. Unlike previous methods that only infer the positioning of distributed microphones, the DoA is incorporated, so it becomes possible to calibrate the orientation of the nodes with high accuracy. This is very important for all applications using the spatial information, as the triangulation error increases dramatically with bad orientation estimates. Since speech events can be used, calibration becomes possible without the requirement of playing dedicated calibration sounds. Based on this, an online method employing a genetic algorithm with incremental measurements was introduced. By using the robust speech localization method, the calibration is computed in parallel with the tracking. The online method is able to calibrate ASNs in real time, as is shown with recordings of natural speakers in a reverberant room.

    The informed acoustic sensor network: All the new methods are important building blocks for the use of ASNs. The online methods for localization and calibration both make use of the neuro-biologically inspired processing in the nodes, which leads to state-of-the-art results even in reverberant enclosures. The high robustness and reliability can be improved further by including the event detection method in order to exclude non-speech events. When all methods are combined, both semantic information on what is happening in the acoustic scene and spatial information on the positioning of the speakers and sensor nodes are automatically acquired in real time. This realizes truly informed audio processing in ASNs. Practical applicability is shown by application to recordings in reverberant rooms.
The contribution of this thesis is thus not only to advance the state of the art in automatically acquiring information on the acoustic scene, but also to push the practical applicability of such methods.
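The triangulation step at the core of the Euclidean localization can be illustrated with a minimal 2-D sketch (assumed geometry and variable names, not the thesis's implementation): two calibrated nodes each report a world-frame DoA, and the speaker position is the intersection of the two bearing rays.

```python
import numpy as np

def triangulate(p1, theta1, p2, theta2):
    """Intersect rays from node positions p1, p2 along world-frame
    DoA angles theta1, theta2 (radians); returns the 2-D intersection."""
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    # Solve p1 + t1*d1 = p2 + t2*d2 for the ray parameters t1, t2.
    # Note: the 2x2 system becomes ill-conditioned as the rays approach
    # parallel, i.e. for small intersection angles -- which is why the
    # thesis weights triangulations by the intersection angle.
    A = np.column_stack([d1, -d2])
    t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
    return np.asarray(p1, float) + t[0] * d1

# Nodes at (0,0) and (4,0); a source at (2,2) is seen at 45° and 135°.
src = triangulate([0.0, 0.0], np.arctan2(2, 2),
                  [4.0, 0.0], np.arctan2(2, -2))
```

This also shows why orientation calibration matters: theta1 and theta2 are expressed in a common world frame, so any node-orientation error shifts both rays and the intersection with them.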

    Scaling Machine Learning Systems using Domain Adaptation

    Machine-learned components, particularly those trained using deep learning methods, are becoming integral parts of modern intelligent systems, with applications including computer vision, speech processing, natural language processing and human activity recognition. As these machine learning (ML) systems scale to real-world settings, they will encounter scenarios where the distribution of the data in the real world (i.e., the target domain) is different from the data on which they were trained (i.e., the source domain). This phenomenon, known as domain shift, can significantly degrade the performance of ML systems in new deployment scenarios. In this thesis, we study the impact of domain shift caused by variations in system hardware, software and user preferences on the performance of ML systems. After quantifying the performance degradation of ML models in target domains due to the various types of domain shift, we propose unsupervised domain adaptation (uDA) algorithms that leverage unlabeled data collected in the target domain to improve the performance of the ML model. At its core, this thesis argues for the need to develop uDA solutions while adhering to practical scenarios in which ML systems will scale. More specifically, we consider four scenarios: (i) opaque ML systems, wherein parameters of the source prediction model are not made accessible in the target domain, (ii) transparent ML systems, wherein source model parameters are accessible and can be modified in the target domain, (iii) ML systems where source and target domains do not have identical label spaces, and (iv) distributed ML systems, wherein the source and target domains are geographically distributed and their datasets are private and cannot be exchanged during adaptation. We study the unique challenges and constraints of each scenario and propose novel uDA algorithms that outperform state-of-the-art baselines.
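One common way to make the notion of domain shift measurable, and a quantity many uDA methods minimise so that source and target feature distributions align, is the maximum mean discrepancy (MMD). The sketch below is a hedged illustration under an RBF kernel with synthetic data; it is not one of the thesis's algorithms.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between sample sets a (n,d) and b (m,d)."""
    d = a[:, None, :] - b[None, :, :]
    return np.exp(-gamma * (d ** 2).sum(-1))

def mmd2(x, y, gamma=0.5):
    """Biased estimate of the squared MMD between samples x and y."""
    return (rbf(x, x, gamma).mean()
            + rbf(y, y, gamma).mean()
            - 2.0 * rbf(x, y, gamma).mean())

rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, size=(200, 2))          # source-domain features
target_shifted = rng.normal(2.0, 1.0, size=(200, 2))  # shifted target domain
target_same = rng.normal(0.0, 1.0, size=(200, 2))     # no domain shift

shift = mmd2(source, target_shifted)      # large: distributions differ
no_shift = mmd2(source, target_same)      # near zero: same distribution
```

Adversarial uDA methods play an analogous role with a learned domain discriminator instead of a fixed kernel statistic.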

    Acoustic-channel attack and defence methods for personal voice assistants

    Personal Voice Assistants (PVAs) are increasingly used as an interface to digital environments. Voice commands are used to interact with phones, smart homes or cars. In the US alone, the number of smart speakers such as Amazon's Echo and Google Home has grown by 78% to 118.5 million, and 21% of the US population own at least one device. Given the increasing dependency of society on PVAs, their security and privacy have become a major concern of users, manufacturers and policy makers. Consequently, a steep increase in research efforts addressing the security and privacy of PVAs can be observed in recent years. While some security and privacy research applicable to the PVA domain predates their recent increase in popularity, and many new research strands have emerged, research dedicated specifically to PVA security and privacy is still lacking. The most important interaction interface between users and a PVA is the acoustic channel, so security and privacy studies related to the acoustic channel are desirable and required. The aim of the work presented in this thesis is to improve the understanding of security and privacy issues of PVA usage related to the acoustic channel, to propose principles and solutions for key usage scenarios to mitigate potential security threats, and to present a novel type of dangerous attack which can be launched using a PVA alone. The five core contributions of this thesis are: (i) a taxonomy is built for the research domain of PVA security and privacy issues related to the acoustic channel. An extensive overview of the state of the art is provided, describing a comprehensive research map for PVA security and privacy; it is also shown where the contributions of this thesis lie in this taxonomy; (ii) work has emerged aiming to generate adversarial audio inputs which sound harmless to humans but can trick a PVA into recognising harmful commands.
The majority of this work has focused on the attack side; work on how to defend against this type of attack is rare. A defence method against white-box adversarial commands is proposed and implemented as a prototype. It is shown that a defence Automatic Speech Recognition (ASR) system can work in parallel with the PVA's main one, and adversarial audio input is detected if the difference in the speech decoding results between the two ASRs surpasses a threshold. It is demonstrated that an ASR that differs in architecture and/or training data from the PVA's main ASR is usable as the protection ASR; (iii) PVAs continuously monitor conversations, which may be transported to a cloud back end where they are stored, processed and perhaps even passed on to other service providers. A user has limited control over this process when a PVA is triggered without the user's intent or when a PVA belongs to someone else. A user is unable to control the recording behaviour of surrounding PVAs, unable to signal privacy requirements and unable to track conversation recordings. An acoustic tagging solution is proposed, aiming to embed additional information into the acoustic signals processed by PVAs. A user employs a tagging device which emits an acoustic signal when PVA activity is assumed. Any active PVA will embed this tag into its recorded audio stream. The tag may signal to a cooperating PVA or back-end system that a user has not given recording consent. The tag may also be used to trace when and where a recording was taken, if necessary. A prototype tagging device based on PocketSphinx is implemented. Using a Google Home Mini as the PVA, it is demonstrated that the device can tag conversations and that the tagging signal can be retrieved from conversations stored in the Google back-end system; (iv) acoustic tagging gives users the capability to signal their permission to the back-end PVA service; a further solution, inspired by Denial of Service (DoS), is proposed for protecting user privacy.
Although PVAs are very helpful, they also continuously monitor conversations. When a PVA detects a wake word, the immediately following conversation is recorded and transported to a cloud system for further analysis. An active protection mechanism is proposed: reactive jamming. A Protection Jamming Device (PJD) is employed to observe conversations. Upon detection of a PVA wake word, the PJD emits an acoustic jamming signal. The PJD must detect the wake word faster than the PVA, such that the jamming signal still prevents wake word detection by the PVA. An evaluation of the effectiveness of different jamming signals and of the overlap between wake words and the jamming signals is carried out. 100% jamming success can be achieved with an overlap of at least 60%, with a negligible false positive rate; (v) the acoustic components (speakers and microphones) of a PVA can potentially be re-purposed to achieve acoustic sensing. This has serious security and privacy implications due to the key role of PVAs in digital environments. The first active acoustic side-channel attack is proposed. Speakers are used to emit human-inaudible acoustic signals and the echo is recorded via microphones, turning the acoustic system of a smartphone into a sonar system. The echo signal can be used to profile user interaction with the device. For example, a victim's finger movements can be monitored to steal Android unlock patterns. The number of candidate unlock patterns that an attacker must try to authenticate herself to a Samsung S4 phone can be reduced by up to 70% using this novel, unnoticeable acoustic side-channel.
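The defence rule in contribution (ii), flagging an input when the main and protection ASR decodings diverge, can be sketched as follows. This is a hypothetical illustration: the thesis compares the outputs of two real ASR systems, whereas here the decodings are given as strings and compared by word-level edit distance against a threshold.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between token lists a and b."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def is_adversarial(main_decoding, protection_decoding, threshold=2):
    """Flag input if the two ASR decodings differ by more than threshold words."""
    return edit_distance(main_decoding.split(),
                         protection_decoding.split()) > threshold

# Benign audio: both ASRs produce nearly identical transcripts.
benign = is_adversarial("turn on the living room light",
                        "turn on the living room lights")
# Adversarial audio: the main ASR is fooled, the protection ASR is not.
attacked = is_adversarial("unlock the front door",
                          "play some relaxing music now")
```

The threshold value here is an arbitrary placeholder; in practice it would be tuned on benign audio to balance false positives against missed attacks.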

    Proceedings of the 19th Sound and Music Computing Conference

    Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France). https://smc22.grame.f

    Proceedings of the Scientific-Practical Conference "Research and Development - 2016"

    talent management; sensor arrays; automatic speech recognition; dry separation technology; oil production; oil waste; laser technology