
    Sound Event Detection Using Spatial Features and Convolutional Recurrent Neural Network

    This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection. We extend the convolutional recurrent neural network to handle more than one type of these multichannel features by learning from each of them separately in the initial stages. We show that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events in multichannel audio better when the features are presented as separate layers of a volume. Using the proposed spatial features instead of monaural features on the same network gives an absolute F-score improvement of 6.1% on the publicly available TUT-SED 2016 dataset and 2.7% on the TUT-SED 2009 dataset, which is fifteen times larger. Accepted for the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2017).
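As a rough illustration of the feature layout described above, the sketch below contrasts concatenating per-channel features into a single vector with stacking them as separate layers of an input volume. The feature types, shapes, and array names are hypothetical; the paper's actual features and network are not reproduced here.

```python
import numpy as np

# Hypothetical setup: two multichannel feature types (e.g. log-mel
# energies and a spatial feature) for C=2 channels, T frames, F bins.
T, F, C = 100, 40, 2
logmel = np.random.randn(T, F, C)    # one feature map per channel
spatial = np.random.randn(T, F, C)

# Concatenation baseline: all channels flattened into one long vector
# per frame, discarding the 2-D time-frequency structure per channel.
concat = np.concatenate([logmel.reshape(T, -1), spatial.reshape(T, -1)],
                        axis=1)
assert concat.shape == (T, 2 * F * C)

# Volume layout: each channel kept as a separate layer of the input,
# so a convolutional front end can learn inter-channel cues directly.
volume = np.concatenate([logmel, spatial], axis=-1)  # (T, F, 2C)
assert volume.shape == (T, F, 2 * C)
```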

    The activization of passive forms in Finnish (Passiivimuotojen aktiivistuminen suomen kielessä)


    The automatic analysis of classroom talk

    The SMART SPEECH Project is a joint venture between three Finnish universities and a Chilean university. The aim is to develop a mobile application that can be used to record classroom talk and enable observations to be made of classroom interactions. We recorded Finnish and Chilean physics teachers’ speech using both a conventional microphone/dictaphone setup and a microphone/mobile application setup. The recordings were analysed via automatic speech recognition (ASR). The average word error rate achieved for the Finnish teachers’ speech was under 40%. The ASR approach also enabled us to determine the key topics discussed within the Finnish physics lessons under scrutiny. The results here were promising, as the recognition accuracy rate was about 85% on average.

    Mobile Microphone Array Speech Detection and Localization in Diverse Everyday Environments

    Joint sound event localization and detection (SELD) is an integral part of developing context awareness into communication interfaces of mobile robots, smartphones, and home assistants. For example, an automatic audio focus for video capture on a mobile phone requires robust detection of relevant acoustic events around the device and their direction. Existing SELD approaches have been evaluated using material produced in controlled indoor environments, or the audio is simulated by mixing isolated sounds into different spatial locations. This paper studies SELD of speech in diverse everyday environments, where the audio corresponds to typical usage scenarios of handheld mobile devices. In order to allow weighting the relative importance of localization vs. detection, we propose a two-stage hierarchical system, where the first stage detects the target events and the second stage localizes them. The proposed method utilizes a convolutional recurrent neural network (CRNN) and is evaluated on a database of manually annotated microphone array recordings from various acoustic conditions. The array is embedded in a contemporary mobile phone form factor. The obtained results show good speech detection and localization accuracy of the proposed method in contrast to a non-hierarchical flat classification model.
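The detect-then-localize structure can be sketched as below. The per-frame network outputs, the number of direction classes, and the detection threshold are all invented for illustration; this is not the authors' CRNN.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-frame network outputs: a speech probability
# (stage 1, detection) and direction logits (stage 2, localization).
n_frames, n_directions = 8, 36               # 36 azimuth classes of 10 deg
p_speech = rng.random(n_frames)              # stage-1 detection scores
doa_logits = rng.standard_normal((n_frames, n_directions))

threshold = 0.5
active = p_speech >= threshold               # detect first ...
doa = np.full(n_frames, -1)                  # -1 marks "no active source"
doa[active] = doa_logits[active].argmax(axis=1) * 10  # ... then localize

# Localization is only reported for frames the detector marked active.
assert np.all(doa[~active] == -1)
```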

    Audio source separation into the wild

    This review chapter is dedicated to multichannel audio source separation in real-life environments. We explore some of the major achievements in the field and discuss some of the remaining challenges. We examine several important practical scenarios, e.g. moving sources and/or microphones, varying numbers of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization problems. Several applications, such as smart assistants, cellular phones, hearing aids, and robots, are discussed. Our perspectives on the future of the field are given as concluding remarks of this chapter.

    Acoustic Source Localization in a Room Environment and at Moderate Distances

    The pressure changes of an acoustic wavefront are sensed with a microphone that acts as a transducer, converting sound pressure into voltage. The voltage is then converted into digital form with an analog-to-digital (AD) converter to provide a discrete-time quantized digital signal. This thesis discusses methods to estimate the location of a sound source from the signals of multiple microphones. Acoustic source localization (ASL) can be used to locate talkers, which is useful for speech communication systems such as teleconferencing and hearing aids. Active localization methods receive and send energy, whereas passive methods only receive energy. The discussed ASL methods are passive, which makes them attractive for surveillance applications such as localization of vehicles and monitoring of areas. This thesis focuses on ASL in a room environment and at the moderate distances that are often present in outdoor applications.

    The frequency range of many commonly occurring sounds, such as speech, vehicles, and jet aircraft, is large. Time delay estimation (TDE) methods are suitable for estimating properties of such wideband signals. Since TDE methods have been extensively studied, the theory is attractive to apply in localization. Time difference of arrival (TDOA) based methods estimate the source location from measured TDOA values between microphones. These methods are computationally attractive but deteriorate rapidly when the TDOA estimates are no longer directly related to the source position. In a room environment, such conditions can arise when reverberation or noise starts to dominate TDOA estimation. The combination of microphone-pairwise TDE measurements is studied as a more robust localization solution. TDE measurements are combined into a spatial likelihood function (SLF) of source position. A sequential Bayesian method known as particle filtering (PF) is used to estimate the source position. The PF-based localization accuracy increases when the variance of the SLF decreases. Results from simulations and real data show that multiplication (an intersection operation) results in an SLF with smaller variance than the typically applied summation (a union operation).

    The above localization methods assume that the source is located in the near-field of the microphone array, i.e., the curvature of the source-emitted wavefront is observable. In the far-field, the source wavefront is assumed planar, and localization is considered using spatially separated direction observations. The direction of arrival (DOA) of a source-emitted wavefront impinging on a microphone array is traditionally estimated by steering the array to the direction that maximizes the steered response power. Such estimates can be deteriorated by noise and reverberation. Therefore, talker localization is considered using DOA discrimination. The sound propagation delay from the source to the microphone array becomes significant at moderate distances. As a result, the directional observations of a moving sound source point behind the true source position. Omitting the propagation delay results in a biased location estimate of a moving or discontinuously emitting source. To solve this problem, the propagation delay is modeled in the estimation process. Motivated by the robustness of localization using combined TDE measurements, source localization by directly combining the TDE-based array steered responses is considered. This extends the near-field talker localization methods to far-field source localization. The presented propagation delay modeling is then proposed for steered response localization. The improvement in localization accuracy from including the propagation delay is studied using a simulated moving sound source in the atmosphere.

    The presented indoor localization methods have been evaluated in the Classification of Events, Activities and Relationships (CLEAR) 2006 and CLEAR'07 technology evaluations, where the performance of the proposed ASL methods was assessed by a third party on several hours of annotated data gathered from meetings held in multiple smart rooms. According to the results from the CLEAR'07 development dataset (166 min) presented in this thesis, 92% of speech activity in a meeting situation was located within 17 cm accuracy.
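The claim that multiplying pairwise likelihoods (intersection) concentrates the spatial likelihood function more than summing them (union) can be illustrated with a toy one-dimensional example. The grid, the Gaussian likelihood model, and the measurement values below are all made up for illustration:

```python
import numpy as np

# 1-D position grid (metres) and two noisy pairwise likelihoods
# centred near a hypothetical true source position of 2.0 m.
x = np.linspace(0.0, 5.0, 501)

def pair_likelihood(center, width):
    """Toy per-microphone-pair likelihood over the grid."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

l1 = pair_likelihood(1.9, 0.5)
l2 = pair_likelihood(2.1, 0.5)

union = l1 + l2          # summation (union) combination
intersection = l1 * l2   # multiplication (intersection) combination

def spread(slf):
    """Variance of the grid position under the normalized SLF."""
    p = slf / slf.sum()
    mean = (p * x).sum()
    return (p * (x - mean) ** 2).sum()

# Multiplication concentrates the SLF around the source position,
# giving a smaller variance than summation.
assert spread(intersection) < spread(union)
```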

    Supplementary selling as part of the customer service process at Sokos Hotel Ilves (Lisämyynti osana asiakaspalveluprosessia Sokos Hotel Ilveksessä)

    The commissioner of the thesis was Sokos Hotel Ilves, and the objective was to develop the supplementary selling carried out at Sokos Hotel Ilves from the customer's point of view. Another objective was to create, based on the research results, a supplementary selling model describing the supplementary selling opportunities in each phase of the customer service process at Sokos Hotel Ilves. To reach these objectives, customers' opinions and experiences of the recommendations made by the receptionist were surveyed, and the advertising at the hotel was also studied. The research was carried out using theme interviews, a qualitative research method. At the end of 2009, 32 customers were interviewed by telephone. The results were analyzed by themes, which emerged as recommendation as a form of service and concretizing supplementary selling. The material was also analyzed through the phases of the customer service process. The results showed that the customers considered recommendation part of good customer service. Recommendations made by the receptionist positively affected customer satisfaction as well as purchase decisions. Customers who stayed at Sokos Hotel Ilves for the first time considered supplementary selling especially important, whereas business travelers who stayed there frequently valued recommendations the least. The research showed that Sokos Hotel Ilves should particularly develop its in-house advertising to meet customer needs. The resulting supplementary selling model shows, through practical examples, how the receptionist can recommend services and products that meet the customer's needs at the right moments. In the future, the supplementary selling model can be utilized not only at Sokos Hotel Ilves but also in other hotels of the Sokos Hotels chain.

    Data-Dependent Ensemble of Magnitude Spectrum Predictions for Single Channel Speech Enhancement

    The time-frequency mask and the magnitude spectrum are two common targets for deep learning-based speech enhancement. Both the ensemble and the neural network fusion of magnitude spectra obtained with these approaches have been shown to improve objective perceptual quality on synthetic mixtures of data. This work generalizes the ensemble approach by proposing neural network layers that predict time-frequency varying weights for the combination of the two magnitude spectra. In order to combine the best individual magnitude spectrum estimates, the weight prediction network is trained after the time-frequency mask and magnitude spectrum sub-networks have been separately trained for their corresponding objectives and their weights have been frozen. Using the publicly available CHiME-3 challenge data, which consists of both simulated and real speech recordings in everyday environments with noise and interference, the proposed approach leads to significantly higher noise suppression in terms of segmental source-to-distortion ratio than the alternative approaches. In addition, the approach achieves similar improvements in average objective instrumentally measured intelligibility scores with respect to the best achieved scores.
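A minimal sketch of the time-frequency varying combination described above, assuming the weight network outputs one sigmoid weight per TF bin. The shapes and the stand-in values for the sub-network outputs are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical magnitude-spectrum estimates from the two sub-networks
# (time-frequency mask branch and direct magnitude branch), T x F bins.
T, F = 50, 257
mag_from_mask = rng.random((T, F))
mag_direct = rng.random((T, F))

# The weight-prediction layers output a TF-varying weight in (0, 1);
# here a random sigmoid output stands in for the trained network.
w = 1.0 / (1.0 + np.exp(-rng.standard_normal((T, F))))

# Convex per-bin combination of the two magnitude estimates.
combined = w * mag_from_mask + (1.0 - w) * mag_direct

# Each combined bin lies between the two individual estimates.
lo = np.minimum(mag_from_mask, mag_direct)
hi = np.maximum(mag_from_mask, mag_direct)
assert np.all(combined >= lo - 1e-12) and np.all(combined <= hi + 1e-12)
```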

    Microphone-Array-Based Speech Enhancement Using Neural Networks

    This chapter analyses the use of artificial neural networks (ANNs) in learning to predict time-frequency (TF) masks from noisy input data. Artificial neural networks are inspired by the operation of biological neural networks, where individual neurons receive inputs from other connected neurons. The chapter focuses on TF mask prediction for speech enhancement in dynamic noise environments using artificial neural networks. It reviews the enhancement framework of microphone array signals using beamforming with post-filtering. The chapter presents an overview of the supervised learning framework used for TF mask-based speech enhancement. It explores the effectiveness of feed-forward neural networks for a real-world enhancement application using recordings from everyday noisy environments, where a microphone array is used to capture the signals. Estimated instrumental intelligibility and signal-to-noise ratio (SNR) scores are evaluated to measure how well the predicted masks improve speech quality, using networks trained on different input features.
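The TF mask application step itself can be sketched as follows. A random mask stands in for the network prediction here, and the noisy phase is reused, a common simplifying assumption in mask-based enhancement; the shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy example of TF mask-based enhancement: in practice an ANN predicts
# the mask from features of the noisy (possibly beamformed) signal.
T, F = 20, 129
noisy_mag = rng.random((T, F)) + 0.1            # noisy magnitude spectrogram
noisy_phase = rng.uniform(-np.pi, np.pi, (T, F))
mask = rng.random((T, F))                       # ratio mask in [0, 1]

# Enhancement: attenuate each TF bin by the mask, reuse the noisy phase.
enhanced = mask * noisy_mag * np.exp(1j * noisy_phase)

# The mask can only attenuate, never amplify, a bin's magnitude.
assert np.all(np.abs(enhanced) <= noisy_mag + 1e-12)
```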