
    Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation

    End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but one specific issue it must address is its difficulty in handling enharmonic words: named entities (NEs) that share the same pronunciation and part of speech but are spelled differently. This often occurs with Japanese personal names that sound identical but are written with different Kanji characters. Since such NE words tend to be important keywords, an ASR system quickly loses user trust if it misrecognizes them. To solve this problem, this paper proposes a novel retraining-free customization method for E2E-ASR based on a named-entity-aware E2E-ASR model and phoneme similarity estimation. Experimental results show that the proposed method improves the target NE character error rate by 35.7% on average, relative to a conventional E2E-ASR model, when personal names are selected as the target NE.
    Comment: accepted by INTERSPEECH202
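    To illustrate the phoneme-similarity idea, here is a minimal sketch, not the paper's actual method: an ASR hypothesis is mapped onto the registered named entity whose phoneme sequence is most similar, with no retraining involved. The NE lexicon, phoneme notation, threshold, and function names are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical registered named entities: surface form -> phonemes.
NE_LEXICON = {
    "斎藤": ["s", "a", "i", "t", "o", ":"],  # Saito
    "佐藤": ["s", "a", "t", "o", ":"],       # Sato
}

def phoneme_similarity(p1, p2):
    """Similarity in [0, 1] between two phoneme sequences."""
    return SequenceMatcher(None, p1, p2).ratio()

def rescore_to_ne(hyp_phonemes, threshold=0.8):
    """Map an ASR hypothesis onto the closest registered NE if the
    phoneme similarity clears a threshold; otherwise return None.
    The E2E-ASR model itself is left untouched (retraining-free)."""
    best = max(NE_LEXICON,
               key=lambda ne: phoneme_similarity(hyp_phonemes, NE_LEXICON[ne]))
    if phoneme_similarity(hyp_phonemes, NE_LEXICON[best]) >= threshold:
        return best
    return None

# A slightly misrecognized phoneme sequence still maps to 斎藤.
print(rescore_to_ne(["s", "a", "i", "t", "o"]))
```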

    Improvement of DOA Estimation by using Quaternion Output in Sound Event Localization and Detection

    This paper describes an improvement of Direction of Arrival (DOA) estimation performance using quaternion output in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3. DCASE 2019 Task 3 focuses on sound event localization and detection (SELD), a task that estimates the sound source direction in addition to performing conventional sound event detection (SED). In the baseline method, the sound source direction angle is regressed directly. However, the angle is a periodic quantity with discontinuities, which can make learning unstable. Specifically, even though -180 deg and 180 deg are the same direction, a large loss is calculated between them. Estimating DOA angles with a classification approach instead of regression can avoid this instability, but it limits the angular resolution. In this paper, we propose to introduce the quaternion, which is a continuous representation, into the output layer of the neural network instead of directly estimating the sound source direction angle. This method can be implemented simply by changing the output of an existing neural network, and thus does not significantly increase the number of parameters in the middle layers. Experimental results show that the proposed method improves DOA estimation without significantly increasing the number of parameters.
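    To illustrate why a quaternion output avoids the ±180 deg discontinuity, here is a minimal sketch that encodes an azimuth as a rotation about the vertical axis and decodes it back. The sign-ambiguity handling in the loss (q and -q denote the same rotation) is an assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def azimuth_to_quaternion(theta_deg):
    """Encode an azimuth as a unit quaternion (w, x, y, z) for a
    rotation about the vertical (z) axis."""
    half = np.deg2rad(theta_deg) / 2.0
    return np.array([np.cos(half), 0.0, 0.0, np.sin(half)])

def quaternion_to_azimuth(q):
    """Decode the azimuth from the (possibly unnormalized) output."""
    q = q / np.linalg.norm(q)
    return np.rad2deg(2.0 * np.arctan2(q[3], q[0]))

def quaternion_loss(q_pred, q_true):
    """q and -q represent the same rotation, so take the smaller of
    the two squared distances (assumed handling of the ambiguity)."""
    return min(np.sum((q_pred - q_true) ** 2),
               np.sum((q_pred + q_true) ** 2))

# -180 deg and 180 deg are the same direction: the loss is ~0,
# even though the raw angles differ by 360 deg.
qa, qb = azimuth_to_quaternion(-180.0), azimuth_to_quaternion(180.0)
print(quaternion_loss(qa, qb))  # ~0.0
```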

    Footstep Detection and Classification Using Distributed Microphones

    This paper addresses footstep detection and classification with multiple microphones distributed on the floor. We propose introducing geometrical features, such as the position and velocity of a sound source estimated by amplitude-based localization, for classification. Unlike a conventional microphone array technique, this does not require precise inter-microphone time synchronization. To classify various types of sound events, we introduce four types of features: time-domain, spectral, and cepstral features in addition to the geometrical features. We constructed a prototype footstep detection and classification system based on the proposed ideas, with eight microphones aligned in a 2-by-4 grid. Preliminary classification experiments showed that classification accuracy for four types of sound sources (a walking footstep, a running footstep, a handclap, and an utterance) remains above 70% even at a low signal-to-noise ratio such as 0 dB. We also confirmed two advantages of the proposed footstep detection and classification. One is that the proposed features can be applied to the classification of sound sources other than footsteps. The other is that the multichannel approach further improves noise robustness by selecting the best microphone among those available and by providing geometrical information on a sound source.
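    A minimal sketch of amplitude-based localization on a distributed grid follows. The 2-by-4 geometry comes from the paper, but the energy-weighted-centroid estimator, the frame handling, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical 2-by-4 grid of floor microphones (x, y) in meters.
MIC_POSITIONS = np.array([[x, y] for y in (0.0, 1.0)
                                 for x in (0.0, 1.0, 2.0, 3.0)])

def localize_by_amplitude(frames):
    """Amplitude-based localization: weight each microphone's
    position by its short-time energy. No inter-microphone time
    synchronization is needed, unlike TDOA-based array techniques.
    frames: (n_mics, n_samples) array holding one analysis frame."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    weights = energy / np.sum(energy)
    return weights @ MIC_POSITIONS  # weighted centroid (x, y)

def geometrical_features(positions, frame_rate):
    """Position and velocity features over consecutive frames."""
    positions = np.asarray(positions)
    velocity = np.diff(positions, axis=0) * frame_rate
    return positions[1:], velocity

# Example: a source near microphone 2 dominates the energy.
rng = np.random.default_rng(0)
frames = rng.normal(0.0, 0.05, size=(8, 512))
frames[2] += rng.normal(0.0, 1.0, size=512)  # loud near mic 2
print(localize_by_amplitude(frames))  # close to MIC_POSITIONS[2]
```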

    Noise robust 2D bird localization via sound using microphone arrays

    Birds in the wild are difficult to localize because they tend to be small, move swiftly, and are often visually occluded. However, their location information is crucial for ethological studies of bird behaviour. Automating the localization process has recently become an active research topic, commonly using spatial sensors and sensor networks. To avoid the visual occlusion problem, many studies focus on acoustic signal processing, applying microphone arrays to perform 1D azimuth localization from bird songs. In this study, we perform 2D sound source localization in Cartesian coordinates using azimuths from multiple microphone arrays, estimating a bird's location from the intersection points of the azimuth lines. Although this approach is simple and easy to implement, it has two main issues. One is that even small noise in the azimuth values corrupts the localization: the azimuth lines for a single bird no longer intersect at one point but at several, making it difficult to estimate the bird's exact location. Especially in far-field applications, even small noise leads to large localization errors. The other issue is that in a bird's natural habitat, elements such as leaves, grass, and rivers act as noise sources, making it difficult to extract bird songs. We propose an algorithm combining statistical methods, sound feature analysis, and machine learning, and use it to build a noise-robust bird localization system. We performed numerous simulations to understand the system's limitations and, based on the results, derived design guidelines describing how the results change with the number of microphone arrays, the signal-to-noise ratio, the bird's distance from the devices, the array's transfer function, the species of the singing bird, and the specific parameter settings used in the algorithms. Such detailed guidelines can help interested researchers create a similar system, which can contribute to ethological research.
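    The intersection-of-azimuth-lines step can be illustrated with standard least-squares triangulation: when noisy bearing lines do not meet at one point, minimize the sum of squared perpendicular distances to all lines instead. The array layout and azimuth values below are made up, and the paper's statistical noise handling is omitted.

```python
import numpy as np

def intersect_azimuth_lines(array_positions, azimuths_deg):
    """Least-squares intersection of 2D bearing lines. Each line
    starts at a microphone array position and points along the
    estimated azimuth; solve for the point minimizing the sum of
    squared perpendicular distances to all lines."""
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for p, az in zip(np.asarray(array_positions, float),
                     np.deg2rad(azimuths_deg)):
        d = np.array([np.cos(az), np.sin(az)])  # unit direction
        P = np.eye(2) - np.outer(d, d)          # projector onto the normal
        A += P
        b += P @ p
    return np.linalg.solve(A, b)

# Three arrays observing a bird near (5, 5), with noisy azimuths.
arrays = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
azimuths = [44.0, 136.0, -44.0]  # degrees, slightly perturbed
print(intersect_azimuth_lines(arrays, azimuths))  # ~ (5, 5)
```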

    A robot uses its own microphone to synchronize its steps to musical beats while scatting and singing

    Musical beat tracking is an effective technology for human-robot interaction such as musical sessions. Since such interaction should take place naturally in various environments, musical beat tracking for a robot should cope, using its own microphone, with noise sources such as environmental noise, its own motor noise, and its own voice. This paper presents a musical beat tracking robot that can step, scat, and sing in time with musical beats using its own microphone. To realize such a robot, we propose a robust beat tracking method that introduces two key techniques: spectro-temporal pattern matching and echo cancellation. The former realizes robust tempo estimation with a shorter window length and can therefore adapt quickly to tempo changes. The latter is effective for cancelling self-noise such as stepping, scatting, and singing. We implemented the proposed beat tracking method on Honda ASIMO. Experimental results showed ten times faster adaptation to tempo changes and high robustness of the beat tracking against stepping, scatting, and singing noise. We also demonstrated that the robot times its steps to musical beats while scatting or singing.
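    A minimal sketch of the spectro-temporal pattern matching idea: estimate the tempo as the time lag at which a spectrogram segment best matches a shifted copy of itself. The framing, normalization, and BPM search range here are assumptions, not the robot's actual implementation.

```python
import numpy as np

def estimate_beat_interval(spec, frame_rate, bpm_range=(60, 180)):
    """Find the time lag at which the spectrogram best matches a
    shifted copy of itself (normalized correlation over lags).
    spec: (n_freq_bins, n_frames) magnitude spectrogram.
    Returns the inter-beat interval in seconds."""
    n_frames = spec.shape[1]
    min_lag = int(frame_rate * 60.0 / bpm_range[1])
    max_lag = int(frame_rate * 60.0 / bpm_range[0])
    scores = []
    for lag in range(min_lag, max_lag + 1):
        a, b = spec[:, lag:], spec[:, :n_frames - lag]
        num = np.sum(a * b)
        den = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + 1e-12
        scores.append(num / den)
    best_lag = min_lag + int(np.argmax(scores))
    return best_lag / frame_rate

# 120 BPM toy example: impulses every 0.5 s in a sparse spectrogram.
frame_rate = 100.0
spec = np.zeros((16, 300))
spec[:, ::50] = 1.0
print(60.0 / estimate_beat_interval(spec, frame_rate))  # ~120 BPM
```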