3,451 research outputs found
Automatic Conflict Detection in Police Body-Worn Audio
Automatic conflict detection has grown in relevance with the advent of
body-worn technology, but existing metrics such as turn-taking and overlap are
poor indicators of conflict in police-public interactions. Moreover, standard
techniques to compute them fall short when applied to such diversified and
noisy contexts. We develop a pipeline catered to this task combining adaptive
noise removal, non-speech filtering and new measures of conflict based on the
repetition and intensity of phrases in speech. We demonstrate the effectiveness
of our approach on body-worn audio data collected by the Los Angeles Police
Department.Comment: 5 pages, 2 figures, 1 tabl
Sampling-based speech parameter generation using moment-matching networks
This paper presents sampling-based speech parameter generation using
moment-matching networks for Deep Neural Network (DNN)-based speech synthesis.
Although people never produce exactly the same speech even if we try to express
the same linguistic and para-linguistic information, typical statistical speech
synthesis produces completely the same speech, i.e., there is no
inter-utterance variation in synthetic speech. To give synthetic speech natural
inter-utterance variation, this paper builds DNN acoustic models that make it
possible to randomly sample speech parameters. The DNNs are trained so that
they make the moments of generated speech parameters close to those of natural
speech parameters. Since the variation of speech parameters is compressed into
a low-dimensional simple prior noise vector, our algorithm has lower
computation cost than direct sampling of speech parameters. As the first step
towards generating synthetic speech that has natural inter-utterance variation,
this paper investigates whether or not the proposed sampling-based generation
deteriorates synthetic speech quality. In evaluation, we compare speech quality
of conventional maximum likelihood-based generation and proposed sampling-based
generation. The result demonstrates the proposed generation causes no
degradation in speech quality.Comment: Submitted to INTERSPEECH 201
Efficient Invariant Features for Sensor Variability Compensation in Speaker Recognition
In this paper, we investigate the use of invariant features for speaker recognition. Owing to their characteristics, these features are introduced to cope with the difficult and challenging problem of sensor variability and the source of performance degradation inherent in speaker recognition systems. Our experiments show: (1) the effectiveness of these features in match cases; (2) the benefit of combining these features with the mel frequency cepstral coefficients to exploit their discrimination power under uncontrolled conditions (mismatch cases). Consequently, the proposed invariant features result in a performance improvement as demonstrated by a reduction in the equal error rate and the minimum decision cost function compared to the GMM-UBM speaker recognition systems based on MFCC features
Human abnormal behavior impact on speaker verification systems
Human behavior plays a major role in improving human-machine communication. The performance must be affected by abnormal behavior as systems are trained using normal utterances. The abnormal behavior is often associated with a change in the human emotional state. Different emotional states cause physiological changes in the human body that affect the vocal tract. Fear, anger, or even happiness we recognize as a deviation from a normal behavior. The whole spectrum of human-machine application is susceptible to behavioral changes. Abnormal behavior is a major factor, especially for security applications such as verification systems. Face, fingerprint, iris, or speaker verification is a group of the most common approaches to biometric authentication today. This paper discusses human normal and abnormal behavior and its impact on the accuracy and effectiveness of automatic speaker verification (ASV). The support vector machines classifier inputs are Mel-frequency cepstral coefficients and their dynamic changes. For this purpose, the Berlin Database of Emotional Speech was used. Research has shown that abnormal behavior has a major impact on the accuracy of verification, where the equal error rate increase to 37 %. This paper also describes a new design and application of the ASV system that is much more immune to the rejection of a target user with abnormal behavior.Web of Science6401274012
PIANO: Proximity-based User Authentication on Voice-Powered Internet-of-Things Devices
Voice is envisioned to be a popular way for humans to interact with
Internet-of-Things (IoT) devices. We propose a proximity-based user
authentication method (called PIANO) for access control on such voice-powered
IoT devices. PIANO leverages the built-in speaker, microphone, and Bluetooth
that voice-powered IoT devices often already have. Specifically, we assume that
a user carries a personal voice-powered device (e.g., smartphone, smartwatch,
or smartglass), which serves as the user's identity. When another voice-powered
IoT device of the user requires authentication, PIANO estimates the distance
between the two devices by playing and detecting certain acoustic signals;
PIANO grants access if the estimated distance is no larger than a user-selected
threshold. We implemented a proof-of-concept prototype of PIANO. Through
theoretical and empirical evaluations, we find that PIANO is secure, reliable,
personalizable, and efficient.Comment: To appear in ICDCS'1
- …