Computer Graphics and Video Features for Speaker Recognition
We describe a non-traditional method for speaker recognition that uses features and algorithms employed mainly in computer vision. The necessary theoretical background in computer recognition is summarized first. Boosted Binary Features (BBF), a previously proposed method with roots in computer vision, are then described in detail as an application of graphical features to speaker recognition. The method is evaluated on the standard speech databases TIMIT and NIST SRE 2010. Experimental results are summarized and compared with standard methods. Possible directions for future work are proposed at the end.
About Voice: A Longitudinal Study of Speaker Recognition Dataset Dynamics
Like face recognition, speaker recognition is widely used for voice-based
biometric identification in a broad range of industries, including banking,
education, recruitment, immigration, law enforcement, healthcare, and
well-being. However, while dataset evaluations and audits have improved data
practices in computer vision and face recognition, the data practices in
speaker recognition have gone largely unquestioned. Our research aims to
address this gap by exploring how dataset usage has evolved over time and what
implications this has on bias and fairness in speaker recognition systems.
Previous studies have demonstrated the presence of historical, representation,
and measurement biases in popular speaker recognition benchmarks. In this
paper, we present a longitudinal study of speaker recognition datasets used for
training and evaluation from 2012 to 2021. We survey close to 700 papers to
investigate community adoption of datasets and changes in usage over a crucial
time period where speaker recognition approaches transitioned to the widespread
adoption of deep neural networks. Our study identifies the most commonly used
datasets in the field, examines their usage patterns, and assesses their
attributes that affect bias, fairness, and other ethical concerns. Our findings
suggest areas for further research on the ethics and fairness of speaker
recognition technology. Comment: 14 pages (23 with References and Appendix)
Privacy Protection for Life-log System
Tremendous advances in wearable computing and storage technologies enable us to record not just snapshots of an event but the whole human experience over a long period of time. Such a "life-log" system captures important events as they happen, rather than as an afterthought. Such a system has applications in many areas such as law enforcement, personal archives, police questioning, and medicine. Most existing efforts focus on the pattern recognition and information retrieval aspects of the system. On the other hand, the privacy issues raised by such an intrusive system have not received much attention from the research community. The objectives of this research project are two-fold: first, to construct a wearable life-log video system, and second, to provide a solution for protecting the identity of the subjects in the video while keeping the video useful. In this thesis work, we designed a portable wearable life-log system that implements audio distortion and face blocking in real time to protect the privacy of the subjects who are being recorded in life-log video. For audio, our system automatically isolates the subject's speech and distorts it using a pitch-shifting algorithm to conceal the identity. For video, our system uses a real-time face detection, tracking and blocking algorithm to obfuscate the faces of the subjects. Extensive experiments have been conducted on interview videos to demonstrate the ability of our system to protect the identity of the subject while maintaining the usability of the life-log video.
Self-Adaptive Soft Voice Activity Detection using Deep Neural Networks for Robust Speaker Verification
Voice activity detection (VAD), which classifies frames as speech or
non-speech, is an important module in many speech applications including
speaker verification. In this paper, we propose a novel method, called
self-adaptive soft VAD, to incorporate a deep neural network (DNN)-based VAD
into a deep speaker embedding system. The proposed method is a combination of
the following two approaches. The first approach is soft VAD, which performs a
soft selection of frame-level features extracted from a speaker feature
extractor. The frame-level features are weighted by their corresponding speech
posteriors estimated from the DNN-based VAD, and then aggregated to generate a
speaker embedding. The second approach is self-adaptive VAD, which fine-tunes
the pre-trained VAD on the speaker verification data to reduce the domain
mismatch. Here, we introduce two unsupervised domain adaptation (DA) schemes,
namely speech posterior-based DA (SP-DA) and joint learning-based DA (JL-DA).
Experiments on a Korean speech database demonstrate that the verification
performance is improved significantly in real-world environments by using
self-adaptive soft VAD. Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU 2019)
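The soft selection step described in this abstract, where frame-level features are weighted by their speech posteriors and then aggregated, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `soft_vad_pooling`, the choice of a posterior-weighted mean, and normalization by the sum of posteriors are all assumptions made for illustration. The actual system may use a different pooling layer.

```python
from typing import List, Sequence


def soft_vad_pooling(
    frames: Sequence[Sequence[float]],
    speech_posteriors: Sequence[float],
) -> List[float]:
    """Posterior-weighted mean of frame-level features (hypothetical helper).

    Each frame contributes in proportion to its speech posterior, so frames
    that the VAD considers non-speech are softly down-weighted instead of
    being dropped by a hard threshold.
    """
    if len(frames) != len(speech_posteriors):
        raise ValueError("one speech posterior per frame is required")
    if not frames:
        raise ValueError("at least one frame is required")

    total = sum(speech_posteriors)
    if total <= 0.0:
        raise ValueError("the speech posteriors sum to zero")

    dim = len(frames[0])
    pooled = [0.0] * dim
    for frame, posterior in zip(frames, speech_posteriors):
        for d in range(dim):
            pooled[d] += posterior * frame[d]

    # Assumed normalization: divide by the total speech mass.
    return [value / total for value in pooled]


# Toy input: the third frame has posterior 0.0 and is ignored entirely.
frames = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
posteriors = [1.0, 1.0, 0.0]
embedding = soft_vad_pooling(frames, posteriors)
print(embedding)  # [2.0, 1.0]
```

With posteriors restricted to 0 or 1 this reduces to averaging only the frames a hard VAD would keep. Intermediate values let uncertain frames contribute partially.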
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.