
    Development of a structure for a text-independent speaker identification system

    This article reviews the principal technologies used in building speaker identification systems and the difficulties their developers face. A structure for a text-independent speaker identification system is proposed that uses automatic, speaker-independent segmentation of the speech signal with simultaneous classification of the segments. This approach improves the accuracy of the speaker model and removes the mismatch between the training and recognition contexts.
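The system described above ends in a stage that scores test speech against per-speaker statistical models. As a minimal illustration (not the authors' system, which also performs speaker-independent segmentation), the sketch below models each speaker with a single full-covariance Gaussian over acoustic feature vectors and identifies by total log-likelihood; production systems typically use multi-component GMMs:

```python
import numpy as np

class GaussianSpeakerModel:
    """Single full-covariance Gaussian per speaker -- a simplified
    stand-in for the GMMs usually used in text-independent ID."""
    def fit(self, X):
        self.mu = X.mean(axis=0)
        self.cov = np.cov(X, rowvar=False)
        self.inv = np.linalg.inv(self.cov)
        _, self.logdet = np.linalg.slogdet(self.cov)
        return self

    def loglik(self, X):
        # Total Gaussian log-likelihood of the feature frames in X.
        d = X - self.mu
        q = np.einsum('ij,jk,ik->i', d, self.inv, d)  # per-frame Mahalanobis
        return float(np.sum(-0.5 * (q + self.logdet
                                    + X.shape[1] * np.log(2 * np.pi))))

def identify(models, X):
    """Index of the speaker model with the highest log-likelihood on X."""
    return int(np.argmax([m.loglik(X) for m in models]))
```

Class and function names here are illustrative; the feature vectors would in practice be MFCCs or similar frame-level features.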

    Utilising Tree-Based Ensemble Learning for Speaker Segmentation

    In audio and speech processing, accurate detection of the change points between multiple speakers in a speech stream is an important stage for several applications, such as speaker identification and tracking. Bayesian Information Criterion (BIC)-based approaches are the most widely used, as they have proved very effective for this task. The main criticism levelled against them is the penalty parameter in the BIC function: its use means that fine tuning is required for each variation of the acoustic conditions, and once tuned for a certain condition, the model becomes biased to the training data, limiting its ability to generalise. In this paper, we propose a tuning-free BIC-based approach to speaker segmentation through ensemble learning. A forest of segmentation trees is constructed, each tree trained on a sampled version of the speech segment. During tree construction, a set of randomly selected points in the input sequence is examined as potential segmentation points; the point that yields the highest ΔBIC is chosen, and the same process is repeated for the resulting left and right segments, so that each node of the tree corresponds to the highest ΔBIC and its associated point index. After the forest is built, the ΔBIC accumulated over all trees is calculated for each point, and the positions of the local maxima are taken as speaker change points. The proposed approach is tested on conversations artificially created from the TIMIT database. It shows very accurate results, comparable to those of state-of-the-art methods, with a 9% (absolute) higher F1 score than the standard ΔBIC with an optimally tuned penalty parameter.
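The tree-growing procedure described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `delta_bic` is the standard full-covariance ΔBIC, and the per-tree resampling of the speech segment is simplified to random sampling of candidate split points only:

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Delta-BIC for splitting the feature sequence X (N x d) at frame t.
    Positive values favour a speaker change at t."""
    N, d = X.shape
    logdet = lambda S: np.linalg.slogdet(np.cov(S, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet(X)
            - 0.5 * t * logdet(X[:t])
            - 0.5 * (N - t) * logdet(X[t:])
            - lam * penalty)

def grow_tree(X, lo, hi, votes, rng, n_cand=8, min_len=50):
    """One segmentation tree: score random candidate points, keep the
    best, then recurse on the left and right child segments."""
    if hi - lo <= 2 * min_len:
        return
    cands = rng.integers(lo + min_len, hi - min_len, size=n_cand)
    scores = [delta_bic(X[lo:hi], t - lo) for t in cands]
    best = int(np.argmax(scores))
    if scores[best] <= 0:          # no candidate favours a change: leaf
        return
    t = int(cands[best])
    votes[t] += scores[best]       # accumulate delta-BIC at this point
    grow_tree(X, lo, t, votes, rng, n_cand, min_len)
    grow_tree(X, t, hi, votes, rng, n_cand, min_len)

def forest_segment(X, n_trees=20, seed=0):
    """Accumulated delta-BIC over all trees; local maxima of the
    returned curve are speaker-change hypotheses."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X))
    for _ in range(n_trees):
        grow_tree(X, 0, len(X), votes, rng)
    return votes
```

On synthetic two-Gaussian "speakers" the accumulated votes peak near the true change point, which is the behaviour the accumulation step relies on.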

    Use of vocal source features in speaker segmentation.

    Chan Wai Nang. Thesis (M.Phil.)--Chinese University of Hong Kong, 2006. Includes bibliographical references (leaves 77-82). Abstracts in English and Chinese.
    Chapter 1 Introduction --- p.1
        1.1 Speaker recognition --- p.1
        1.2 State of the art of speaker recognition techniques --- p.2
        1.3 Motivations --- p.5
        1.4 Thesis outline --- p.6
    Chapter 2 Acoustic Features --- p.8
        2.1 Speech production --- p.8
            2.1.1 Physiology of speech production --- p.8
            2.1.2 Source-filter model --- p.11
        2.2 Vocal tract and vocal source related acoustic features --- p.14
        2.3 Linear predictive analysis of speech --- p.15
        2.4 Features for speaker recognition --- p.16
            2.4.1 Vocal tract related features --- p.17
            2.4.2 Vocal source related features --- p.19
        2.5 Wavelet octave coefficients of residues (WOCOR) --- p.20
    Chapter 3 Statistical approaches to speaker recognition --- p.24
        3.1 Statistical modeling --- p.24
            3.1.1 Classification and modeling --- p.24
            3.1.2 Parametric vs non-parametric --- p.25
            3.1.3 Gaussian mixture model (GMM) --- p.25
            3.1.4 Model estimation --- p.27
        3.2 Classification --- p.28
            3.2.1 Multi-class classification for speaker identification --- p.28
            3.2.2 Two-speaker recognition --- p.29
            3.2.3 Model selection by statistical model --- p.30
            3.2.4 Performance evaluation metric --- p.31
    Chapter 4 Content dependency study of WOCOR and MFCC --- p.32
        4.1 Database: CU2C --- p.32
        4.2 Methods and procedures --- p.33
        4.3 Experimental results --- p.35
        4.4 Discussion --- p.36
        4.5 Detailed analysis --- p.39
        Summary --- p.41
    Chapter 5 Speaker Segmentation --- p.43
        5.1 Feature extraction --- p.43
        5.2 Statistical methods for segmentation and clustering --- p.44
            5.2.1 Segmentation by spectral difference --- p.44
            5.2.2 Segmentation by Bayesian information criterion (BIC) --- p.47
            5.2.3 Segment clustering by BIC --- p.49
        5.3 Baseline system --- p.50
            5.3.1 Algorithm --- p.50
            5.3.2 Speech database --- p.52
            5.3.3 Performance metric --- p.53
            5.3.4 Results --- p.58
        Summary --- p.60
    Chapter 6 Application of vocal source features in speaker segmentation --- p.61
        6.1 Discrimination power of WOCOR against MFCC --- p.61
            6.1.1 Experimental set-up --- p.62
            6.1.2 Results --- p.63
        6.2 Speaker segmentation using vocal source features --- p.67
            6.2.1 The construction of the new proposed system --- p.67
        Summary --- p.72
    Chapter 7 Conclusions --- p.74
    Reference --- p.7

    Speaker segmentation and clustering

    This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments by speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. For speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken: their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.
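The clustering half of the pipeline is commonly realised as greedy bottom-up merging under a ΔBIC criterion, one of the deterministic algorithms such surveys cover. A minimal sketch, assuming full-covariance Gaussian segment models (the function names are illustrative):

```python
import numpy as np

def merge_delta_bic(A, B, lam=1.0):
    """BIC change from modelling segments A and B (n x d arrays) with one
    Gaussian instead of two; negative values favour merging."""
    Z = np.vstack([A, B])
    N, d = Z.shape
    logdet = lambda S: np.linalg.slogdet(np.cov(S, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet(Z)
            - 0.5 * len(A) * logdet(A)
            - 0.5 * len(B) * logdet(B)
            - lam * penalty)

def cluster_segments(segments, lam=1.0):
    """Greedy bottom-up clustering: repeatedly merge the pair with the
    most negative delta-BIC until no pair favours merging."""
    clusters = [np.asarray(s) for s in segments]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        scores = [merge_delta_bic(clusters[i], clusters[j], lam)
                  for i, j in pairs]
        k = int(np.argmin(scores))
        if scores[k] >= 0:       # every merge would raise BIC: stop
            break
        i, j = pairs[k]
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for m, c in enumerate(clusters)
                    if m not in (i, j)] + [merged]
    return clusters
```

The stopping rule doubles as an estimate of the number of speakers, which is why BIC-based clustering needs no preset cluster count.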

    Computationally Efficient and Robust BIC-Based Speaker Segmentation

    An algorithm for automatic speaker segmentation based on the Bayesian information criterion (BIC) is presented. BIC tests are not performed at every window shift, as in previous approaches, but only when a speaker change is most probable to occur. This is done by estimating the next probable change point with a model of utterance durations; the inverse Gaussian is found to fit the distribution of utterance durations best. As a result, fewer BIC tests are needed, making the proposed system less demanding in computation time and memory and considerably more effective with respect to missed speaker change points. A feature selection algorithm based on a branch-and-bound search strategy is applied to identify the most effective features for speaker segmentation. Furthermore, a new theoretical formulation of BIC is derived by applying centering and simultaneous diagonalization; it is considerably more computationally efficient than the standard BIC when the covariance matrices are estimated by estimators other than the usual maximum-likelihood ones. Two commonly used pairs of figures of merit are employed and their relationship is established. Computational efficiency is achieved through the speaker-utterance model, whereas robustness is achieved by feature selection and by applying BIC tests at appropriately selected time instants. Experimental results indicate that the proposed modifications yield superior performance compared to existing approaches.
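The duration-model idea can be illustrated with the closed-form maximum-likelihood fit of the inverse Gaussian and its mode, which gives the most probable time of the next speaker change. The function names and the simple scheduling rule below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def fit_inverse_gaussian(durations):
    """Closed-form MLE of the inverse Gaussian IG(mu, lam):
    mu = sample mean, lam = n / sum(1/x_i - 1/mu)."""
    x = np.asarray(durations, dtype=float)
    mu = x.mean()
    lam = len(x) / np.sum(1.0 / x - 1.0 / mu)
    return mu, lam

def ig_mode(mu, lam):
    """Mode of IG(mu, lam): mu * (sqrt(1 + 9mu^2/(4lam^2)) - 3mu/(2lam)),
    the single most probable utterance duration."""
    r = mu / lam
    return mu * (np.sqrt(1.0 + 2.25 * r * r) - 1.5 * r)

def next_test_time(last_change, durations):
    """Schedule the next BIC test where a change is most probable,
    instead of testing at every window shift."""
    mu, lam = fit_inverse_gaussian(durations)
    return last_change + ig_mode(mu, lam)
```

Because the mode of the inverse Gaussian always lies below its mean, this schedule tests slightly earlier than the average utterance length, reducing missed change points.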

    A Novel Method For Speech Segmentation Based On Speakers' Characteristics

    Speech segmentation is the process of change-point detection for partitioning an input audio stream into regions, each of which corresponds to only one audio source or one speaker; one application is in speaker diarization systems. There are several methods for speaker segmentation, but most speaker diarization systems use BIC-based segmentation. The main goal of this paper is to propose a new method for speaker segmentation that is faster than current methods such as BIC while keeping acceptable accuracy. The proposed method is based on the pitch frequency of the speech. Its accuracy is similar to that of common speaker segmentation methods, but its computational cost is much lower: we show that our method is about 2.4 times faster than the BIC-based method, while the average accuracy of the pitch-based method is slightly higher. Comment: 14 pages, 8 figures
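The abstract does not give the algorithm's details, so the sketch below illustrates only the general idea of pitch-based segmentation: compare the mean pitch of adjacent windows along the f0 track and treat peaks of the resulting distance curve as change-point hypotheses. This is an assumed simplification, not the paper's method:

```python
import numpy as np

def pitch_distance_curve(f0, win=50):
    """Distance between the mean pitch of adjacent windows at every
    frame of the f0 track (Hz). Peaks in the returned curve are
    speaker-change hypotheses; unvoiced frames (f0 == 0) are excluded
    from each window's mean."""
    d = np.zeros(len(f0))
    for t in range(win, len(f0) - win):
        lv = f0[t - win:t]
        rv = f0[t:t + win]
        lv, rv = lv[lv > 0], rv[rv > 0]   # keep voiced frames only
        if len(lv) and len(rv):
            d[t] = abs(lv.mean() - rv.mean())
    return d
```

Computing a windowed mean of a scalar pitch track is far cheaper than the covariance estimates and determinants required by ΔBIC, which is consistent with the speed advantage the paper claims.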

    Production and perception of speaker-specific phonetic detail at word boundaries

    Experiments show that learning about familiar voices affects speech processing in many tasks. However, most studies focus on isolated phonemes or words and do not explore which phonetic properties are learned about or retained in memory. This work investigated inter-speaker phonetic variation involving word boundaries, and its perceptual consequences. A production experiment found significant variation in the extent to which speakers used a number of acoustic properties to distinguish junctural minimal pairs e.g. 'So he diced them'—'So he'd iced them'. A perception experiment then tested intelligibility in noise of the junctural minimal pairs before and after familiarisation with a particular voice. Subjects who heard the same voice during testing as during the familiarisation period showed significantly more improvement in identification of words and syllable constituents around word boundaries than those who heard different voices. These data support the view that perceptual learning about the particular pronunciations associated with individual speakers helps listeners to identify syllabic structure and the location of word boundaries