
    Evaluating automatic speaker recognition systems: an overview of the NIST speaker recognition evaluations (1996-2014)

    Automatic speaker recognition systems show interesting properties, such as speed of processing and repeatability of results, in contrast to speaker recognition by humans, but they are usable only if they are reliable. Testability, or the ability to extensively evaluate the quality of the speaker detector's decisions, therefore becomes critical. Over the last 20 years, the US National Institute of Standards and Technology (NIST) has organized a series of text-independent Speaker Recognition Evaluations (SRE), providing the speech data and evaluation protocols. These evaluations have become not just a periodic benchmark but also a meeting point for a collaborative community of scientists deeply involved in the evaluation cycle, enabling tremendous progress in an especially complex task in which speaker information is spread across different information levels (acoustic, prosodic, linguistic…) and is strongly affected by speaker-intrinsic and -extrinsic variability factors. In this paper, we outline how the evaluations progressively challenged the technology with new speaking conditions and sources of variability, and how the scientific community responded to those demands. Finally, we show that NIST SREs are not free of drawbacks, and we discuss future challenges for speaker recognition assessment.
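    The evaluations treat speaker recognition as a detection task, in which each trial is scored as target or non-target. As an illustration of the kind of summary metric used to compare detectors in such evaluations, the sketch below estimates an equal error rate (EER) from trial scores; it is a minimal, hypothetical example on synthetic scores, not NIST's official scoring tooling (whose primary metric is a detection cost function).

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Estimate a detector's equal error rate (EER) from raw trial scores."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: target trials scored below the threshold (false rejections).
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: non-target trials scored at or above the threshold.
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))  # operating point where the two rates cross
    return (miss[idx] + fa[idx]) / 2.0

# Toy usage with synthetic scores (illustration only, not NIST data).
rng = np.random.default_rng(0)
eer = equal_error_rate(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000))
print(f"EER ~ {eer:.2%}")
```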

    Local representations and random sampling for speaker verification

    Over the last decade, text-independent speaker verification studies have focused on compensating for intra-speaker variability at the modeling stage. Intra-speaker variability may be due to channel effects, phonetic content, or the speaker themselves, in the form of speaking style, emotional state, health, or similar factors. Joint Factor Analysis, Total Variability Space compensation, and Nuisance Attribute Projection are among the most successful approaches to inter-session variability compensation in the literature. In this thesis, we question the assumption in these methods that the channel space is low-dimensional and propose partitioning the acoustic space into local regions, so that intra-speaker variability compensation can be performed in each local region separately. Two architectures are proposed, depending on whether the subsequent modeling and scoring steps are also done locally or globally. We also focus on a particular component of intra-speaker variability, namely within-session variability, whose main source is the difference in phonetic content between speech segments within a single utterance. This variability in phonetic content may be due either to variation across acoustic events or to differences in the actual realizations of those events. We propose a method to combat these variabilities through random sampling of the training utterance, and show it to be effective for both short and long test utterances.
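    As an illustration of the random-sampling idea, and not the thesis's exact procedure, the sketch below draws several random frame subsets from a single training utterance; a separate speaker model could be trained on each subset and the resulting scores averaged, so that no single phonetic composition dominates the model. The function name, subset count, and sampling fraction are assumptions chosen for the example.

```python
import numpy as np

def random_frame_subsets(features, n_subsets=10, fraction=0.5, seed=0):
    """Draw random frame subsets from one training utterance.

    features: (n_frames, dim) acoustic feature matrix (e.g. MFCCs).
    Returns n_subsets matrices, each containing a random half of the frames.
    """
    rng = np.random.default_rng(seed)
    n_frames = features.shape[0]
    size = max(1, int(fraction * n_frames))
    return [features[rng.choice(n_frames, size=size, replace=False)]
            for _ in range(n_subsets)]

# Toy usage: 300 frames of 20-dimensional features.
subsets = random_frame_subsets(np.random.randn(300, 20))
print(len(subsets), subsets[0].shape)
```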

    Development of a digital biomarker and intervention for subclinical depression: study protocol for a longitudinal waitlist control study

    Background: Depression remains a global health problem, with its prevalence rising worldwide. Digital biomarkers are increasingly investigated to initiate and tailor scalable interventions targeting depression. Due to the steady influx of new cases, focusing on treatment alone will not suffice; academics and practitioners need to focus on the prevention of depression (i.e., addressing subclinical depression). Aim: With our study, we aim to (i) develop digital biomarkers for subclinical symptoms of depression, (ii) develop digital biomarkers for the severity of subclinical depression, and (iii) investigate the efficacy of a digital intervention in reducing symptoms and severity of subclinical depression. Method: Participants will interact with the digital intervention BEDDA, consisting of a scripted conversational agent, the slow-paced breathing training Breeze, and actionable advice for different symptoms. The intervention comprises 30 daily interactions to be completed in less than 45 days. We will collect self-reports regarding mood, agitation, and anhedonia (proximal outcomes; first objective), self-reports regarding depression severity (primary distal outcome; second and third objectives), anxiety severity (secondary distal outcome; second and third objectives), stress (secondary distal outcome; second and third objectives), voice, and breathing. A subsample of 25% of the participants will use smartwatches to record physiological data (e.g., heart rate, heart rate variability), which will be used in the analyses for all three objectives. Discussion: Digital voice- and breathing-based biomarkers may improve diagnosis, prevention, and care by enabling an unobtrusive assessment that is either complementary or an alternative to self-reports. Furthermore, our results may advance the understanding of the psychophysiological changes underlying subclinical depression. Our study also provides further evidence regarding the efficacy of standalone digital health interventions to prevent depression. Trial registration: Ethics approval was provided by the Ethics Commission of ETH Zurich (EK-2022-N-31) and the study was registered in the ISRCTN registry (reference number: ISRCTN38841716; submission date: 20/08/2022).
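    The protocol plans to use smartwatch heart rate variability data in its analyses. As a purely illustrative example of the kind of feature such analyses might start from, and not a method described in this protocol, the sketch below computes RMSSD, a common time-domain heart rate variability measure, from a series of RR intervals.

```python
import numpy as np

def rmssd(rr_intervals_ms):
    """Root mean square of successive differences (RMSSD) of RR intervals (ms),
    a standard time-domain heart rate variability measure."""
    rr = np.asarray(rr_intervals_ms, dtype=float)
    diffs = np.diff(rr)  # successive beat-to-beat changes
    return float(np.sqrt(np.mean(diffs ** 2)))

# Toy usage: RR intervals around 800 ms (75 bpm) with mild variability.
rr = 800 + np.random.default_rng(1).normal(0, 30, 120)
print(f"RMSSD = {rmssd(rr):.1f} ms")
```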

    Quantitative Differences in the Conversational Performance of People with Severe Expressive Aphasia Using Three Types of Visual Screen Displays on Speech Generating Devices

    This multiple single-subject research study measured quantitative differences in communication success, communicator roles, and act functions during dyadic conversational interactions between six people with severe aphasia and their peer communication partners across conditions involving a type of augmentative communication intervention, speech-generating devices (SGDs). Researchers assessed these variables across three conditions involving the message display of the SGD: no display (Condition A), visual scenes (contextual photographic) display (Condition B), and Traditional Grid Display (Condition C), while participants engaged in conversational storytelling. This study is important because technology is currently being developed to assist people with aphasia in accessing messages stored on an SGD by activating photographic representations that access a set of spoken messages related to the photo. This contrasts with a more traditional method of representing messages, in which decontextualized line drawings associated with individual concepts are displayed on the screen. Results from this study indicate that interactions between peer communication partners and people with aphasia can and do benefit from external, symbolic representation of messages on AAC devices. However, an unexpected finding was that, given too much contextual information as with visual scenes, peer communication partners can deduce the content and context of the story, and are thereby more apt to dominate the conversation than they are with no display.

    Using visual scene displays to create a shared communication space for a person with aphasia

    Background: Low-tech visual scene displays (VSDs) combine contextually rich pictures and written text to support the communication of people with aphasia. VSDs create a shared communication space in which a person with aphasia and a communication partner co-construct messages. Aims: The researchers examined the effect of low-tech VSDs on the content and quality of communicative interactions between a person with aphasia and unfamiliar communication partners. Methods & Procedures: One person with aphasia and nine unfamiliar communication partners engaged in short, one-on-one conversations about a specified topic in one of three conditions: shared-VSDs, non-shared-VSDs, and no-VSDs. Data included discourse analysis scores reflecting the conceptual complexity of utterances, content unit analyses of the information communication partners gathered from the interaction, and Likert-scale responses from the person with aphasia about his perception of communicative ease and effectiveness. Outcomes & Results: Comparisons across conditions revealed that: (a) the most conversational turns occurred in the shared-VSDs condition; (b) communication partners produced utterances with higher conceptual complexity in the shared-VSDs condition; (c) the person with aphasia conveyed the greatest number of content units in the shared-VSDs condition; and (d) the person with aphasia perceived that information transfer, ease of conversational interaction, and partner understanding were best in the shared-VSDs condition. Conclusions: These findings suggest that low-tech VSDs affect the manner and extent to which a person with aphasia and a communication partner contribute to conversational interactions involving information transfer.

    The impact of the Lombard effect on audio and visual speech recognition systems

    When producing speech in noisy backgrounds, talkers reflexively adapt their speaking style in ways that increase speech-in-noise intelligibility. This adaptation, known as the Lombard effect, is likely to have an adverse effect on the performance of automatic speech recognition systems that have not been designed to anticipate it. However, previous studies of this impact have used very small amounts of data and recognition systems that lack modern adaptation strategies. This paper aims to rectify this by using a new audio-visual Lombard corpus containing speech from 54 different speakers (significantly larger than any previously available) and modern state-of-the-art speech recognition techniques. The paper is organised as three speech-in-noise recognition studies. The first examines the case in which a system trained exclusively on normal speech is presented with Lombard speech. The Lombard mismatch caused a significant decrease in performance even when the level of the Lombard speech was normalised to match the level of normal speech. However, the size of the mismatch was highly speaker-dependent, which explains the conflicting results reported in previous, smaller studies. The second study compares systems trained in matched conditions (i.e., training and testing with the same speaking style). Here, Lombard speech affords a large increase in recognition performance. Part of this is due to the greater energy leading to a reduction in noise masking, but performance improvements persist even after the effect of the signal-to-noise level difference is compensated for. An analysis across speakers shows that the Lombard speech energy is spectro-temporally distributed in a way that reduces energetic masking, and this reduction in masking is associated with an increase in recognition performance. The final study repeats the first two using a recognition system trained on visual speech. In the visual domain, performance differences are not confounded by differences in noise masking. In matched conditions, Lombard speech supports better recognition performance than normal speech; the benefit was consistently present across all speakers, but to a varying degree. Surprisingly, a small Lombard benefit was observed even when training on mismatched non-Lombard visual speech, i.e., the increased clarity of the Lombard speech outweighed the impact of the mismatch. The paper presents two generally applicable conclusions: (i) systems designed to operate in noise will benefit from being trained on well-matched Lombard speech data; (ii) the results of speech recognition evaluations that employ artificial speech-and-noise mixing need to be treated with caution: they are overly optimistic to the extent that they ignore a significant source of mismatch, but at the same time overly pessimistic in that they do not anticipate the potentially increased intelligibility of the Lombard speaking style.
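    The studies above rely on artificially mixing speech with noise at controlled signal-to-noise ratios and on level-normalising the Lombard speech. As a generic illustration of that kind of mixing, and not the paper's specific pipeline or corpus tooling, the sketch below scales a noise signal so that it is added to the speech at a target SNR.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target signal-to-noise ratio (in dB).

    The noise is scaled so that 10*log10(P_speech / P_noise) equals snr_db,
    i.e. the artificial speech-and-noise mixing discussed above.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Toy usage: one second of 16 kHz white "speech" mixed with white noise at 5 dB SNR.
rng = np.random.default_rng(0)
mixed = mix_at_snr(rng.normal(size=16000), rng.normal(size=16000), snr_db=5)
```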

    Multimodal Egocentric Analysis of Focused Interactions

    Continuous detection of social interactions from wearable sensor data streams has a range of potential applications in domains including health and social care, security, and assistive technology. A focused interaction occurs when co-present individuals, having a mutual focus of attention, interact by first establishing face-to-face engagement and direct conversation. We contribute an annotated, multimodal data set capturing such interactions using video, audio, GPS, and inertial sensing. We present methods for automatic detection and temporal segmentation of focused interactions using support vector machines and recurrent neural networks, with features extracted from both audio and video streams. We describe an evaluation protocol, including framewise, extended framewise, and event-based measures, and provide empirical evidence that fusing visual face-track scores with audio voice-activity scores is an effective combination. The methods, contributed data set, and protocol together provide a benchmark for future research on this problem.
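    The reported fusion of visual face-track scores with audio voice-activity scores could, in its simplest form, be a framewise late fusion followed by thresholding into segments. The sketch below is a hypothetical illustration of that idea; the weighting, normalisation, and thresholding are assumptions for the example, not the paper's actual SVM/RNN pipeline.

```python
import numpy as np

def late_fusion(face_scores, vad_scores, w_face=0.6):
    """Framewise late fusion of visual face-track scores and audio
    voice-activity scores into a single per-frame interaction score."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-9)  # min-max normalise
    return w_face * norm(face_scores) + (1 - w_face) * norm(vad_scores)

def segment(fused, threshold=0.5):
    """Return (start, end) frame indices of runs where the fused score
    exceeds the threshold, i.e. candidate focused-interaction segments."""
    active = fused > threshold
    edges = np.flatnonzero(np.diff(active.astype(int))) + 1
    bounds = np.concatenate([[0], edges, [len(active)]])
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if active[s]]

# Toy usage with synthetic scores for 100 frames.
rng = np.random.default_rng(0)
print(segment(late_fusion(rng.random(100), rng.random(100))))
```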