    Challenges of developing a digital scribe to reduce clinical documentation burden.

    Clinicians spend a large amount of time on clinical documentation of patient encounters, often at the expense of quality of care and clinician satisfaction, and contributing to physician burnout. Advances in artificial intelligence (AI) and machine learning (ML) open the possibility of automating clinical documentation with digital scribes, using speech recognition to eliminate manual documentation by clinicians or medical scribes. However, developing a digital scribe is fraught with problems due to the complex nature of clinical environments and clinical conversations. This paper identifies and discusses the major challenges of developing automated speech-based documentation in clinical settings: recording high-quality audio, converting audio to transcripts using speech recognition, inducing topic structure from conversation data, extracting medical concepts, generating clinically meaningful summaries of conversations, and obtaining clinical data for AI and ML algorithms.
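
    A minimal, self-contained sketch of the pipeline stages the abstract enumerates (audio capture, speech recognition, topic induction, concept extraction, summarization). Every function below is a placeholder stub for illustration, not the authors' system.

```python
# Placeholder stubs for each stage named in the abstract; not the authors' code.
from dataclasses import dataclass, field

@dataclass
class ScribeNote:
    transcript: str
    topics: list = field(default_factory=list)
    concepts: list = field(default_factory=list)
    summary: str = ""

def speech_to_text(audio: bytes) -> str:
    """Stub ASR step; a real scribe would call a speech recognition service."""
    return "patient reports a dry cough for three days no fever"

def segment_topics(transcript: str) -> list:
    """Stub topic induction over the conversation."""
    return ["history of present illness"]

def extract_concepts(transcript: str) -> list:
    """Stub medical concept extraction (e.g. mapping phrases to a terminology)."""
    return ["cough", "fever (absent)"]

def summarize(transcript: str, topics: list, concepts: list) -> str:
    """Stub generation of a clinically meaningful summary."""
    return "Dry cough x3 days, afebrile."

def digital_scribe(audio: bytes) -> ScribeNote:
    transcript = speech_to_text(audio)       # speech recognition
    topics = segment_topics(transcript)      # topic structure
    concepts = extract_concepts(transcript)  # medical concepts
    summary = summarize(transcript, topics, concepts)
    return ScribeNote(transcript, topics, concepts, summary)

print(digital_scribe(b"raw audio bytes").summary)
```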

    End-to-end speech recognition modeling from de-identified data

    De-identification of data used for automatic speech recognition (ASR) modeling is a critical component in protecting privacy, especially in the medical domain. However, simply removing all personally identifiable information (PII) from end-to-end model training data leads to a significant performance degradation, particularly for the recognition of names, dates, locations, and words from similar categories. We propose and evaluate a two-step method for partially recovering this loss. First, PII is identified, and each occurrence is replaced with a random word sequence of the same category. Then, corresponding audio is produced via text-to-speech or by splicing together matching audio fragments extracted from the corpus. These artificial audio/label pairs, together with speaker turns from the original data without PII, are used to train models. We evaluate the performance of this method on in-house data of medical conversations and observe a recovery of almost the entire performance degradation in the general word error rate while still maintaining strong diarization performance. Our main focus is the improvement of recall and precision in the recognition of PII-related words. Depending on the PII category, between 50% and 90% of the performance degradation can be recovered using our proposed method. Comment: Accepted to INTERSPEECH 202
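
    A rough sketch of the two-step recovery idea, assuming PII spans arrive pre-tagged (e.g. "<NAME>"); the category lists, the tagging format, and the synthesize_speech() stand-in are illustrative assumptions, not the paper's data or tooling.

```python
# Step 1: swap tagged PII spans for random same-category word sequences.
# Step 2: produce matching audio (here just a placeholder for TTS/splicing).
import random
import re

REPLACEMENTS = {
    "NAME": ["alex morgan", "jamie lee"],
    "DATE": ["march third", "the tenth of june"],
    "LOCATION": ["springfield clinic", "riverside hospital"],
}

def replace_pii(transcript: str) -> str:
    """Replace each tagged PII occurrence with a random word sequence
    drawn from the same category."""
    def pick(match: re.Match) -> str:
        return random.choice(REPLACEMENTS[match.group(1)])
    return re.sub(r"<(NAME|DATE|LOCATION)>", pick, transcript)

def synthesize_speech(text: str) -> bytes:
    """Placeholder: a real pipeline would call a TTS engine or splice together
    matching audio fragments extracted from the corpus."""
    return text.encode("utf-8")  # stand-in for waveform bytes

if __name__ == "__main__":
    deidentified = "patient <NAME> was seen on <DATE> at <LOCATION>"
    surrogate_text = replace_pii(deidentified)
    surrogate_audio = synthesize_speech(surrogate_text)
    print(surrogate_text, len(surrogate_audio))
```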

    Effective Detection of Local Languages for Tourists Based on Surrounding Features

    The tourism industry is a trillion-dollar industry, with many governments investing heavily in making their countries attractive enough to entice potential visitors. People engage in tourism for different reasons, which may include business, education, leisure, medical, or ancestral motivations. Communication between intending visitors and locals is essential, given the non-homogeneity that occurs across cultures and borders. In this paper, we focus on developing a cross-platform mobile application that listens to surrounding conversations, picks up certain keywords, automatically switches to the local language of its location, and then offers translation capabilities to facilitate conversations. To implement this, we depend on the Google Translate API for the translation capabilities of the application, starting with English as our base language. To provide the input (speech) for translation, we rely solely on speech recognition using the Speech-to-Text package available for Flutter. The translated output, with the correct pronunciation (and local accent), is produced with the Text-to-Speech package. If the application does not recognize any keywords, the local language can be determined from the geographical parameters of the user. Finally, we utilize the cross-platform competence of the Flutter software development kit and the Dart programming language to build the application.
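
    The application itself is built with Flutter/Dart; the sketch below illustrates only the language-selection logic (keyword match with a geolocation fallback) in Python, and the keyword sets, the country mapping, and the translate() stub are assumptions for illustration rather than the app's actual data.

```python
# Illustrative language-selection logic: keyword match first, location fallback second.
KEYWORDS = {
    "fr": {"bonjour", "merci"},
    "es": {"hola", "gracias"},
    "de": {"hallo", "danke"},
}

COUNTRY_TO_LANGUAGE = {"FR": "fr", "ES": "es", "DE": "de"}

def detect_language(heard_words: set, country_code: str) -> str:
    """Pick the local language from overheard keywords; fall back to the
    user's geographical location when no keyword is recognized."""
    for lang, words in KEYWORDS.items():
        if heard_words & words:
            return lang
    return COUNTRY_TO_LANGUAGE.get(country_code, "en")

def translate(text: str, target_lang: str) -> str:
    """Placeholder for a call to a translation service such as the
    Google Translate API."""
    return f"[{target_lang}] {text}"

if __name__ == "__main__":
    lang = detect_language({"merci", "taxi"}, country_code="FR")
    print(translate("Where is the train station?", lang))
```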

    ASR Error Detection via Audio-Transcript Entailment

    Despite improved performance of the latest Automatic Speech Recognition (ASR) systems, transcription errors are still unavoidable. These errors can have a considerable impact in critical domains such as healthcare, where ASR output is used to help with clinical documentation. Therefore, detecting ASR errors is a critical first step in preventing further error propagation to downstream applications. To this end, we propose a novel end-to-end approach for ASR error detection using audio-transcript entailment. To the best of our knowledge, we are the first to frame this problem as an end-to-end entailment task between an audio segment and its corresponding transcript segment. Our intuition is that there should be bidirectional entailment between audio and transcript when there is no recognition error, and vice versa. The proposed model utilizes an acoustic encoder and a linguistic encoder to model the speech and transcript, respectively. The encoded representations of both modalities are fused to predict the entailment. Since doctor-patient conversations are used in our experiments, a particular emphasis is placed on medical terms. Our proposed model achieves classification error rates (CER) of 26.2% on all transcription errors and 23% on medical errors specifically, improving upon a strong baseline by 12% and 15.4%, respectively. Comment: Accepted to Interspeech 202
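
    A minimal PyTorch sketch of the two-encoder idea, assuming pre-extracted acoustic features and transcript token embeddings; the actual encoders, fusion scheme, and dimensions in the paper may differ.

```python
# Two encoders (acoustic + linguistic), fused representations, binary entailment head.
import torch
import torch.nn as nn

class AudioTranscriptEntailment(nn.Module):
    def __init__(self, audio_dim=80, text_dim=300, hidden=256):
        super().__init__()
        # Acoustic encoder: pools frame-level features into one vector.
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        # Linguistic encoder: pools token embeddings into one vector.
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True)
        # Fusion + classifier over {entailed, not entailed}.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, audio_feats, text_embeds):
        _, a = self.audio_enc(audio_feats)   # final hidden state: (1, B, hidden)
        _, t = self.text_enc(text_embeds)    # final hidden state: (1, B, hidden)
        fused = torch.cat([a.squeeze(0), t.squeeze(0)], dim=-1)
        return self.classifier(fused)        # logits: (B, 2)

model = AudioTranscriptEntailment()
logits = model(torch.randn(4, 120, 80), torch.randn(4, 30, 300))
print(logits.shape)  # torch.Size([4, 2])
```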

    Interpersonal prosodic correlation in frontotemporal dementia.

    Communication accommodation describes how individuals adjust their communicative style to that of their conversational partner. We predicted that interpersonal prosodic correlation related to pitch and timing would be decreased in behavioral variant frontotemporal dementia (bvFTD). We also predicted that the interpersonal correlation in a timing measure and a pitch measure would be increased in right temporal FTD (rtFTD), due to sparing of the neural substrate for speech timing and pitch modulation but loss of social semantics. We found no significant effects in bvFTD, but conversations including rtFTD demonstrated higher interpersonal correlations in speech rate than those of healthy controls.
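
    One simple way to quantify interpersonal prosodic correlation is to correlate a speaker's per-turn speech rate with the partner's adjacent turn; the pairing scheme and the use of speech rate below are assumptions for illustration, not the study's exact analysis.

```python
# Correlate partner-adjacent per-turn speech rates as a toy accommodation measure.
from statistics import correlation  # Pearson's r (Python 3.10+)

def interpersonal_correlation(turn_rates_a, turn_rates_b):
    """turn_rates_a[i] is speaker A's speech rate (e.g. syllables/sec) on turn i;
    turn_rates_b[i] is speaker B's rate on the immediately following turn."""
    return correlation(turn_rates_a, turn_rates_b)

if __name__ == "__main__":
    a = [3.8, 4.1, 3.5, 4.4, 3.9]
    b = [3.6, 4.0, 3.4, 4.5, 3.8]
    print(round(interpersonal_correlation(a, b), 3))
```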

    An End-to-End Conversational Style Matching Agent

    We present an end-to-end voice-based conversational agent that is able to engage in naturalistic multi-turn dialogue and align with the interlocutor's conversational style. The system uses a series of deep neural network components for speech recognition, dialogue generation, prosodic analysis, and speech synthesis to generate language and prosodic expression with qualities that match those of the user. We conducted a user study (N=30) in which participants talked with the agent for 15 to 20 minutes, resulting in over 8 hours of natural interaction data. Users with high-consideration conversational styles reported the agent to be more trustworthy when it matched their conversational style, whereas users with high-involvement conversational styles were indifferent. Finally, we provide design guidelines for multi-turn dialogue interactions using conversational style adaptation.
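
    A rough sketch of the style-matching idea: nudge the agent's synthesis parameters toward the user's measured prosody. The blending factor and parameter names are illustrative assumptions, not the system's actual design.

```python
# Blend the agent's default prosody toward the user's measured prosody.
from dataclasses import dataclass

@dataclass
class Prosody:
    speech_rate: float  # syllables per second
    pitch_hz: float     # median fundamental frequency

def match_style(agent: Prosody, user: Prosody, alpha: float = 0.5) -> Prosody:
    """Move the agent's prosody a fraction `alpha` of the way toward the user's."""
    return Prosody(
        speech_rate=agent.speech_rate + alpha * (user.speech_rate - agent.speech_rate),
        pitch_hz=agent.pitch_hz + alpha * (user.pitch_hz - agent.pitch_hz),
    )

if __name__ == "__main__":
    agent_default = Prosody(speech_rate=4.0, pitch_hz=180.0)
    user_measured = Prosody(speech_rate=5.2, pitch_hz=210.0)
    print(match_style(agent_default, user_measured))
```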