Challenges of developing a digital scribe to reduce clinical documentation burden.
Clinicians spend a large amount of time on clinical documentation of patient encounters, which often reduces quality of care and clinician satisfaction and contributes to physician burnout. Advances in artificial intelligence (AI) and machine learning (ML) open the possibility of automating clinical documentation with digital scribes, using speech recognition to eliminate manual documentation by clinicians or medical scribes. However, developing a digital scribe is fraught with problems due to the complex nature of clinical environments and clinical conversations. This paper identifies and discusses major challenges associated with developing automated speech-based documentation in clinical settings: recording high-quality audio, converting audio to transcripts using speech recognition, inducing topic structure from conversation data, extracting medical concepts, generating clinically meaningful summaries of conversations, and obtaining clinical data for AI and ML algorithms.
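The stages this paper enumerates can be viewed as a linear pipeline. The stub functions below are purely illustrative placeholders (names and behavior invented for this sketch), not components of any actual digital scribe:

```python
# Hypothetical pipeline skeleton mirroring the stages the abstract names;
# each stage is a trivial stub, not a working clinical component.
def digital_scribe(audio_transcript):
    transcript = transcribe(audio_transcript)   # speech recognition
    segments = segment_topics(transcript)       # topic structure induction
    concepts = extract_concepts(segments)       # medical concept extraction
    return summarize(concepts)                  # clinical summary generation

def transcribe(audio_transcript):
    # Stand-in for an ASR system; here the "audio" is already text.
    return audio_transcript.lower()

def segment_topics(transcript):
    # Naive topic segmentation: one segment per sentence.
    return [s for s in transcript.split(". ") if s]

def extract_concepts(segments):
    # Toy concept extraction: take the leading word of each segment.
    return [seg.split()[0] for seg in segments]

def summarize(concepts):
    # Toy summary: join the extracted concepts.
    return "; ".join(concepts)
```

Each real stage (audio capture, ASR, segmentation, concept extraction, summarization) is a research problem in its own right, which is precisely the paper's point.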
End-to-end speech recognition modeling from de-identified data
De-identification of data used for automatic speech recognition modeling is a
critical component in protecting privacy, especially in the medical domain.
However, simply removing all personally identifiable information (PII) from
end-to-end model training data leads to a significant performance degradation
in particular for the recognition of names, dates, locations, and words from
similar categories. We propose and evaluate a two-step method for partially
recovering this loss. First, PII is identified, and each occurrence is replaced
with a random word sequence of the same category. Then, corresponding audio is
produced via text-to-speech or by splicing together matching audio fragments
extracted from the corpus. These artificial audio/label pairs, together with
speaker turns from the original data without PII, are used to train models. We
evaluate the performance of this method on in-house data of medical
conversations and observe a recovery of almost the entire performance
degradation in the general word error rate while still maintaining a strong
diarization performance. Our main focus is the improvement of recall and
precision in the recognition of PII-related words. Depending on the PII
category, between of the performance degradation can be recovered
using our proposed method.
Comment: Accepted to INTERSPEECH 202
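Step 1 of the method described above, replacing each identified PII span with a random word sequence of the same category, can be sketched as follows. The category lexicon, span format, and function name are illustrative assumptions, not the paper's actual system:

```python
import random

# Hypothetical per-category replacement lexicons; the paper's actual
# category inventory and sampling scheme are not specified here.
CATEGORY_LEXICON = {
    "NAME": ["alice johnson", "mark rivera", "priya patel"],
    "DATE": ["march third", "july twentieth", "december first"],
    "LOCATION": ["springfield", "lakeside clinic", "oak street"],
}

def replace_pii(tokens, pii_spans, rng=random):
    """Replace each tagged PII span (start, end, category) with a random
    same-category word sequence, returning a new token list."""
    out, i = [], 0
    for start, end, category in sorted(pii_spans):
        out.extend(tokens[i:start])                              # keep non-PII tokens
        out.extend(rng.choice(CATEGORY_LEXICON[category]).split())  # substitute PII
        i = end
    out.extend(tokens[i:])
    return out
```

The corresponding audio for the substituted words would then be produced via text-to-speech or by splicing matching audio fragments from the corpus, which this text-only sketch does not cover.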
Effective Detection of Local Languages for Tourists Based on Surrounding Features
The tourism industry is a trillion-dollar industry, with many governments investing heavily in making their countries attractive enough to entice potential visitors. People engage in tourism for different reasons, which can include business, education, leisure, medical, or ancestral motives. Communication between intending visitors and locals is essential, given the non-homogeneity that occurs across cultures and borders. In this paper, we focus on developing a cross-platform mobile application that listens to surrounding conversations, is able to pick out certain keywords, automatically switches to the local language of its location, and then offers translation capabilities to facilitate conversations. To implement this, we depend on the Google Translate API for the translation capabilities of the application, starting with English as our base language. To provide the input (speech) for translation, we rely solely on speech recognition using the Speech-to-Text package available in Flutter. The output, with the correct pronunciation (and local accent) of the translation, is produced with the Text-to-Speech package. If the application does not recognize any keywords, the local language can be determined using the geographical parameters of the user. Finally, we utilize the cross-platform competence of the Flutter software development kit and the Dart programming language to build the application.
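The keyword-then-geolocation fallback the paper describes might look like the following sketch. The keyword table, country mapping, and function name are invented for illustration, and the Flutter/Google Translate integration is omitted entirely:

```python
# Hypothetical lookup tables; a real app would ship far larger ones.
KEYWORD_TO_LANG = {
    "bonjour": "fr", "merci": "fr",
    "hola": "es", "gracias": "es",
}
COUNTRY_TO_LANG = {"FR": "fr", "ES": "es", "NG": "en"}

def detect_local_language(transcript, country_code, default="en"):
    """Pick the local language from recognized keywords in the speech
    transcript; fall back to the user's geographical location (here a
    country code) when no keyword matches."""
    for word in transcript.lower().split():
        if word in KEYWORD_TO_LANG:
            return KEYWORD_TO_LANG[word]
    return COUNTRY_TO_LANG.get(country_code, default)
```

The detected language code would then be passed to the translation and text-to-speech components.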
ASR Error Detection via Audio-Transcript Entailment
Despite improved performances of the latest Automatic Speech Recognition
(ASR) systems, transcription errors are still unavoidable. These errors can
have a considerable impact in critical domains such as healthcare, when used to
help with clinical documentation. Therefore, detecting ASR errors is a critical
first step in preventing further error propagation to downstream applications.
To this end, we propose a novel end-to-end approach for ASR error detection
using audio-transcript entailment. To the best of our knowledge, we are the
first to frame this problem as an end-to-end entailment task between the audio
segment and its corresponding transcript segment. Our intuition is that there
should be a bidirectional entailment between audio and transcript when there is
no recognition error and vice versa. The proposed model utilizes an acoustic
encoder and a linguistic encoder to model the speech and transcript
respectively. The encoded representations of both modalities are fused to
predict the entailment. Since doctor-patient conversations are used in our
experiments, a particular emphasis is placed on medical terms. Our proposed
model achieves classification error rates (CER) of 26.2% on all transcription
errors and 23% on medical errors specifically, leading to improvements upon a
strong baseline by 12% and 15.4%, respectively.
Comment: Accepted to Interspeech 202
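A late-fusion entailment scorer in the spirit of the model above can be sketched as follows, with the acoustic and linguistic encoders abstracted away as precomputed embedding vectors. The weights, dimensions, and function name are illustrative, not the paper's architecture:

```python
import math

def fuse_and_score(audio_emb, text_emb, weights, bias=0.0):
    """Concatenate acoustic and linguistic embeddings and apply a single
    logistic layer to produce an entailment probability (high = audio and
    transcript entail each other, i.e. no recognition error)."""
    fused = list(audio_emb) + list(text_emb)      # late fusion by concatenation
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))             # sigmoid
```

In the paper the two encoders are trained end-to-end with the fusion layer; here they are stand-ins for any fixed-size representations of the audio segment and its transcript segment.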
Interpersonal prosodic correlation in frontotemporal dementia.
Communication accommodation describes how individuals adjust their communicative style to that of their conversational partner. We predicted that interpersonal prosodic correlation related to pitch and timing would be decreased in behavioral variant frontotemporal dementia (bvFTD). We predicted that the interpersonal correlation in a timing measure and a pitch measure would be increased in right temporal FTD (rtFTD) due to sparing of the neural substrate for speech timing and pitch modulation but loss of social semantics. We found no significant effects in bvFTD, but conversations including rtFTD demonstrated higher interpersonal correlations in speech rate than healthy controls.
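In the simplest case, an interpersonal prosodic correlation of the kind referenced above reduces to a Pearson correlation between two speakers' per-turn prosodic measurements (e.g. speech rates). A minimal sketch, not the study's actual analysis pipeline:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length series, e.g. the
    per-turn speech rates of two conversational partners."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 would indicate strong accommodation on that measure (partners speeding up and slowing down together), while a value near 0 would indicate none.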
An End-to-End Conversational Style Matching Agent
We present an end-to-end voice-based conversational agent that is able to
engage in naturalistic multi-turn dialogue and align with the interlocutor's
conversational style. The system uses a series of deep neural network
components for speech recognition, dialogue generation, prosodic analysis and
speech synthesis to generate language and prosodic expression with qualities
that match those of the user. We conducted a user study (N=30) in which
participants talked with the agent for 15 to 20 minutes, resulting in over 8
hours of natural interaction data. Users with high consideration conversational
styles reported the agent to be more trustworthy when it matched their
conversational style, whereas users with high involvement conversational
styles were indifferent. Finally, we provide design guidelines for multi-turn
dialogue interactions using conversational style adaptation.
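One simple way to realize the style alignment described above is to smooth the agent's prosodic parameters toward the user's measured values each turn. This is a hypothetical simplification; the actual system uses deep neural components for prosodic analysis and speech synthesis:

```python
def match_style(agent_params, user_params, alpha=0.3):
    """Move each of the agent's prosodic parameters (e.g. speech rate,
    pitch) a fraction alpha of the way toward the user's measured value.
    alpha=0 keeps the agent unchanged; alpha=1 copies the user exactly."""
    return {k: agent_params[k] + alpha * (user_params[k] - agent_params[k])
            for k in agent_params}
```

Called once per turn, this converges the agent's style toward the user's while avoiding abrupt, unnatural jumps.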