Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation
We present a probabilistic model that uses both prosodic and lexical cues for
the automatic segmentation of speech into topically coherent units. We propose
two methods for combining lexical and prosodic information using hidden Markov
models and decision trees. Lexical information is obtained from a speech
recognizer, and prosodic features are extracted automatically from speech
waveforms. We evaluate our approach on the Broadcast News corpus, using the
DARPA-TDT evaluation metrics. Results show that the prosodic model alone is
competitive with word-based segmentation methods. Furthermore, we achieve a
significant reduction in error by combining the prosodic and word-based
knowledge sources.
Comment: 27 pages, 8 figures
Grounding semantics in robots for Visual Question Answering
In this thesis I describe an operational implementation of an object detection and description system that is incorporated into an end-to-end Visual Question Answering system, and I evaluate it on two visual question answering datasets for compositional language and elementary visual reasoning.
Linguistically Aided Speaker Diarization Using Speaker Role Information
Speaker diarization relies on the assumption that speech segments
corresponding to a particular speaker are concentrated in a specific region of
the speaker space; a region which represents that speaker's identity. These
identities are not known a priori, so a clustering algorithm is typically
employed, which is traditionally based solely on audio. Under noisy conditions,
however, such an approach poses the risk of generating unreliable speaker
clusters. In this work we aim to utilize linguistic information as a
supplemental modality to identify the various speakers in a more robust way. We
are focused on conversational scenarios where the speakers assume distinct
roles and are expected to follow different linguistic patterns. This distinct
linguistic variability can be exploited to help us construct the speaker
identities. That way, we are able to boost the diarization performance by
converting the clustering task to a classification one. The proposed method is
applied in real-world dyadic psychotherapy interactions between a provider and
a patient and is shown to improve results.
Comment: from v1: restructured Introduction and Background, added experimental
results with ASR text and language-only baseline
Automated CT and MRI Liver Segmentation and Biometry Using a Generalized Convolutional Neural Network.
Purpose: To assess feasibility of training a convolutional neural network (CNN) to automate liver segmentation across different imaging modalities and techniques used in clinical practice and apply this to enable automation of liver biometry.
Methods: We trained a 2D U-Net CNN for liver segmentation in two stages using 330 abdominal MRI and CT exams acquired at our institution. First, we trained the neural network with non-contrast multi-echo spoiled-gradient-echo (SGPR) images with 300 MRI exams to provide multiple signal-weightings. Then, we used transfer learning to generalize the CNN with additional images from 30 contrast-enhanced MRI and CT exams. We assessed the performance of the CNN using a distinct multi-institutional data set curated from multiple sources (n = 498 subjects). Segmentation accuracy was evaluated by computing Dice scores. Utilizing these segmentations, we computed liver volume from CT and T1-weighted (T1w) MRI exams, and estimated hepatic proton-density fat fraction (PDFF) from multi-echo T2*w MRI exams. We compared quantitative volumetry and PDFF estimates between automated and manual segmentation using Pearson correlation and Bland-Altman statistics.
Results: Dice scores were 0.94 ± 0.06 for CT (n = 230), 0.95 ± 0.03 (n = 100) for T1w MR, and 0.92 ± 0.05 for T2*w MR (n = 169). Liver volume measured by manual and automated segmentation agreed closely for CT (95% limit-of-agreement (LoA) = [-298 mL, 180 mL]) and T1w MR (LoA = [-358 mL, 180 mL]). Hepatic PDFF measured by the two segmentations also agreed closely (LoA = [-0.62%, 0.80%]).
Conclusions: Utilizing a transfer-learning strategy, we have demonstrated the feasibility of a CNN to be generalized to perform liver segmentations across different imaging techniques and modalities. With further refinement and validation, CNNs may have broad applicability for multimodal liver volumetry and hepatic tissue characterization.
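The abstract above reports two agreement measures, Dice scores and Bland-Altman 95% limits of agreement. As a rough illustration of how such figures are commonly computed, here is a minimal sketch in plain Python. It is not the authors' code: the function names `dice_score` and `limits_of_agreement`, the flat 0/1 mask representation, and the toy numbers are all assumptions made for this example.

```python
from statistics import mean, stdev


def dice_score(pred, truth):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * overlap / total


def limits_of_agreement(auto, manual):
    """Bland-Altman 95% limits: mean difference ± 1.96 × sample SD of the differences."""
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias - spread, bias + spread


# Toy values, invented purely for illustration (not data from the study).
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [1, 1, 0, 1, 0, 0]
print(dice_score(pred_mask, true_mask))

auto_volumes_ml = [1510.0, 1320.0, 1745.0]
manual_volumes_ml = [1500.0, 1340.0, 1730.0]
print(limits_of_agreement(auto_volumes_ml, manual_volumes_ml))
```

A Dice score near 1 means the automated and manual masks overlap almost completely, while narrow limits of agreement centred near zero mean the two methods yield similar derived measurements such as liver volume.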
Development of efficient techniques for ASR System for Speech Detection and Recognization system using Gaussian Mixture Model- Universal Background Model
Automatic speech recognition (ASR) is the process of transforming speech waveforms into text so that computers can interpret and act upon human speech; its practical uses include meeting transcription and smart speakers. The primary focus of this work is scalable strategies for developing ASR systems in languages where no speech transcriptions or pronunciation dictionaries exist. We first show that the need for transcribed speech in the target language can be greatly reduced through cross-lingual acoustic model transfer when phonemic pronunciation lexicons exist in the new language. We then investigate three approaches to dealing with languages that lack a pronunciation lexicon. We also examine the efficiency of graphemic acoustic model transfer, which makes it easy to build pronunciation dictionaries. The problems addressed in the thesis can be solved, in part, by investigating optimization strategies for training on huge corpora (such as GA+HMM and DE+HMM). In the training phase of acoustic modelling, the suggested method is applied to traditional methods. Experiments on read speech and HMI speech indicated that while each data augmentation strategy alone did not always increase recognition performance, using all three techniques together did. Power normalised cepstral coefficient (PNCC) features are slightly modified in this work to enhance verification accuracy. To increase speaker verification accuracy, we suggest employing multiple Gaussian Mixture Model-Universal Background Model (GMM-UBM) and SVM classifiers. Importantly, pitch shift data augmentation and multi-task training reduced bias by more than 18% absolute compared to the baseline system for read speech, and applying all three data augmentation techniques during fine tuning reduced bias by more than 7% for HMI speech, while increasing recognition accuracy of both native and non-native Dutch speech.
Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos
We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (< 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as "Good job", "Thank you", "Please", and "You're welcome", as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language.