Search CORE

1,259 research outputs found

Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation

Author: Andreas Stolcke
Dilek Hakkani-Tür
Elizabeth Shriberg
Grosz B.
Gökhan Tür
Hearst Marti A
Passonneau Rebecca J
Publication venue
Publication date: 01/01/2000
Field of study

We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources.Comment: 27 pages, 8 figure

arXiv.org e-Print Archive

CiteSeerX

Crossref

Bilkent University Institutional Repository

Grounding semantics in robots for Visual Question Answering

Author: Wahle Björn
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019
Field of study

In this thesis I describe an operational implementation of an object detection and description system that incorporates in an end-to-end Visual Question Answering system and evaluated it on two visual question answering datasets for compositional language and elementary visual reasoning

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Linguistically Aided Speaker Diarization Using Speaker Role Information

Author: Flemotomos Nikolaos
Georgiou Panayiotis
Narayanan Shrikanth
Publication venue: 'International Speech Communication Association'
Publication date: 05/02/2020
Field of study

Speaker diarization relies on the assumption that speech segments corresponding to a particular speaker are concentrated in a specific region of the speaker space; a region which represents that speaker's identity. These identities are not known a priori, so a clustering algorithm is typically employed, which is traditionally based solely on audio. Under noisy conditions, however, such an approach poses the risk of generating unreliable speaker clusters. In this work we aim to utilize linguistic information as a supplemental modality to identify the various speakers in a more robust way. We are focused on conversational scenarios where the speakers assume distinct roles and are expected to follow different linguistic patterns. This distinct linguistic variability can be exploited to help us construct the speaker identities. That way, we are able to boost the diarization performance by converting the clustering task to a classification one. The proposed method is applied in real-world dyadic psychotherapy interactions between a provider and a patient and demonstrated to show improved results.Comment: from v1: restructured Introduction and Background, added experimental results with ASR text and language-only baselin

arXiv.org e-Print Archive

Crossref

Recommended from our members

Automated CT and MRI Liver Segmentation and Biometry Using a Generalized Convolutional Neural Network.

Author: Bahrami Naeim
Bass Emily
Blansit Kevin
Cunha Guilherme
Delgado Timoteo
Hasenstab Kyle
Hsiao Albert
Loomba Rohit
Mamidipalli Adrija
members of the NASH Clinical Research Network
Middleton Michael S
Neuschwander-Tetri Brent A
Retson Tara
Sirlin Claude B
Wang Kang
Publication venue: eScholarship, University of California
Publication date: 01/03/2019
Field of study

PurposeTo assess feasibility of training a convolutional neural network (CNN) to automate liver segmentation across different imaging modalities and techniques used in clinical practice and apply this to enable automation of liver biometry.MethodsWe trained a 2D U-Net CNN for liver segmentation in two stages using 330 abdominal MRI and CT exams acquired at our institution. First, we trained the neural network with non-contrast multi-echo spoiled-gradient-echo (SGPR)images with 300 MRI exams to provide multiple signal-weightings. Then, we used transfer learning to generalize the CNN with additional images from 30 contrast-enhanced MRI and CT exams.We assessed the performance of the CNN using a distinct multi-institutional data set curated from multiple sources (n = 498 subjects). Segmentation accuracy was evaluated by computing Dice scores. Utilizing these segmentations, we computed liver volume from CT and T1-weighted (T1w) MRI exams, and estimated hepatic proton- density-fat-fraction (PDFF) from multi-echo T2*w MRI exams. We compared quantitative volumetry and PDFF estimates between automated and manual segmentation using Pearson correlation and Bland-Altman statistics.ResultsDice scores were 0.94 ± 0.06 for CT (n = 230), 0.95 ± 0.03 (n = 100) for T1w MR, and 0.92 ± 0.05 for T2*w MR (n = 169). Liver volume measured by manual and automated segmentation agreed closely for CT (95% limit-of-agreement (LoA) = [-298 mL, 180 mL]) and T1w MR (LoA = [-358 mL, 180 mL]). Hepatic PDFF measured by the two segmentations also agreed closely (LoA = [-0.62%, 0.80%]).ConclusionsUtilizing a transfer-learning strategy, we have demonstrated the feasibility of a CNN to be generalized to perform liver segmentations across different imaging techniques and modalities. With further refinement and validation, CNNs may have broad applicability for multimodal liver volumetry and hepatic tissue characterization

eScholarship - University of California

Development of efficient techniques for ASR System for Speech Detection and Recognization system using Gaussian Mixture Model- Universal Background Model

Author: Veera V Rama Rao M et al.
Publication venue: Auricle Global Society of Education and Research
Publication date: 02/11/2023
Field of study

Some practical uses of ASR have been implemented, including the transcription of meetings and the usage of smart speakers. It is the process by which speech waves are transformed into text that allows computers to interpret and act upon human speech. Scalable strategies for developing ASR systems in languages where no voice transcriptions or pronunciation dictionaries exist are the primary focus of this work. We first show that the necessity for voice transcription into the target language can be greatly reduced through cross-lingual acoustic model transfer when phonemic pronunciation lexicons exist in the new language. Afterwards, we investigate three approaches to dealing with languages that lack a pronunciation lexicon. Secondly, we have a look at the efficiency of graphemic acoustic model transfer, which makes it easy to build pronunciation dictionaries. Thesis problems can be solved, in part, by investigating optimization strategies for training on huge corpora (such as GA+HMM and DE+HMM). In the training phase of acoustic modelling, the suggested method is applied to traditional methods. Read speech and HMI voice experiments indicated that while each data augmentation strategy alone did not always increase recognition performance, using all three techniques together did. Power normalised cepstral coefficient (PNCC) features are tweaked somewhat in this work to enhance verification accuracy. To increase speaker verification accuracy, we suggest employing multiple “Gaussian Mixture Model-Universal Background Model (GMM-UBM) and SVM classifiers”. Importantly, pitch shift data augmentation and multi-task training reduced bias by more than 18% absolute compared to the baseline system for read speech, and applying all three data augmentation techniques during fine tuning reduced bias by more than 7% for HMI speech, while increasing recognition accuracy of both native and non-native Dutch speech

International Journal on Recent and Innovation Trends in Computing and Communication

Training Noise-Robust Spoken Phrase Detectors with Scarce and Private Data: An Application to Classroom Observation Videos

Author: Zylich Brian Matthew
Publication venue: Digital WPI
Publication date: 25/04/2019
Field of study

We explore how to automatically detect specific phrases in audio from noisy, multi-speaker videos using deep neural networks. Specifically, we focus on classroom observation videos that contain a few adult teachers and several small children (\u3c 5 years old). At any point in these videos, multiple people may be talking, shouting, crying, or singing simultaneously. Our goal is to recognize polite speech phrases such as Good job , Thank you , Please , and You\u27re welcome , as the occurrence of such speech is one of the behavioral markers used in classroom observation coding via the Classroom Assessment Scoring System (CLASS) protocol. Commercial speech recognition services such as Google Cloud Speech are impractical because of data privacy concerns. Therefore, we train and test our own custom models using a combination of publicly available classroom videos from YouTube, as well as a private dataset of real classroom observation videos collected by our colleagues at the University of Virginia. We also crowdsource an additional 1152 recordings of polite speech phrases to augment our training dataset. Our contributions are the following: (1) we design a crowdsourcing task for efficiently labeling speech events in classroom videos, (2) we develop a neural network-based architecture for speech recognition, robust to noise and overlapping speech, and (3) we explore methods to synthesize new and authentic audio data, both to increase the training set size and reduce the class imbalance. Finally, using our trained polite speech detector, (4) we investigate the relationship between polite speech and CLASS scores and enable teachers to visualize their use of polite language

DigitalCommons@WPI