11 research outputs found
Recommended from our members
Detecting Levels of Interest from Spoken Dialog with Multistream Prediction Feedback and Similarity Based Hierarchical Fusion Learning
Detecting levels of interest from speakers is a new problem in Spoken Dialog Understanding with significant impact on real world business applications. Previous work has focused on the analysis of traditional acoustic signals and shallow lexical features. In this paper, we present a novel hierarchical fusion learning model that takes feedback from previous multistream predictions of prominent seed samples into account and uses a mean cosine similarity measure to learn rules that improve reclassification. Our method is domain-independent and can be adapted to
other speech and language processing areas where domain adaptation is expensive to perform. Incorporating Discriminative Term Frequency
and Inverse Document Frequency (DTFIDF), lexical affect scoring, and low and high level prosodic and acoustic features, our experiments outperform the published results of all systems participating in the 2010 Interspeech Paralinguistic Affect Subchallenge
When Does Disengagement Correlate with Performance in Spoken Dialog Computer Tutoring?
In this paper we investigate how student disengagement relates to two performance metrics in a spoken dialog computer tutoring corpus, both when disengagement is measured through manual annotation by a trained human judge, and also when disengagement is measured through automatic annotation by the system based on a machine learning model. First, we investigate whether manually labeled overall disengagement and six different disengagement types are predictive of learning and user satisfaction in the corpus. Our results show that although students’ percentage of overall disengaged turns negatively correlates both with the amount they learn and their user satisfaction, the individual types of disengagement correlate differently: some negatively correlate with learning and user satisfaction, while others don’t correlate with eithermetric at all. Moreover, these relationships change somewhat depending on student prerequisite knowledge level. Furthermore, using multiple disengagement types to predict learning improves predictive power. Overall, these manual label-based results suggest that although adapting to disengagement should improve both student learning and user satisfaction in computer tutoring, maximizing performance requires the system to detect and respond differently based on disengagement type. Next, we present an approach to automatically detecting and responding to user disengagement types based on their differing correlations with correctness. Investigation of ourmachine learningmodel of user disengagement shows that its automatic labels negatively correlate with both performance metrics in the same way as the manual labels. The similarity of the correlations across the manual and automatic labels suggests that the automatic labels are a reasonable substitute for the manual labels. Moreover, the significant negative correlations themselves suggest that redesigning ITSPOKE to automatically detect and respond to disengagement has the potential to remediate disengagement and thereby improve performance, even in the presence of noise introduced by the automatic detection process
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty
communicative events, such as meetings, is an ambitious research area. We
have investigated automatic meeting segmentation both in terms of “Meeting Actions”
and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine
grained level highlighting individual speaker intentions. Group meeting actions describe
the same process at a coarse level, highlighting interactions between different
meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian
networks (DBNs) has been investigated for both tasks. Our first set of experiments
is concerned with the segmentation and structuring of meetings (recorded using
multiple cameras and microphones) into sequences of group meeting actions such
as monologue, discussion and presentation. We outline four families of multimodal
features based on speaker turns, lexical transcription, prosody, and visual motion
that are extracted from the raw audio and video recordings. We relate these lowlevel
multimodal features to complex group behaviours proposing a multistreammodelling
framework based on dynamic Bayesian networks. Later experiments are
concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty
conversational speech. We present a joint generative approach based on a switching
DBN for DA recognition in which segmentation and classification of DAs are
carried out in parallel. This approach models a set of features, related to lexical
content and prosody, and incorporates a weighted interpolated factored language
model. In conjunction with this joint generative model, we have also investigated
the use of a discriminative approach, based on conditional random fields, to perform
a reclassification of the segmented DAs.
The DBN based approach yielded significant improvements when applied both
to the meeting action and the dialogue act recognition task. On both tasks, the DBN
framework provided an effective factorisation of the state-space and a flexible infrastructure
able to integrate a heterogeneous set of resources such as continuous
and discrete multimodal features, and statistical language models. Although our
experiments have been principally targeted on multiparty meetings; features, models,
and methodologies developed in this thesis can be employed for a wide range
of applications. Moreover both group meeting actions and DAs offer valuable insights about the current conversational context providing valuable cues and features
for several related research areas such as speaker addressing and focus of attention
modelling, automatic speech recognition and understanding, topic and decision detection
マルチモーダル音声対話システムでの先進的コミュニケーションのためのユーザ状態推定
Tohoku University伊藤彰則課
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event, bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratories activities, recent PhD thesis, discussion panels, a round table, and awards to the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions that will be presented distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, Czech Republic, Ucrania, Slovenia). Furthermore, it is confirmed to publish an extension of selected papers as a special issue of the Journal of Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with fully open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the following activities: the ALBAYZIN evaluation challenge session.Red Española de Tecnologías del Habla. Universidad de Valladoli
A Statistical Approach to the Alignment of fMRI Data
Multi-subject functional Magnetic Resonance Image studies are critical. The anatomical and functional structure varies across subjects, so the image alignment is necessary. We define a probabilistic model to describe functional alignment. Imposing a prior distribution, as the matrix Fisher Von Mises distribution, of the orthogonal transformation parameter, the anatomical information is embedded in the estimation of the parameters, i.e., penalizing the combination of spatially distant voxels. Real applications show an improvement in the classification and interpretability of the results compared to various functional alignment methods
A comparison of the CAR and DAGAR spatial random effects models with an application to diabetics rate estimation in Belgium
When hierarchically modelling an epidemiological phenomenon on a finite collection of sites in space, one must always take a latent spatial effect into account in order to capture the correlation structure that links the phenomenon to the territory. In this work, we compare two autoregressive spatial models that can be used for this purpose: the classical CAR model and the more recent DAGAR model. Differently from the former, the latter has a desirable property: its ρ parameter can be naturally interpreted as the average neighbor pair correlation and, in addition, this parameter can be directly estimated when the effect is modelled using a DAGAR rather than a CAR structure. As an application, we model the diabetics rate in Belgium in 2014 and show the adequacy of these models in predicting the response variable when no covariates are available