6 research outputs found
Utilization of multimodal interaction signals for automatic summarisation of academic presentations
Multimedia archives are expanding rapidly. For these, there exists a shortage of retrieval and summarisation techniques for accessing and browsing content where the main information exists in the audio stream. This thesis describes an investigation into the development of novel feature extraction and summarisation techniques for audio-visual recordings of academic presentations.
We report on the development of a multimodal dataset of academic presentations. This dataset is labelled by human annotators to the concepts of presentation ratings, audience engagement levels, speaker emphasis, and audience comprehension. We investigate the automatic classification of speaker ratings and audience engagement by extracting audio-visual features from video of the presenter and audience and training classifiers to predict speaker ratings and engagement levels. Following this, we investigate automatic identification of areas of emphasised speech. By analysing all human annotated areas of emphasised speech, minimum speech pitch and gesticulation are identified as indicating emphasised speech when occurring together.
Investigations are conducted into the speaker's potential to be comprehended by the audience. Following crowdsourced annotation of comprehension levels during academic presentations, a set of audio-visual features considered most likely to affect comprehension levels are extracted. Classifiers are trained on these features and comprehension levels could be predicted over a 7-class scale to an accuracy of 49%, and over a binary distribution to an accuracy of 85%.
Presentation summaries are built by segmenting speech transcripts into phrases, and using keywords extracted from the transcripts in conjunction with extracted paralinguistic features. Highest ranking segments are then extracted to build presentation summaries. Summaries are evaluated by performing eye-tracking experiments as participants watch presentation videos. Participants were found to be consistently more engaged for presentation summaries than for full presentations. Summaries were also found to contain a higher concentration of new information than full presentations.
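The segment-ranking step described in this abstract can be illustrated with a minimal sketch. It is not the thesis's actual method: the `Segment` class, the single `emphasis` score standing in for the paralinguistic features, the keyword-density measure, and the equal weighting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    emphasis: float  # hypothetical paralinguistic score in [0, 1]


def keyword_score(text, keywords):
    """Fraction of the segment's words that appear in the keyword set."""
    words = [w.strip(".,;:!?") for w in text.lower().split()]
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)


def summarise(segments, keywords, top_k=2, weight=0.5):
    """Pick the top_k segments by combined score, returned in original order.

    The combined score is an assumed linear mix of keyword density and the
    paralinguistic emphasis score.
    """
    scored = [
        (weight * keyword_score(s.text, keywords) + (1 - weight) * s.emphasis, i)
        for i, s in enumerate(segments)
    ]
    # Highest score first; ties broken by earlier position in the transcript.
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_k]
    return [segments[i] for i in sorted(i for _, i in best)]


segments = [
    Segment("We introduce the dataset", 0.2),
    Segment("Thanks for coming today", 0.1),
    Segment("The classifier predicts engagement", 0.9),
]
keywords = {"dataset", "classifier", "engagement"}
summary = summarise(segments, keywords, top_k=2)
```

Returning the selected segments in transcript order, rather than score order, keeps the summary readable as a shortened version of the talk.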
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures adopted in which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them. Comment: Published in Journal of AI Research (JAIR), volume 61, pp. 75-170. 118
pages, 8 figures, 1 table
Meeting decision detection: multimodal information fusion for multi-party dialogue understanding
Modern advances in multimedia and storage technologies have led to huge archives
of human conversations in widely ranging areas. These archives offer a wealth of information
in the organization contexts. However, retrieving and managing information
in these archives is a time-consuming and labor-intensive task. Previous research applied
keyword and computer vision-based methods to do this. However, spontaneous
conversations, complex in the use of multimodal cues and intricate in the interactions
between multiple speakers, have posed new challenges to these methods. We need
new techniques that can leverage the information hidden in multiple communication
modalities, including not just "what" the speakers say but also "how" they express
themselves and interact with others.
In responding to this need, the thesis inquires into the multimodal nature of meeting
dialogues and computational means to retrieve and manage the recorded meeting
information. In particular, this thesis develops the Meeting Decision Detector (MDD)
to detect and track decisions, one of the most important outcomes of the meetings.
The MDD involves not only the generation of extractive summaries pertaining to the
decisions ("decision detection"), but also the organization of a continuous stream of
meeting speech into locally coherent segments ("discourse segmentation").
This inquiry starts with a corpus analysis which constitutes a comprehensive empirical
study of the decision-indicative and segment-signalling cues in the meeting
corpora. These cues are uncovered from a variety of communication modalities, including
the words spoken, gesture and head movements, pitch and energy level, rate
of speech, pauses, and use of subjective terms. While some of the cues match the
previous findings of speech segmentation, some others have not been studied before.
The analysis also provides empirical grounding for computing features and integrating
them into a computational model. To handle the high-dimensional multimodal
feature space in the meeting domain, this thesis compares empirically feature discriminability
and feature pattern finding criteria. As the different knowledge sources are
expected to capture different types of features, the thesis also experiments with methods
that can harness synergy between the multiple knowledge sources.
The problem formalization and the modeling algorithm so far correspond to an
optimal setting: an off-line, post-meeting analysis scenario. However, ultimately the
MDD is expected to be operated online: right after a meeting, or when a meeting
is still in progress. Thus this thesis also explores techniques that help relax the optimal
setting, especially those using only features that can be generated with a higher
degree of automation. Empirically motivated experiments are designed to handle the
corresponding performance degradation.
Finally, with the users in mind, this thesis evaluates the use of query-focused summaries
in a decision debriefing task, which is common in the organization context. The
decision-focused extracts (which represent compressions of 1%) are compared against
the general-purpose extractive summaries (which represent compressions of 10-40%).
To examine the effect of model automation on the debriefing task, this evaluation experiments
with three versions of decision-focused extracts, each relaxing one manual
annotation constraint. Task performance is measured in actual task effectiveness, user-generated
report quality, and user-perceived success. The users' clicking behaviors are
also recorded and analyzed to understand how the users leverage the different versions
of extractive summaries to produce abstractive summaries.
The analysis framework and computational means developed in this work are expected
to be useful for the creation of other dialogue understanding applications, especially
those that need to uncover the implicit semantics of meeting dialogues.
VOCAL BIOMARKERS OF CLINICAL DEPRESSION: WORKING TOWARDS AN INTEGRATED MODEL OF DEPRESSION AND SPEECH
Speech output has long been considered a sensitive marker of a person's mental state. It has been previously examined as a possible biomarker for diagnosis and treatment response for certain mental health conditions, including clinical depression. To date, it has been difficult to draw robust conclusions from past results due to diversity in samples, speech material, investigated parameters, and analytical methods.
Within this exploratory study of speech in clinically depressed individuals, articulatory and phonatory behaviours are examined in relation to psychomotor symptom profiles and overall symptom severity. A systematic review provided context from the existing body of knowledge on the effects of depression on speech and informed the experimental setup within this body of work. Examinations of vowel space, monophthong, and diphthong productions as well as a multivariate acoustic analysis of other speech parameters (e.g., F0 range, perturbation measures, composite measures) are undertaken with the goal of creating a working model of the effects of depression on speech. Initial results demonstrate that overall vowel space area was not different between depressed and healthy speakers, but on closer inspection, this was due to more specific deficits seen in depressed patients along the first formant (F1) axis. Speakers with depression were more likely to produce centralised vowels along F1, as compared to F2, and this was more pronounced for low-front vowels, which are more complex given the degree of tongue-jaw coupling required for production. This pattern was seen in both monophthong and diphthong productions. Other articulatory and phonatory measures were inspected in a factor analysis as well, suggesting additional vocal biomarkers for consideration in diagnosis and treatment assessment of depression, including aperiodicity measures (e.g., higher shimmer and jitter), changes in spectral slope and tilt, and additive noise measures such as increased harmonics-to-noise ratio. Intonation was also affected by diagnostic status, but only for specific speech tasks. These results suggest that laryngeal and articulatory control is reduced by depression.
Findings support the clinical utility of combining Ellgring and Scherer's (1996) psychomotor retardation and social-emotional hypotheses to explain the effects of depression on speech, which suggest observed changes are due to a combination of cognitive, psycho-physiological and motoric mechanisms. Ultimately, depressive speech can be modelled along a continuum of hypo- to hyper-speech, where depressed individuals are able to assess communicative situations, assess speech requirements, and then engage in the minimum amount of motoric output necessary to convey their message. As speakers fluctuate with depressive symptoms throughout the course of their disorder, they move along the hypo-hyper-speech continuum and their speech is impacted accordingly.
Recommendations for future clinical investigations of the effects of depression on speech are also presented, including suggestions for recording and reporting standards. Results contribute towards cross-disciplinary research into speech analysis between the fields of psychiatry, computer science, and speech science.
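The abstract above compares overall vowel space area between speaker groups but does not say how the area is computed. One common approach, assumed here, treats the mean (F1, F2) values of the corner vowels as polygon vertices and applies the shoelace formula. The function name and the formant values below are illustrative only and are not taken from the study.

```python
def vowel_space_area(vertices):
    """Area of the polygon whose vertices are (F1, F2) pairs in Hz.

    Vertices must be listed in order around the polygon (either direction).
    Uses the shoelace formula; the result is in Hz squared.
    """
    n = len(vertices)
    total = 0.0
    for i in range(n):
        f1_a, f2_a = vertices[i]
        f1_b, f2_b = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += f1_a * f2_b - f1_b * f2_a
    return abs(total) / 2.0


# Hypothetical corner-vowel means as (F1, F2) in Hz, for illustration only.
corner_vowels = [(300, 2300), (750, 1300), (300, 800)]
area = vowel_space_area(corner_vowels)
```

Because the area collapses both formant dimensions into one number, compression along F1 alone can be masked by the F2 extent, which is consistent with the abstract's observation that the overall area did not differ between groups while F1 centralisation did.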
Instilling reflective practice: The use of an online portfolio in innovative optometric education. Accepted as: e-poster, Paper no. 098
At UCLAN we are breaking the mould and have developed a blended learning MSci optometry programme which is the first blended learning course in optometric education in the UK and the first to use a practice-based online portfolio.
Optometry has traditionally been taught as a 3-year undergraduate programme. Upon successful graduation, students are required to complete a year in practice and meet the General Optical Council's (GOC) "ability to" core competencies. However, a recent study by the GOC found that 76% of students felt unprepared for professional practice with insufficient clinical experience and in response, the GOC is currently undertaking an educational strategic review.
To ensure the students receive high-quality clinical experience in the workplace, we have developed an online logbook and portfolio. Students log their experiences, learning points and reflections. The portfolio is closely monitored both by the student's mentor in practice and by academic staff.
The content and reflections logged by the students then help to drive the face-to-face teaching, small group discussions and clinical experiences provided by the university.