Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking
Public speaking is an important aspect of human communication and
interaction. The majority of computational work on public speaking concentrates
on analyzing the spoken content, and the verbal behavior of the speakers. While
the success of public speaking largely depends on the content of the talk, and
the verbal behavior, non-verbal (visual) cues, such as gestures and physical
appearance also play a significant role. This paper investigates the importance
of visual cues by estimating their contribution towards predicting the
popularity of a public lecture. For this purpose, we constructed a large
database of more than TED talk videos. As a measure of popularity of the
TED talks, we leverage the corresponding (online) viewers' ratings from
YouTube. Visual cues related to facial and physical appearance, facial
expressions, and pose variations are extracted from the video frames using
convolutional neural network (CNN) models. Thereafter, an attention-based long
short-term memory (LSTM) network is proposed to predict the video popularity
from the sequence of visual features. The proposed network achieves
state-of-the-art prediction accuracy indicating that visual cues alone contain
highly predictive information about the popularity of a talk. Furthermore, our
network learns a human-like attention mechanism, which is particularly useful
for interpretability, i.e. how attention varies with time, and across different
visual cues by indicating their relative importance.
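The attention idea mentioned in this abstract can be illustrated with a minimal sketch of softmax attention pooling over a sequence of per-frame feature vectors. This is not the paper's network: the `attention_pool` function, the dot-product scoring, and the `query` vector are assumptions made for illustration only.

```python
import math

def attention_pool(features, query):
    """Softmax-attention pooling over time steps (illustrative, not the paper's model)."""
    # Score each time step by a dot product with a hypothetical query vector.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in features]
    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the per-frame feature vectors.
    dim = len(features[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, features))
              for d in range(dim)]
    return pooled, weights

# Three toy frames with two features each; the last frame scores highest.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, query=[1.0, 1.0])
```

The returned `weights` sum to one and can be inspected per time step, which is the sense in which attention weights support interpretability.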
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering. © 2007 Elsevier B.V. All rights reserved.
A Deep Sequential Model for Discourse Parsing on Multi-Party Dialogues
Discourse structures are beneficial for various NLP tasks such as dialogue
understanding, question answering, sentiment analysis, and so on. This paper
presents a deep sequential model for parsing discourse dependency structures of
multi-party dialogues. The proposed model aims to construct a discourse
dependency tree by predicting dependency relations and constructing the
discourse structure jointly and alternately. It makes a sequential scan of the
Elementary Discourse Units (EDUs) in a dialogue. For each EDU, the model
decides to which previous EDU the current one should link and what the
corresponding relation type is. The predicted link and relation type are then
used to build the discourse structure incrementally with a structured encoder.
During link prediction and relation classification, the model utilizes not only
local information that represents the concerned EDUs, but also global
information that encodes the EDU sequence and the discourse structure that is
already built at the current step. Experiments show that the proposed model
outperforms all the state-of-the-art baselines. Comment: Accepted to AAAI 201
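The sequential scan described in this abstract can be sketched as a greedy left-to-right loop: each EDU picks one earlier EDU as its parent, a relation type is assigned, and the new arc is added to the partial structure that later decisions can consult. This is a simplified sketch, not the paper's model: `parse_dialogue`, the word-overlap scorer, and the relation labels are hypothetical stand-ins for the learned components.

```python
from typing import Callable, List, Tuple

Arc = Tuple[int, int, str]  # (parent index, child index, relation label)

def parse_dialogue(
    edus: List[str],
    link_score: Callable[[str, str, List[Arc]], float],
    relation_of: Callable[[str, str, List[Arc]], str],
) -> List[Arc]:
    """Greedy sequential construction of discourse dependency arcs."""
    arcs: List[Arc] = []
    for child in range(1, len(edus)):
        # Choose the best-scoring previous EDU; the scorer also sees the
        # structure built so far (the "global" information).
        parent = max(range(child),
                     key=lambda p: link_score(edus[p], edus[child], arcs))
        relation = relation_of(edus[parent], edus[child], arcs)
        arcs.append((parent, child, relation))
    return arcs

# Toy stand-ins for the learned scorers (assumptions for illustration).
def overlap_score(parent: str, child: str, arcs: List[Arc]) -> float:
    return len(set(parent.lower().split()) & set(child.lower().split()))

def toy_relation(parent: str, child: str, arcs: List[Arc]) -> str:
    return "Question-answer_pair" if parent.endswith("?") else "Comment"

edus = ["is the build broken?", "yes it is", "yes it broke after merge"]
arcs = parse_dialogue(edus, overlap_score, toy_relation)
```

Because every EDU after the first receives exactly one parent among the earlier EDUs, the resulting arcs always form a tree rooted at the first EDU.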
Special Libraries, April 1954
Volume 45, Issue 4
Acoustic model adaptation for ortolan bunting (Emberiza hortulana L.) song-type classification
Automatic systems for vocalization classification often require fairly large amounts of data on which to train models. However, collecting and transcribing animal vocalization data is difficult and time-consuming, so large data sets are expensive to create. One natural solution to this problem is the use of acoustic adaptation methods. Such methods, common in human speech recognition systems, create initial models trained on speaker-independent data, then use small amounts of adaptation data to build individual-specific models. Since, as in human speech, individual vocal variability is a significant source of variation in bioacoustic data, acoustic model adaptation is naturally suited to classification in this domain as well. To demonstrate and evaluate the effectiveness of this approach, this paper presents the application of maximum likelihood linear regression adaptation to ortolan bunting (Emberiza hortulana L.) song-type classification. Classification accuracies for the adapted system are computed as a function of the amount of adaptation data and compared to caller-independent and caller-dependent systems. The experimental results indicate that, given the same amount of data, supervised adaptation significantly outperforms both caller-independent and caller-dependent systems.
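In maximum likelihood linear regression, the Gaussian means of the initial model are moved by a shared affine transform estimated from the adaptation data. The sketch below shows only the application step, mean' = A * mean + b, for a single mean vector. The transform values are made up for illustration, estimating A and b is not shown, and nothing here reflects the paper's actual configuration.

```python
def adapt_mean(mean, A, b):
    """Apply an affine MLLR-style transform to one Gaussian mean: A @ mean + b."""
    return [sum(a * m for a, m in zip(row, mean)) + offset
            for row, offset in zip(A, b)]

# Hypothetical 2-D caller-independent mean and an illustrative transform.
ci_mean = [1.0, 2.0]
A = [[1.0, 0.5],
     [0.0, 2.0]]
b = [0.5, -1.0]

adapted = adapt_mean(ci_mean, A, b)  # mean shifted toward the individual caller
```

Sharing one transform across many Gaussians is what lets a small amount of adaptation data update the whole model.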