An interactive and multi-level framework for summarising user generated videos
We present an interactive and multi-level abstraction framework for user-generated video (UGV) summarisation, allowing the user to select a summarisation criterion from a number of methods provided by the system. First, a given raw video is segmented into shots, and each shot is further decomposed into sub-shots according to changes in the dominant camera motion. Secondly, principal component analysis (PCA) is applied to the colour representation of the collection of sub-shots, and a content map is created using the first few components. Each sub-shot is represented by a "footprint" on the content map, which reveals its content significance (coverage) and its most dynamic segment. The final stage of abstraction is devised in a user-assisted manner, whereby the user specifies a desired summary length, with options to interactively perform abstraction at different granularities of visual comprehension. The results show that the framework can significantly alleviate the laborious user intervention associated with conventional video editing and browsing.
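As a loose illustration of the content-map idea (our sketch, not the authors' code), the snippet below projects per-sub-shot colour histograms onto the first two principal components; the histogram representation and all names here are assumptions.

```python
# Hypothetical sketch: build a 2-D "content map" by projecting per-sub-shot
# colour histograms onto the first two principal components.
import numpy as np
from sklearn.decomposition import PCA

def content_map(subshot_histograms):
    """subshot_histograms: (n_subshots, n_bins) colour histograms."""
    pca = PCA(n_components=2)
    coords = pca.fit_transform(subshot_histograms)  # one 2-D point per sub-shot
    return coords, pca.explained_variance_ratio_

# Toy usage: 50 sub-shots described by 64-bin colour histograms.
rng = np.random.default_rng(0)
coords, var = content_map(rng.random((50, 64)))
print(coords.shape, var)  # (50, 2) and the variance captured per component
```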
Dublin City University at the TRECVid 2008 BBC rushes summarisation task
We describe the video summarisation systems submitted by Dublin City University to the TRECVid 2008 BBC Rushes Summarisation task. We introduce a new approach to redundant video summarisation based on principal component analysis and linear discriminant analysis. The resulting low-dimensional representation of each shot offers a simple way to compare and select representative shots of the original video. The final summary is constructed as a dynamic storyboard. Both types of summaries were evaluated and the results are discussed.
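A minimal sketch of the PCA+LDA representation described above, assuming frames are labelled by the shot they belong to; the feature choice, dimensionalities, and centroid-based comparison are illustrative assumptions, not DCU's implementation.

```python
# Hypothetical sketch: compress frame features with PCA, then use LDA (with
# shot labels as classes) to find a space that separates shots; shot centroids
# in that space can then be compared to pick representative shots.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def shot_centroids(frame_feats, shot_labels, n_pca=50, n_lda=5):
    """Return one low-dimensional centroid per shot."""
    X = PCA(n_components=n_pca).fit_transform(frame_feats)
    Z = LinearDiscriminantAnalysis(n_components=n_lda).fit_transform(X, shot_labels)
    shots = np.unique(shot_labels)
    return shots, np.stack([Z[shot_labels == s].mean(axis=0) for s in shots])

# Toy usage: 200 frames with 128-dim features spread over 10 shots; shots whose
# centroids lie far apart in the LDA space are natural summary candidates.
rng = np.random.default_rng(0)
shots, cents = shot_centroids(rng.random((200, 128)), rng.integers(0, 10, 200))
print(cents.shape)  # (10, 5)
```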
Towards a better integration of written names for unsupervised speakers identification in videos
Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call this "late naming". Written names extracted from title blocks tend to yield high-precision identification, although they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged. While "integrated naming" yields identification performance similar to "late naming" (with better precision), "early naming" improves over this baseline both in identification error rate and in the stability of the clustering stopping criterion.
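To make the "early naming" constraint concrete, here is a minimal sketch (our assumption, not the paper's code) of greedy single-linkage merging that refuses to merge two clusters carrying different written names:

```python
# Hypothetical sketch of constrained agglomerative merging: clusters that
# already carry two different written names are never merged together.
import numpy as np

def constrained_merge(dist, names, threshold):
    """dist: symmetric (n, n) distances between initial clusters;
    names: written name attached to each initial cluster, or None."""
    clusters = [{i} for i in range(len(names))]
    cluster_names = [{names[i]} - {None} for i in range(len(names))]
    while True:
        best, pair = threshold, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # the "early naming" constraint: never mix two different names
                if cluster_names[a] and cluster_names[b] and cluster_names[a] != cluster_names[b]:
                    continue
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, pair = d, (a, b)
        if pair is None:  # no mergeable pair below the distance threshold
            return clusters, cluster_names
        a, b = pair
        clusters[a] |= clusters.pop(b)
        cluster_names[a] |= cluster_names.pop(b)

# Toy usage: clusters 0 and 1 may merge; "Bob" can never join "Alice".
dist = np.array([[0.0, 0.1, 0.9], [0.1, 0.0, 0.8], [0.9, 0.8, 0.0]])
print(constrained_merge(dist, ["Alice", None, "Bob"], threshold=0.5))
```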
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
Neural network approaches have achieved considerable improvements in the submodules of speaker diarization systems, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms such as probabilistic linear discriminant analysis (PLDA) are widely used to score the similarity between two speech segments. In this paper, we propose a supervised method that measures the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve performance. Experimental results show that our system significantly outperforms state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.
Comment: Accepted for INTERSPEECH 201
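The Bi-LSTM scoring itself is beyond a short sketch, but the final step is plain spectral clustering over a precomputed similarity matrix, which scikit-learn supports directly; the matrix below is random stand-in data, not Bi-LSTM output.

```python
# Sketch of the clustering stage only: spectral clustering over a precomputed
# segment-by-segment similarity matrix (here random; in the paper, Bi-LSTM scores).
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
sim = rng.random((20, 20))
sim = (sim + sim.T) / 2      # symmetrise the matrix
np.fill_diagonal(sim, 1.0)   # each segment is fully similar to itself

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(sim)
print(labels)  # one speaker index per segment
```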
An open-source voice type classifier for child-centered daylong recordings
Spontaneous conversations in real-world settings, such as those found in child-centered recordings, have been shown to be among the most challenging audio files to process. Nevertheless, building speech processing models that handle such a wide variety of conditions would be particularly useful for language acquisition studies, in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring the effects of remediation. In this paper, we present our approach to designing an open-source neural network that classifies audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora which sum to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
Comment: accepted to Interspeech 202
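A skeletal PyTorch model in the spirit of the described architecture; an ordinary Conv1d stands in for the SincNet filterbank, and all layer sizes are illustrative assumptions rather than the paper's configuration.

```python
# Hypothetical skeleton: learnable filterbank over raw waveform (a plain Conv1d
# standing in for SincNet), recurrent layers, and per-frame class scores.
import torch
import torch.nn as nn

class VoiceTypeClassifier(nn.Module):
    def __init__(self, n_classes=4):  # key child, other child, adult male, adult female
        super().__init__()
        self.filterbank = nn.Conv1d(1, 80, kernel_size=251, stride=10)
        self.pool = nn.MaxPool1d(3)
        self.rnn = nn.LSTM(80, 128, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, n_classes)

    def forward(self, wav):                            # wav: (batch, samples)
        x = self.filterbank(wav.unsqueeze(1))          # (batch, 80, frames)
        x = self.pool(torch.relu(x)).transpose(1, 2)   # (batch, frames, 80)
        x, _ = self.rnn(x)
        return torch.sigmoid(self.head(x))             # (batch, frames, n_classes)

model = VoiceTypeClassifier()
scores = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(scores.shape)
```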
Collaborative Annotation for Person Identification in TV Shows
This paper presents a collaborative annotation framework for person identification in TV shows. The web annotation front-end will be demonstrated during the Show and Tell session. All the annotation code is available on GitHub. The tool can also be used in a crowd-sourcing environment.
Hierarchical Late Fusion for Concept Detection in Videos
We deal with the issue of combining dozens of classifiers into a better one for concept detection in videos. We compare three fusion approaches that share a common structure: they all start with a classifier clustering stage, continue with an intra-cluster fusion, and end with an inter-cluster fusion. The main difference between them lies in the first stage. The first approach relies on a priori knowledge about the internals of each classifier (low-level descriptors and classification algorithm) to group the set of available classifiers by similarity. The second and third approaches obtain classifier similarity measures directly from their outputs and group them using agglomerative clustering in the second approach and community detection in the third.
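A hedged sketch of the shared three-stage structure: group classifiers by the correlation of their output scores, average within each group (intra-cluster fusion), then average the group scores (inter-cluster fusion). Simple averaging and correlation-based distances are illustrative choices, not the paper's exact operators.

```python
# Hypothetical sketch: cluster classifiers by output correlation, then fuse
# scores within each cluster and across clusters by averaging.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def hierarchical_late_fusion(scores, n_groups=3):
    """scores: (n_classifiers, n_samples) detection scores for one concept."""
    dist = 1.0 - np.corrcoef(scores)                     # output-based dissimilarity
    Z = linkage(squareform(dist, checks=False), method="average")
    groups = fcluster(Z, t=n_groups, criterion="maxclust")
    intra = [scores[groups == g].mean(axis=0) for g in np.unique(groups)]
    return np.mean(intra, axis=0)                        # inter-cluster fusion

rng = np.random.default_rng(0)
fused = hierarchical_late_fusion(rng.random((12, 100)))
print(fused.shape)  # (100,): one fused score per sample
```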
- …