
    An interactive and multi-level framework for summarising user generated videos

    We present an interactive, multi-level abstraction framework for user-generated video (UGV) summarisation that lets a user select a summarisation criterion from a number of methods provided by the system. First, a given raw video is segmented into shots, and each shot is further decomposed into sub-shots according to changes in dominant camera motion. Second, principal component analysis (PCA) is applied to the colour representation of the collection of sub-shots, and a content map is created from the first few components. Each sub-shot is represented by a "footprint" on the content map, which reveals its content significance (coverage) and its most dynamic segment. The final stage of abstraction is user-assisted: the user specifies a desired summary length, with options to interactively perform abstraction at different granularities of visual comprehension. The results show the potential to significantly alleviate the burden of laborious user intervention associated with conventional video editing and browsing.
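    A minimal sketch of the content-map idea in Python (the colour-histogram features, bin count, and function names are illustrative assumptions, not the authors' exact choices): each sub-shot is summarised by colour histograms of its frames, PCA projects the collection onto its first two components, and the projected frames form the sub-shot's "footprint" on the map.

```python
import numpy as np
from sklearn.decomposition import PCA

def colour_histogram(frame, bins=8):
    """RGB frame of shape (H, W, 3) -> flattened, normalised 3-D colour histogram."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def content_map(subshots, n_components=2):
    """subshots: list of frame lists. Returns one 2-D 'footprint' per sub-shot."""
    feats = [np.stack([colour_histogram(f) for f in frames]) for frames in subshots]
    pca = PCA(n_components=n_components).fit(np.vstack(feats))
    # each footprint is the cloud of a sub-shot's frames on the content map
    return [pca.transform(f) for f in feats]
```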

    Dublin City University at the TRECVid 2008 BBC rushes summarisation task

    We describe the video summarisation systems submitted by Dublin City University to the TRECVid 2008 BBC Rushes Summarisation task. We introduce a new approach to redundant video summarisation based on principal component analysis and linear discriminant analysis. The resulting low-dimensional representation of each shot offers a simple way to compare and select representative shots from the original video. The final summary is constructed as a dynamic storyboard. Both types of summaries were evaluated and the results are discussed.
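    A hedged sketch of how a PCA + LDA pipeline of this kind could yield comparable shot representations; treating shot indices as the LDA class labels and the greedy selection step are assumptions on my part, not necessarily the submitted system.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def shot_representations(frame_feats, shot_ids, pca_dim=50, lda_dim=10):
    """frame_feats: (n_frames, d) descriptors; shot_ids: shot label per frame."""
    x = PCA(n_components=pca_dim).fit_transform(frame_feats)
    # assumption: shots act as LDA classes, so the space discriminates shots
    x = LinearDiscriminantAnalysis(n_components=lda_dim).fit_transform(x, shot_ids)
    shots = np.unique(shot_ids)
    reps = np.stack([x[shot_ids == s].mean(axis=0) for s in shots])
    return shots, reps  # one low-dimensional vector per shot

def pick_representatives(reps, k):
    """Greedy farthest-point selection of k mutually dissimilar shots."""
    chosen = [int(np.argmax(np.linalg.norm(reps - reps.mean(0), axis=1)))]
    while len(chosen) < k:
        dist = np.linalg.norm(reps[:, None] - reps[chosen], axis=2).min(axis=1)
        chosen.append(int(np.argmax(dist)))
    return chosen
```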

    Towards a better integration of written names for unsupervised speakers identification in videos

    Existing methods for unsupervised identification of speakers in TV broadcast usually rely on the output of a speaker diarization module and try to name each cluster using names provided by another source of information: we call this "late naming". Written names extracted from title blocks tend to yield high-precision identification, but they cannot correct errors made during the clustering step. In this paper, we extend our previous "late naming" approach in two ways: "integrated naming" and "early naming". While "late naming" relies on a speaker diarization module optimized for speaker diarization, "integrated naming" jointly optimizes speaker diarization and name propagation in terms of identification errors. "Early naming" modifies the speaker diarization module by adding constraints that prevent two clusters with different written names from being merged. While "integrated naming" yields identification performance similar to "late naming" (with better precision), "early naming" improves over this baseline both in identification error rate and in the stability of the clustering stopping criterion.
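    The "early naming" constraint can be pictured as a guard inside agglomerative clustering: two clusters may only merge if their written names do not conflict. The sketch below is a simplified illustration under assumed data structures (segment sets, name sets, a similarity function), not the authors' implementation.

```python
def names_compatible(names_a, names_b):
    """Clusters may merge unless both carry written names and the names differ."""
    return not names_a or not names_b or bool(names_a & names_b)

def constrained_clustering(clusters, names, similarity, threshold):
    """clusters: list of segment sets; names: list of written-name sets per cluster."""
    while True:
        best, best_sim = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if not names_compatible(names[i], names[j]):
                    continue  # the early-naming constraint blocks this merge
                s = similarity(clusters[i], clusters[j])
                if s > best_sim:
                    best, best_sim = (i, j), s
        if best is None:
            return clusters, names
        i, j = best
        clusters[i] |= clusters.pop(j)
        names[i] |= names.pop(j)
```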

    LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization

    Neural network approaches have achieved considerable improvement in submodules of the speaker diarization pipeline, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage, traditional algorithms such as probabilistic linear discriminant analysis (PLDA) are widely used for scoring the similarity between two speech segments. In this paper, we propose a supervised method that measures the similarity matrix between all segments of an audio recording with sequential bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is applied on top of the similarity matrix to further improve the performance. Experimental results show that our system significantly outperforms the state-of-the-art methods and achieves a diarization error rate of 6.63% on the NIST SRE 2000 CALLHOME database.
    Comment: Accepted for INTERSPEECH 2019
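    The clustering stage can be sketched as follows, assuming the Bi-LSTM has already produced an n x n similarity matrix between segments (the Bi-LSTM scoring itself is omitted); scikit-learn's spectral clustering with a precomputed affinity stands in for the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_segments(sim, n_speakers):
    """sim: precomputed (n, n) affinity matrix; returns a speaker label per segment."""
    sim = (sim + sim.T) / 2                      # enforce symmetry
    np.fill_diagonal(sim, 1.0)                   # self-similarity
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed")
    return sc.fit_predict(sim)
```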

    An open-source voice type classifier for child-centered daylong recordings

    Spontaneous conversations in real-world settings, such as those found in child-centered recordings, are amongst the most challenging audio files to process. Nevertheless, building speech processing models that handle such a wide variety of conditions would be particularly useful for language acquisition studies, in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and for measuring the effects of remediation. In this paper, we present our approach to designing an open-source neural network that classifies audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora which sum to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
    Comment: Accepted to Interspeech 2020
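    An architectural sketch in PyTorch of the described design, with a plain Conv1d standing in for the learnable SincNet filterbank and illustrative layer sizes and class names; the real open-source model differs in detail.

```python
import torch
import torch.nn as nn

VOICE_TYPES = ["key_child", "other_child", "male_adult", "female_adult"]

class VoiceTypeClassifier(nn.Module):
    def __init__(self, n_filters=80, hidden=128, n_layers=2):
        super().__init__()
        # assumption: an ordinary Conv1d stands in for the SincNet filterbank
        self.frontend = nn.Conv1d(1, n_filters, kernel_size=251, stride=80)
        self.rnn = nn.LSTM(n_filters, hidden, num_layers=n_layers,
                           bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, len(VOICE_TYPES))

    def forward(self, waveform):             # waveform: (batch, 1, samples)
        x = torch.relu(self.frontend(waveform)).transpose(1, 2)  # (B, T, F)
        x, _ = self.rnn(x)                   # stack of recurrent layers
        return torch.sigmoid(self.head(x))   # multi-label score per frame
```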

    Collaborative Annotation for Person Identification in TV Shows

    This paper presents a collaborative annotation framework for person identification in TV shows. The web annotation front-end will be demonstrated during the Show and Tell session. All the annotation code is available on GitHub. The tool can also be used in a crowd-sourcing environment.

    Hierarchical Late Fusion for Concept Detection in Videos

    We address the problem of combining dozens of classifiers into a better one for concept detection in videos. We compare three fusion approaches that share a common structure: each starts with a classifier clustering stage, continues with an intra-cluster fusion, and ends with an inter-cluster fusion. The main difference between them lies in the first stage. The first approach relies on a priori knowledge about the internals of each classifier (low-level descriptors and classification algorithm) to group the available classifiers by similarity. The second and third approaches obtain classifier similarity measures directly from the classifiers' outputs and group them using agglomerative clustering (second approach) or community detection (third approach).
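    The shared three-stage structure can be sketched as follows, using output correlation for classifier similarity (in the spirit of the second approach) and simple averaging for both fusion stages; the clustering and weighting choices here are illustrative assumptions, not the paper's tuned system.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def hierarchical_late_fusion(scores, n_groups=5):
    """scores: (n_classifiers, n_samples) detection scores for one concept."""
    corr = np.corrcoef(scores)              # classifier similarity from outputs
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    groups = fcluster(z, t=n_groups, criterion="maxclust")
    # intra-cluster fusion: average the scores inside each classifier group
    intra = np.stack([scores[groups == g].mean(axis=0) for g in np.unique(groups)])
    # inter-cluster fusion: average the group-level scores
    return intra.mean(axis=0)
```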