
    Unsupervised naming of speakers in broadcast TV: using written names, pronounced names or both?

    Person identification in TV broadcast video is a valuable tool for indexing. However, the use of biometric models is not a very sustainable option without a priori knowledge of the people present in the videos. Pronounced names (PN) or names written (WN) on the screen can provide hypothesis names for speakers. We propose an experimental comparison of the potential of these two modalities (pronounced or written names) to extract the true names of the speakers. Pronounced names offer many citation instances, but transcription and named-entity detection errors halve the potential of this modality. By contrast, written-name detection benefits from the improvement in video quality and is nowadays rather robust and efficient for naming speakers. Oracle experiments presented for the mapping between written names and speakers also show the complementarity of the PN and WN modalities
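
The comparison described above can be illustrated with a toy sketch: for each speech segment, check whether the written-name (WN) or pronounced-name (PN) hypothesis list contains the true speaker. The segment names, hypothesis lists, and error pattern below are invented for illustration and are not the paper's data.

```python
# Toy comparison (invented data) of two naming sources per speech segment:
# written names (precise but sparser) vs pronounced names (frequent but
# degraded by ASR / named-entity detection errors).

def coverage(hypotheses, truth):
    """Fraction of segments whose hypothesis list contains the true name."""
    hits = sum(truth[seg] in names for seg, names in hypotheses.items())
    return hits / len(truth)

truth = {"seg1": "Alice", "seg2": "Bob", "seg3": "Carol"}
# WN: high precision, but no title block appears during seg3.
wn = {"seg1": {"Alice"}, "seg2": {"Bob"}, "seg3": set()}
# PN: names cited often, but transcription errors corrupt two of them.
pn = {"seg1": {"Alice", "Dave"}, "seg2": {"Alicia"}, "seg3": {"Karol"}}

print(round(coverage(wn, truth), 2), round(coverage(pn, truth), 2))  # 0.67 0.33
```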

    Active Selection with Label Propagation for Minimizing Human Effort in Speaker Annotation of TV Shows

    In this paper, an approach minimizing human involvement in the manual annotation of speakers is presented. At each iteration, a selection strategy chooses the most suitable speech track for manual annotation, which is then associated with all the tracks in the cluster that contains it. The study makes use of a system that propagates speaker track labels, using agglomerative clustering with constraints. Several different unsupervised active-learning selection strategies are evaluated. Additionally, the presented approach can be used to efficiently generate sets of speech tracks for training biometric models; in this case, both the length of the speech track for a given person and its purity are taken into consideration. To evaluate the system, the REPERE video corpus was used. Along with the speech tracks extracted from the videos, an optical character recognition system was adapted to extract the names of potential speakers, which then served as the 'cold start' for the selection method
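
The select-then-propagate loop can be sketched as follows. This is a minimal illustration of the idea, not the authors' system: the "longest track first" strategy, the track ids, durations, and the pre-computed cluster assignments are all assumptions.

```python
# Minimal sketch of active selection with label propagation: at each
# iteration, pick the longest still-unlabeled speech track, ask the human
# "oracle" for its label, and propagate that label to every track in the
# same cluster, so most tracks are labeled without a manual query.

def annotate_with_propagation(tracks, clusters, oracle):
    """tracks: {track_id: duration}; clusters: {track_id: cluster_id};
    oracle: callable simulating the human annotator."""
    labels = {}
    queries = 0
    while len(labels) < len(tracks):
        # Selection strategy (assumed): longest unlabeled track first.
        tid = max((t for t in tracks if t not in labels), key=tracks.get)
        name = oracle(tid)              # one manual annotation
        queries += 1
        cid = clusters[tid]
        for other, c in clusters.items():   # propagate within the cluster
            if c == cid:
                labels[other] = name
    return labels, queries

truth = {"t1": "Alice", "t2": "Alice", "t3": "Bob"}
clusters = {"t1": 0, "t2": 0, "t3": 1}
durations = {"t1": 12.0, "t2": 3.5, "t3": 7.2}
labels, queries = annotate_with_propagation(durations, clusters, truth.get)
print(labels, queries)  # three tracks labeled with only two manual queries
```

The manual effort saved grows with cluster purity: one query labels an entire cluster, which is why the selection strategy matters.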

    An Effective Technique for Removal of Facial Duplication by SBFA

    Search-based face annotation (SBFA) is an effective technique for annotating the weakly labeled facial images that are freely available on the World Wide Web. The main objective of search-based face annotation is to assign correct name labels to a given query facial image. One difficult problem for the search-based face annotation scheme is how to effectively perform annotation by exploiting the list of most similar facial images and their weak labels, which are typically noisy and incomplete. To tackle this problem, we propose an effective unsupervised label refinement (ULR) approach for refining the labels of web facial images using machine learning techniques. We formulate the learning problem as a convex optimization and develop effective optimization algorithms to solve the large-scale learning task efficiently. To further speed up the proposed scheme, we also propose a clustering-based approximation algorithm, which can improve scalability considerably. We have conducted an extensive set of empirical studies on a large-scale web facial image test bed, in which encouraging results showed that the proposed ULR algorithms can significantly boost the performance of the promising SBFA scheme. In future work we will use the HAAR algorithm, a feature-based method for face detection; HAAR features, integral images, and fast feature detection improve face detection in terms of both speed and accuracy. DOI: 10.17762/ijritcc2321-8169.150517
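
The core intuition behind label refinement can be sketched with a simple graph-smoothing iteration: noisy label scores are blended with those of similar faces until similar images carry similar labels. The similarity matrix, initial labels, and mixing weight below are toy assumptions; the paper formulates this as a convex optimization rather than the fixed-point iteration shown here.

```python
# Hedged sketch of label refinement over a face-similarity graph: each
# image repeatedly blends its own (noisy) label scores with its
# neighbours', so an unlabeled face inherits the label of similar faces.
import numpy as np

def refine_labels(W, Y0, alpha=0.5, iters=50):
    """W: (n, n) symmetric similarity; Y0: (n, k) initial noisy label scores."""
    D = W.sum(axis=1, keepdims=True)
    S = W / np.where(D == 0, 1, D)        # row-normalized transition matrix
    Y = Y0.astype(float).copy()
    for _ in range(iters):
        # Blend neighbour labels with the original weak labels.
        Y = alpha * S @ Y + (1 - alpha) * Y0
    return Y

# Four faces, two name labels; faces 2 and 3 are unlabeled but each is
# similar to exactly one labeled face.
W = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], float)
Y0 = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], float)
Y = refine_labels(W, Y0)
print(Y.argmax(axis=1))  # faces 2 and 3 inherit labels 0 and 1 respectively
```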

    Unsupervised Speaker Identification in TV Broadcast Based on Written Names

    Identifying speakers in TV broadcast in an unsupervised way (i.e. without biometric models) is a solution for avoiding costly annotations. Existing methods usually use pronounced names as a source of names for identifying the speech clusters provided by a diarization step, but this source is too imprecise to be sufficiently reliable. To overcome this issue, another source of names can be used: the names written in a title block in the image track. We first compared these two sources of names on their ability to provide the names of the speakers in TV broadcast. This study shows that written names are more interesting to use, thanks to their high precision for identifying the current speaker. We also propose two approaches for finding speaker identity based only on names written in the image track. With the "late naming" approach, we propose different propagations of written names onto clusters. Our second proposition, "early naming", modifies the speaker diarization module (agglomerative clustering) by adding constraints that prevent two clusters with different associated written names from being merged. These methods were tested on the REPERE corpus phase 1, containing 3 hours of annotated videos. Our best "late naming" system reaches an F-measure of 73.1%; "early naming" improves over this result both in terms of identification error rate and of stability of the clustering stopping criterion. By comparison, a mono-modal, supervised speaker identification system with 535 speaker models trained on matching development data and additional TV and radio data only reached a 57.2% F-measure
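
The "early naming" constraint can be sketched as agglomerative clustering where two clusters may not merge if they carry different written names. Everything below (turn ids, the toy similarity function, the stopping threshold) is an illustrative assumption, not the REPERE system.

```python
# Illustrative sketch of constrained agglomerative clustering: merge the
# most similar pair of clusters at each step, but skip any pair whose
# associated written names disagree, until no allowed merge beats the
# stopping threshold.

def constrained_clustering(turns, names, similarity, threshold):
    """turns: list of speech-turn ids; names: {turn_id: written name or None};
    similarity: callable on two clusters (frozensets of turn ids)."""
    clusters = [frozenset([t]) for t in turns]

    def cluster_names(c):
        return {names[t] for t in c if names[t] is not None}

    while True:
        best, best_sim = None, threshold
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Constraint: never merge clusters with conflicting names.
                if cluster_names(a) and cluster_names(b) \
                        and cluster_names(a) != cluster_names(b):
                    continue
                s = similarity(a, b)
                if s > best_sim:
                    best, best_sim = (i, j), s
        if best is None:
            return clusters
        i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

turns = ["A1", "A2", "B1"]
names = {"A1": "Alice", "A2": None, "B1": "Bob"}

def sim(a, b):
    # Toy acoustic similarity: same speaker prefix -> higher score.
    return 0.9 if {t[0] for t in a} == {t[0] for t in b} else 0.8

result = constrained_clustering(turns, names, sim, threshold=0.5)
print(sorted(sorted(c) for c in result))  # Alice's turns merge; Bob stays apart
```

Without the name constraint, the 0.8 cross-speaker similarity would exceed the 0.5 threshold and all three turns would collapse into one cluster; the written names keep the two speakers separate.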

    Learning Multimodal Latent Attributes

    The rapid development of social media sharing has created a huge demand for automatic media classification and annotation techniques. Attribute learning has emerged as a promising paradigm for bridging the semantic gap and addressing data sparsity by transferring attribute knowledge in object recognition and relatively simple action classification. In this paper, we address the task of attribute learning for understanding multimedia data with sparse and incomplete labels. In particular, we focus on videos of social group activities, which are particularly challenging and topical examples of this task because of their multi-modal content and their complex and unstructured nature relative to the density of annotations. To solve this problem, we (1) introduce a concept of semi-latent attribute space, expressing user-defined and latent attributes in a unified framework, and (2) propose a novel scalable probabilistic topic model for learning multi-modal semi-latent attributes, which dramatically reduces requirements for an exhaustive accurate attribute ontology and expensive annotation effort. We show that our framework is able to exploit latent attributes to outperform contemporary approaches on a variety of realistic multimedia sparse-data learning tasks including multi-task learning, learning with label noise, N-shot transfer learning and, importantly, zero-shot learning

    Recognizing faces in news photographs on the web

    We propose a graph-based method for recognizing the faces that appear on the web using a small training set. First, relevant pictures of the desired people are collected by querying the name in a text-based search engine in order to construct the data set. Then, faces detected in these photographs are represented using SIFT features extracted from facial features. The similarities of faces are represented in a graph, which is then used in a random walk with restart algorithm to provide links between faces. Those links are used for recognition by two different methods. © 2009 IEEE
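
The random walk with restart over the face-similarity graph can be sketched as follows. The adjacency matrix, restart probability, and iteration count are illustrative choices, not values from the paper.

```python
# Rough sketch of random walk with restart (RWR): a walker repeatedly
# follows similarity-weighted edges but jumps back to the seed face with
# probability `restart`; the stationary scores rank every face by its
# relevance (graph proximity) to the seed.
import numpy as np

def rwr(W, seed, restart=0.15, iters=100):
    """Relevance of every node to `seed` via random walk with restart."""
    P = W / W.sum(axis=0, keepdims=True)   # column-normalized transitions
    e = np.zeros(W.shape[0])
    e[seed] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1 - restart) * P @ r + restart * e
    return r

# Four faces: 0-1 strongly similar, 2-3 strongly similar, weak cross links.
W = np.array([[0, 5, 1, 0],
              [5, 0, 0, 1],
              [1, 0, 0, 5],
              [0, 1, 5, 0]], float)
scores = rwr(W, seed=0)
# The most relevant other face to face 0 should be face 1.
print(int(np.argmax(scores[1:]) + 1))  # 1
```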