799 research outputs found

    Automatic Synchronization of Multi-User Photo Galleries

    Full text link
    In this paper we address the issue of photo galleries synchronization, where pictures related to the same event are collected by different users. Existing solutions to address the problem are usually based on unrealistic assumptions, like time consistency across photo galleries, and often heavily rely on heuristics, limiting therefore the applicability to real-world scenarios. We propose a solution that achieves better generalization performance for the synchronization task compared to the available literature. The method is characterized by three stages: at first, deep convolutional neural network features are used to assess the visual similarity among the photos; then, pairs of similar photos are detected across different galleries and used to construct a graph; eventually, a probabilistic graphical model is used to estimate the temporal offset of each pair of galleries, by traversing the minimum spanning tree extracted from this graph. The experimental evaluation is conducted on four publicly available datasets covering different types of events, demonstrating the strength of our proposed method. A thorough discussion of the obtained results is provided for a critical assessment of the quality in synchronization.Comment: ACCEPTED to IEEE Transactions on Multimedi

    Organising a photograph collection based on human appearance

    Get PDF
    This thesis describes a complete framework for organising digital photographs in an unsupervised manner, based on the appearance of people captured in the photographs. Organising a collection of photographs manually, especially providing the identities of people captured in photographs, is a time consuming task. Unsupervised grouping of images containing similar persons makes annotating names easier (as a group of images can be named at once) and enables quick search based on query by example. The full process of unsupervised clustering is discussed in this thesis. Methods for locating facial components are discussed and a technique based on colour image segmentation is proposed and tested. Additionally a method based on the Principal Component Analysis template is tested, too. These provide eye locations required for acquiring a normalised facial image. This image is then preprocessed by a histogram equalisation and feathering, and the features of MPEG-7 face recognition descriptor are extracted. A distance measure proposed in the MPEG-7 standard is used as a similarity measure. Three approaches to grouping that use only face recognition features for clustering are analysed. These are modified k-means, single-link and a method based on a nearest neighbour classifier. The nearest neighbour-based technique is chosen for further experiments with fusing information from several sources. These sources are context-based such as events (party, trip, holidays), the ownership of photographs, and content-based such as information about the colour and texture of the bodies of humans appearing in photographs. Two techniques are proposed for fusing event and ownership (user) information with the face recognition features: a Transferable Belief Model (TBM) and three level clustering. The three level clustering is carried out at “event” level, “user” level and “collection” level. The latter technique proves to be most efficient. For combining body information with the face recognition features, three probabilistic fusion methods are tested. These are the average sum, the generalised product and the maximum rule. Combinations are tested within events and within user collections. This work concludes with a brief discussion on extraction of key images for a representation of each cluster

    Enhancing person annotation for personal photo management using content and context based technologies

    Get PDF
    Rapid technological growth and the decreasing cost of photo capture means that we are all taking more digital photographs than ever before. However, lack of technology for automatically organising personal photo archives has resulted in many users left with poorly annotated photos, causing them great frustration when such photo collections are to be browsed or searched at a later time. As a result, there has recently been significant research interest in technologies for supporting effective annotation. This thesis addresses an important sub-problem of the broad annotation problem, namely "person annotation" associated with personal digital photo management. Solutions to this problem are provided using content analysis tools in combination with context data within the experimental photo management framework, called “MediAssist”. Readily available image metadata, such as location and date/time, are captured from digital cameras with in-built GPS functionality, and thus provide knowledge about when and where the photos were taken. Such information is then used to identify the "real-world" events corresponding to certain activities in the photo capture process. The problem of enabling effective person annotation is formulated in such a way that both "within-event" and "cross-event" relationships of persons' appearances are captured. The research reported in the thesis is built upon a firm foundation of content-based analysis technologies, namely face detection, face recognition, and body-patch matching together with data fusion. Two annotation models are investigated in this thesis, namely progressive and non-progressive. The effectiveness of each model is evaluated against varying proportions of initial annotation, and the type of initial annotation based on individual and combined face, body-patch and person-context information sources. The results reported in the thesis strongly validate the use of multiple information sources for person annotation whilst emphasising the advantage of event-based photo analysis in real-life photo management systems

    Unified and Dynamic Graph for Temporal Character Grouping in Long Videos

    Full text link
    Video temporal character grouping locates appearing moments of major characters within a video according to their identities. To this end, recent works have evolved from unsupervised clustering to graph-based supervised clustering. However, graph methods are built upon the premise of fixed affinity graphs, bringing many inexact connections. Besides, they extract multi-modal features with kinds of models, which are unfriendly to deployment. In this paper, we present a unified and dynamic graph (UniDG) framework for temporal character grouping. This is accomplished firstly by a unified representation network that learns representations of multiple modalities within the same space and still preserves the modality's uniqueness simultaneously. Secondly, we present a dynamic graph clustering where the neighbors of different quantities are dynamically constructed for each node via a cyclic matching strategy, leading to a more reliable affinity graph. Thirdly, a progressive association method is introduced to exploit spatial and temporal contexts among different modalities, allowing multi-modal clustering results to be well fused. As current datasets only provide pre-extracted features, we evaluate our UniDG method on a collected dataset named MTCG, which contains each character's appearing clips of face and body and speaking voice tracks. We also evaluate our key components on existing clustering and retrieval datasets to verify the generalization ability. Experimental results manifest that our method can achieve promising results and outperform several state-of-the-art approaches

    Deep Feature Representation and Similarity Matrix based Noise Label Refinement Method for Efficient Face Annotation

    Get PDF
    Face annotation is a naming procedure that assigns the correct name to a person emerging from an image. Faces that are manually annotated by people in online applications include incorrect labels, giving rise to the issue of label ambiguity. This may lead to mislabelling in face annotation. Consequently, an efficient method is still essential to enhance the reliability of face annotation. Hence, in this work, a novel method named the Similarity Matrix-based Noise Label Refinement (SMNLR) is proposed, which effectively predicts the accurate label from the noisy labelled facial images. To enhance the performance of the proposed method, the deep learning technique named Convolutional Neural Networks (CNN) is used for feature representation. Several experiments are conducted to evaluate the effectiveness of the proposed face annotation method using the LFW, IMFDB and Yahoo datasets. The experimental results clearly illustrate the robustness of the proposed SMNLR method in dealing with noisy labelled faces

    Bridging semantic gap: learning and integrating semantics for content-based retrieval

    Full text link
    Digital cameras have entered ordinary homes and produced^incredibly large number of photos. As a typical example of broad image domain, unconstrained consumer photos vary significantly. Unlike professional or domain-specific images, the objects in the photos are ill-posed, occluded, and cluttered with poor lighting, focus, and exposure. Content-based image retrieval research has yet to bridge the semantic gap between computable low-level information and high-level user interpretation. In this thesis, we address the issue of semantic gap with a structured learning framework to allow modular extraction of visual semantics. Semantic image regions (e.g. face, building, sky etc) are learned statistically, detected directly from image without segmentation, reconciled across multiple scales, and aggregated spatially to form compact semantic index. To circumvent the ambiguity and subjectivity in a query, a new query method that allows spatial arrangement of visual semantics is proposed. A query is represented as a disjunctive normal form of visual query terms and processed using fuzzy set operators. A drawback of supervised learning is the manual labeling of regions as training samples. In this thesis, a new learning framework to discover local semantic patterns and to generate their samples for training with minimal human intervention has been developed. The discovered patterns can be visualized and used in semantic indexing. In addition, three new class-based indexing schemes are explored. The winnertake- all scheme supports class-based image retrieval. The class relative scheme and the local classification scheme compute inter-class memberships and local class patterns as indexes for similarity matching respectively. A Bayesian formulation is proposed to unify local and global indexes in image comparison and ranking that resulted in superior image retrieval performance over those of single indexes. Query-by-example experiments on 2400 consumer photos with 16 semantic queries show that the proposed approaches have significantly better (18% to 55%) average precisions than a high-dimension feature fusion approach. The thesis has paved two promising research directions, namely the semantics design approach and the semantics discovery approach. They form elegant dual frameworks that exploits pattern classifiers in learning and integrating local and global image semantics
    • 

    corecore