
    The role of context in image annotation and recommendation

    With the rise of smartphones, lifelogging devices (e.g. Google Glass) and the popularity of image sharing websites (e.g. Flickr), users are capturing and sharing every aspect of their lives online, producing a wealth of visual content. Of these uploaded images, the majority are poorly annotated or exist in complete semantic isolation, making the process of building retrieval systems difficult, as one must first understand the meaning of an image in order to retrieve it. To alleviate this problem, many image sharing websites offer manual annotation tools which allow the user to “tag” their photos; however, these techniques are laborious and as a result have been poorly adopted: Sigurbjörnsson and van Zwol (2008) showed that 64% of images uploaded to Flickr are annotated with < 4 tags. Due to this, an entire body of research has focused on the automatic annotation of images (Hanbury, 2008; Smeulders et al., 2000; Zhang et al., 2012a), where one attempts to bridge the semantic gap between an image’s appearance and its meaning, e.g. the objects present. Despite two decades of research, the semantic gap still largely exists and, as a result, automatic annotation models often offer unsatisfactory performance for industrial implementation. Further, these techniques can only annotate what they see, thus ignoring the “bigger picture” surrounding an image (e.g. its location, the event, the people present, etc.). Much work has therefore focused on building photo tag recommendation (PTR) methods which aid the user in the annotation process by suggesting tags related to those already present. These works have mainly focused on computing relationships between tags based on historical images, e.g. that NY and timessquare co-exist in many images and are therefore highly correlated. However, tags are inherently noisy, sparse and ill-defined, often resulting in poor PTR accuracy; e.g. does NY refer to New York or New Year? This thesis proposes the exploitation of an image’s context which, unlike textual evidence, is always present, in order to alleviate this ambiguity in the tag recommendation process. Specifically, we exploit the “what, who, where, when and how” of the image capture process in order to complement textual evidence in various photo tag recommendation and retrieval scenarios. In part II, we combine textual, content-based (e.g. number of faces present) and contextual (e.g. day of the week taken) signals for tag recommendation purposes, achieving up to a 75% improvement to precision@5 in comparison to a text-only TF-IDF baseline. We then consider external knowledge sources (i.e. Wikipedia and Twitter) as an alternative to (slower moving) Flickr on which to build recommendation models, showing that similar accuracy can be achieved on these faster moving, yet entirely textual, datasets. In part II, we also highlight the merits of diversifying tag recommendation lists, before discussing at length various problems with existing automatic image annotation and photo tag recommendation evaluation collections. In part III, we propose three new image retrieval scenarios, namely “visual event summarisation”, “image popularity prediction” and “lifelog summarisation”. In the first scenario, we attempt to produce a ranking of relevant and diverse images for various news events by (i) removing irrelevant images such as memes and visual duplicates, before (ii) semantically clustering images based on the tweets in which they were originally posted. Using this approach, we were able to achieve over 50% precision for images in the top 5 ranks. In the second retrieval scenario, we show that by combining contextual and content-based features, we are able to predict whether an image will become "popular" (or not) with 74% accuracy, using an SVM classifier. Finally, in chapter 9, we employ blur detection and perceptual-hash clustering in order to remove noisy images from lifelogs, before combining visual and geo-temporal signals in order to capture a user's "key moments" within their day. We believe that the results of this thesis represent an important step towards building effective image retrieval models when sufficient textual content is lacking (i.e. a cold start).
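    The thesis reports tag recommendation gains as precision@5 over a text-only TF-IDF baseline. As a point of reference, here is a minimal sketch of how precision@k is typically computed for a ranked tag list (the tags below are hypothetical, not drawn from the thesis):

```python
def precision_at_k(recommended, relevant, k=5):
    """Fraction of the top-k recommended tags that appear in the ground-truth tag set."""
    hits = sum(1 for tag in recommended[:k] if tag in relevant)
    return hits / k

# Hypothetical ground-truth tags and a recommender's ranked output.
ground_truth = {"newyork", "timessquare", "night", "crowd"}
ranked_tags = ["newyork", "newyear", "timessquare", "party", "night"]

print(precision_at_k(ranked_tags, ground_truth))  # 3 of the top 5 are relevant -> 0.6
```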

    ICMR 2014: 4th ACM International Conference on Multimedia Retrieval

    ICMR began as a workshop on challenges in image retrieval (in Newcastle in 1998) and was later transformed into the Conference on Image and Video Retrieval (CIVR) series. In 2011, CIVR and the ACM Workshop on Multimedia Information Retrieval were combined into a single conference that now forms the ICMR series. The 4th ACM International Conference on Multimedia Retrieval took place in Glasgow, Scotland, from 1–4 April 2014. This was the largest edition of ICMR to date, with approximately 170 attendees from 25 different countries. ICMR is one of the premier scientific conferences for multimedia retrieval worldwide, with the stated mission “to illuminate the state of the art in multimedia retrieval by bringing together researchers and practitioners in the field of multimedia retrieval.” According to the Chinese Computing Federation Conference Ranking (2013), ACM ICMR is the number one multimedia retrieval conference worldwide and the number four conference in the category of multimedia and graphics. Although ICMR is about multimedia retrieval, in a wider sense it is also about automated multimedia understanding. Much of the work in this area involves the analysis of media at the pixel, voxel and wavelet level, but it also involves innovative retrieval, visualisation and interaction paradigms utilising the nature of the multimedia, be it video, images, speech, or more abstract (sensor) data. The conference aims to promote intellectual exchange and interaction among scientists, engineers, students and multimedia researchers in academia and industry through various events, including a keynote talk, oral, special and poster sessions focused on research challenges and solutions, technical and industrial demonstrations of prototypes, tutorials, research, and an industrial panel. In the remainder of this report, we summarise the events that took place at the 4th ACM ICMR conference.

    Exploiting Twitter and Wikipedia for the annotation of event images

    With the rise in popularity of smartphones, there has been a recent increase in the number of images taken at large social (e.g. festivals) and world (e.g. natural disasters) events and uploaded to image sharing websites such as Flickr. As with all online images, they are often poorly annotated, resulting in a difficult retrieval scenario. To overcome this problem, many photo tag recommendation methods have been introduced; however, these methods all rely on historical Flickr data, which is often problematic for a number of reasons, including the time lag problem (i.e. in our collection, users upload images on average 50 days after taking them, meaning "training data" is often out of date). In this paper, we develop an image annotation model which exploits textual content from related Twitter and Wikipedia data in order to overcome these problems. The results of our experiments highlight the merits of exploiting social media data for annotating event images, where we are able to achieve recommendation accuracy comparable with a state-of-the-art model.
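    As an illustration of the general idea of mining event-related social media text for candidate tags, a minimal sketch using simple term frequency over tweets (the tokenisation, stop-word list and example tweets are assumptions for illustration, not the paper's actual pipeline):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "at", "is", "in", "on", "and", "of", "to", "rt", "but", "it"}

def candidate_tags(tweets, top_n=5):
    """Rank candidate tags for an event by how often terms appear in related tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase and keep simple word/hashtag tokens, dropping the leading '#'.
        tokens = re.findall(r"#?\w+", tweet.lower())
        counts.update(t.lstrip("#") for t in tokens if t.lstrip("#") not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical tweets about a festival.
tweets = [
    "Amazing headline set at #glastonbury tonight!",
    "The pyramid stage crowd at glastonbury is unreal",
    "Glastonbury mud everywhere but the music is worth it",
]
print(candidate_tags(tweets))  # e.g. ['glastonbury', 'amazing', 'headline', ...]
```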

    A novel system for the semi automatic annotation of event images

    With the rise in popularity of smartphones, taking and sharing photographs has never been more accessible. Further, photo sharing websites, such as Flickr, have made the distribution of photographs easy, resulting in an increase in visual content uploaded online. Due to the laborious nature of annotating images, however, a large percentage of these images are unannotated, making their organisation and retrieval difficult. There has therefore been a recent research focus on the automatic and semi-automatic annotation of these images. Despite the progress made in this field, annotating images automatically based on their visual appearance often results in unsatisfactory suggestions, and as a result these models have not been adopted by photo sharing websites. Many methods have therefore looked to exploit new sources of evidence for annotation purposes, such as image context. In this demonstration, we instead explore the scenario of annotating images taken at large-scale events, where evidence can be extracted from a wealth of online textual resources. Specifically, we present a novel tag recommendation system for images taken at a popular music festival, which allows the user to select relevant tags from related tweets and Wikipedia content, thus reducing the workload involved in the annotation process.

    Exploiting time in automatic image tagging

    Existing automatic image annotation (AIA) models that depend solely on low-level image features often produce poor results, particularly when annotating real-life collections. Tag co-occurrence has been shown to improve image annotation by identifying additional keywords associated with user-provided keywords. However, existing approaches have treated tag co-occurrence as a static measure over time, thereby ignoring the temporal trends of many tags. Yet the temporal distribution of tags, caused by events, seasons, memes, etc., provides a strong source of evidence beyond keywords for AIA. In this paper, we propose a temporal tag co-occurrence approach to improve upon the current state-of-the-art automatic image annotation model. By replacing the annotated tags with more temporally significant tags, we achieve statistically significant increases in annotation accuracy on a real-life timestamped image collection from Flickr.
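    A minimal sketch contrasting static tag co-occurrence with a time-windowed variant, the core idea above (the toy data and 30-day window are illustrative assumptions, not the paper's parameters):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical timestamped, tagged images.
images = [
    {"tags": {"fireworks", "newyear", "night"}, "taken": datetime(2012, 1, 1)},
    {"tags": {"fireworks", "party"}, "taken": datetime(2012, 1, 1)},
    {"tags": {"fireworks", "july4", "bbq"}, "taken": datetime(2012, 7, 4)},
]

def cooccurring_tags(images, query_tag, taken=None, window_days=30):
    """Count tags co-occurring with query_tag, optionally only within a time window."""
    counts = Counter()
    for img in images:
        if query_tag not in img["tags"]:
            continue
        if taken is not None and abs(img["taken"] - taken) > timedelta(days=window_days):
            continue  # outside the temporal window: ignored by the temporal variant
        counts.update(img["tags"] - {query_tag})
    return counts

# Static co-occurrence mixes New Year and 4th of July contexts...
print(cooccurring_tags(images, "fireworks"))
# ...while a window around early January keeps only seasonally relevant tags.
print(cooccurring_tags(images, "fireworks", taken=datetime(2012, 1, 2)))
```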

    "Nobody comes here anymore, it's too crowded"; Predicting Image Popularity on Flickr

    Predicting popular content is a challenging problem for social media websites seeking to encourage user interaction and activity. Existing works in this area, including the recommendation approach used by Flickr (called "interestingness"), consider only click-through data, tags, comments and explicit user feedback in this computation. On image sharing websites, however, many images have no tags and, initially, an image has no interaction data. In this case, these existing approaches fail due to a lack of evidence. In this paper, we therefore focus on image popularity prediction in a cold start scenario (i.e. where there is no, or limited, textual/interaction data), by considering an image's context, visual appearance and user context. Specifically, we predict the number of comments and views an image receives based on a number of new features for this purpose. Experimenting on the MIR-Flickr 1M collection, we are able to overcome the problems associated with popularity prediction in a cold start, achieving accuracy of up to 76%.
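    A minimal sketch of framing cold-start popularity prediction as binary classification over combined contextual and content-based features with an SVM, here using scikit-learn on synthetic data (the feature set and labels are illustrative assumptions, not the paper's):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical per-image feature vectors: [hour taken, day of week, user contact count,
# mean brightness, number of faces]; labels: 1 = "popular", 0 = not popular.
X = rng.random((500, 5))
y = (X[:, 2] + 0.5 * X[:, 4] + 0.1 * rng.standard_normal(500) > 0.75).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # an RBF-kernel SVM classifier
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```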

    Collections for automatic image annotation and photo tag recommendation

    This paper highlights a number of problems which exist in the evaluation of existing image annotation and tag recommendation methods. Crucially, the collections used by these state-of-the-art methods contain a number of biases which may be exploited by, or be detrimental to, their evaluation, resulting in misleading results. In total, we highlight seven issues for three popular annotation evaluation collections, i.e. Corel5k, ESP Game and IAPR, as well as three issues with collections used in two state-of-the-art photo tag recommendation methods. The result of this paper is two freely available Flickr image collections designed for the fair evaluation of image annotation and tag recommendation methods, called Flickr-AIA and Flickr-PTR respectively. We show through experimentation and demonstration that these collections are ultimately fairer benchmarks than existing collections.

    On contextual photo tag recommendation

    Image tagging is a growing application on social media websites; however, the performance of many auto-tagging methods is often poor. Recent work has exploited an image's context (e.g. time and location) in the tag recommendation process, where tags which co-occur highly within a given time interval or geographical area are promoted. These models, however, fail to address how and when different image contexts can be combined. In this paper, we propose a weighted tag recommendation model, building on an existing state-of-the-art approach, which varies the importance of time and location in the recommendation process based on a given set of input tags. By retrieving more temporally and geographically relevant tags, we achieve statistically significant improvements in recommendation accuracy when testing on 519k images collected from Flickr. The result of this paper is an important step towards more effective image annotation and retrieval systems.
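    A minimal sketch of the general idea of weighting time-based and location-based tag co-occurrence scores when merging them into a single recommendation list (the scores, tags and weights are hypothetical, and the paper's actual weighting scheme may differ):

```python
def weighted_recommendation(time_scores, geo_scores, w_time=0.6, w_geo=0.4, top_n=5):
    """Linearly combine time-based and location-based co-occurrence scores per tag."""
    tags = set(time_scores) | set(geo_scores)
    combined = {
        tag: w_time * time_scores.get(tag, 0.0) + w_geo * geo_scores.get(tag, 0.0)
        for tag in tags
    }
    return sorted(combined, key=combined.get, reverse=True)[:top_n]

# Hypothetical co-occurrence scores for the input tag "festival".
time_scores = {"summer": 0.9, "music": 0.7, "night": 0.4}
geo_scores = {"glasgow": 0.8, "music": 0.6, "park": 0.3}

# The weights could be varied per query, e.g. favouring location for place-like input tags.
print(weighted_recommendation(time_scores, geo_scores, w_time=0.3, w_geo=0.7))
```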

    "My Day in Review": Visually Summarising Noisy Lifelog Data

    Lifelogging devices, which seamlessly gather various data about a user as they go about their daily life, have resulted in users amassing large collections of noisy photographs (e.g. visual duplicates, blurred images), which are difficult to navigate, especially when a user wants to review their day in photographs. Social media websites, such as Facebook, have faced a similar information overload problem, for which a number of summarisation methods have been proposed (e.g. news story clustering, comment ranking, etc.). In particular, Facebook's Year in Review, which identifies key moments in a user's year and offers an automatic visual summary based on their uploaded content, received much user interest. In this paper, we follow this notion by automatically creating a review of a user's day using lifelogging images. Specifically, we address the quality issues of photographs taken on lifelogging devices and attempt to create visual summaries by promoting visual and temporal-spatial diversity in the top ranks. Conducting two crowdsourced evaluations based on 9k images, we show the merits of combining time, location and visual appearance for summarisation purposes.
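    A minimal sketch of the two noise filters mentioned above, blur detection via the variance of the Laplacian and near-duplicate removal via perceptual hashing, using the OpenCV and imagehash/Pillow libraries (the thresholds and file names are illustrative assumptions):

```python
import cv2                      # pip install opencv-python
import imagehash                # pip install imagehash
from PIL import Image

BLUR_THRESHOLD = 100.0          # assumed cut-off; would be tuned on real lifelog data
DUPLICATE_DISTANCE = 8          # assumed Hamming-distance cut-off between perceptual hashes

def is_blurry(path, threshold=BLUR_THRESHOLD):
    """Low variance of the Laplacian indicates few sharp edges, i.e. a blurry image."""
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def remove_noise(paths):
    """Drop blurry images and near-duplicates (images with similar perceptual hashes)."""
    kept, kept_hashes = [], []
    for path in paths:
        if is_blurry(path):
            continue
        h = imagehash.phash(Image.open(path))
        if any(h - other <= DUPLICATE_DISTANCE for other in kept_hashes):
            continue  # visually near-identical to an image already kept
        kept.append(path)
        kept_hashes.append(h)
    return kept

# Usage (hypothetical lifelog frames): remove_noise(["frame_0001.jpg", "frame_0002.jpg"])
```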