
    Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

    This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. Such a problem-oriented taxonomy has allowed us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. This comprehensive problem-oriented review of the advances in transfer learning has revealed not only the challenges in transfer learning for visual recognition, but also the problems (eight of the seventeen) that have scarcely been studied. This survey thus presents an up-to-date technical review for researchers, and also offers machine learning practitioners a systematic way to categorise a real problem and look up a possible solution accordingly.
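    As a purely illustrative aside, the survey's attribute-driven taxonomy can be thought of as a lookup from data/label attributes to a problem category. The sketch below assumes three hypothetical binary attributes and three hypothetical category names; the survey's actual seventeen problems rest on a richer attribute set.

        # Illustrative sketch: categorise a cross-dataset recognition problem
        # by data/label attributes, then look up a matching transfer setting.
        # Attributes and category names are hypothetical, not the survey's.
        from dataclasses import dataclass

        @dataclass(frozen=True)
        class ProblemAttributes:
            target_labels_available: bool   # are target-domain labels available?
            label_spaces_match: bool        # do source/target share classes?
            feature_spaces_match: bool      # homogeneous vs. heterogeneous features

        TAXONOMY = {
            ProblemAttributes(True, True, True): "supervised domain adaptation",
            ProblemAttributes(False, True, True): "unsupervised domain adaptation",
            ProblemAttributes(False, False, True): "zero-shot recognition",
        }

        def categorise(attrs: ProblemAttributes) -> str:
            # unmapped attribute combinations flag a less-studied setting
            return TAXONOMY.get(attrs, "scarcely studied setting")

        print(categorise(ProblemAttributes(False, True, True)))
        # -> unsupervised domain adaptation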

    Object Recognition and Parsing with Weak Supervision

    Object recognition is a fundamental problem in computer vision and has attracted a great deal of research attention, while object parsing is equally important for many computer vision tasks but has been studied far less. With the recent development of deep neural networks, computer vision research has been dominated by deep learning approaches, which require large amounts of training data for a specific task in a specific domain. The cost of collecting rare samples and producing "hard" labels is prohibitively high and has limited the development of many important vision studies, including object parsing. This dissertation focuses on object recognition and parsing with weak supervision, which tackles the problem when only a limited amount of data or labels is available for training deep neural networks in the target domain. The goal is to design more advanced computer vision models with enhanced data efficiency during training and increased robustness to out-of-distribution samples at test time. To achieve this goal, I introduce several strategies, including unsupervised learning of compositional components in deep neural networks, zero/few-shot learning that preserves useful knowledge acquired in pre-training, weakly supervised learning combined with spatio-temporal information in video data, and learning from 3D computer graphics models and synthetic data. Furthermore, I discuss new findings in our cognitive science projects and explain how part-based representations benefit the development of visual analogical reasoning models. I believe this series of works alleviates the data-hunger problem of deep neural networks and brings computer vision models closer to human intelligence.
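    One of the strategies named above, zero/few-shot learning that preserves knowledge acquired in pre-training, is commonly instantiated by freezing a pre-trained backbone and training only a small head on the scarce target labels. The following is a minimal sketch of that generic recipe (assuming the PyTorch/torchvision APIs), not the dissertation's specific method:

        # Minimal few-shot sketch: freeze a pre-trained backbone so its
        # knowledge is preserved; train only a new classification head.
        import torch
        import torch.nn as nn
        from torchvision import models

        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        for p in backbone.parameters():
            p.requires_grad = False                 # keep pre-trained weights intact
        backbone.fc = nn.Linear(backbone.fc.in_features, 5)  # 5 novel classes

        optimiser = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()

        def few_shot_step(images: torch.Tensor, labels: torch.Tensor) -> float:
            logits = backbone(images)               # only the new head is trainable
            loss = loss_fn(logits, labels)
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
            return loss.item()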

    Automatic Image Annotation using Image Clustering in Multi – Agent Society

    The rapid growth of the internet provides a tremendous resource for information in different domains (text, image, voice, and many others). This growth introduces a new challenge: hitting an exact match among the huge number of documents returned by search engines, where millions of items can be returned for a certain subject. Images are an important information resource, and billions of images are searched to fulfill user demands, a process that faces the same challenge. Automatic image annotation is a promising methodology for image retrieval; however, most current annotation models are not yet sophisticated enough to produce high-quality annotations. This thesis presents online intelligent indexing for image repositories based on their content. Although content-based indexing and retrieval systems have been introduced before, this thesis adds an intelligent technique that re-indexes images as their constituent concepts become better understood. A collaborative agent scheme has been developed to promote the objects of an image to concepts and to re-index the image according to domain specifications. The thesis also presents an automatic annotation system based on the interaction between intelligent agents; such interaction mirrors the socialisation behaviour that dominates an agent society. The presented system exploits the knowledge evolution yielded by this socialisation to drive the annotation process.
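    To make the agent-society idea concrete, here is a minimal, hypothetical sketch: each agent maps detected objects to domain concepts, and "socialisation" lets agents exchange knowledge before an image is re-indexed. All class names and mappings below are invented for illustration and are not taken from the thesis:

        # Hypothetical sketch of collaborative agents promoting detected
        # objects to concepts and sharing knowledge ("socialisation").
        class Agent:
            def __init__(self, domain: str, knowledge: dict[str, str]):
                self.domain = domain
                self.knowledge = knowledge          # object label -> concept

            def promote(self, objects: list[str]) -> set[str]:
                # promote recognised objects to higher-level concepts
                return {self.knowledge[o] for o in objects if o in self.knowledge}

            def socialise(self, other: "Agent") -> None:
                # knowledge evolution: adopt mappings learned by a peer agent
                self.knowledge.update(other.knowledge)

        sports = Agent("sports", {"ball": "football match", "net": "goal"})
        nature = Agent("nature", {"tree": "forest", "ball": "fruit"})
        sports.socialise(nature)                    # nature's mappings win on overlap

        index: dict[str, set[str]] = {}
        index["img_001.jpg"] = sports.promote(["ball", "tree"])
        print(index)   # e.g. {'img_001.jpg': {'fruit', 'forest'}}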

    Modeling Visual Rhetoric and Semantics in Multimedia

    Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content alone, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right vs. left political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve the modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.
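    Cross-modal representation learning of the kind mentioned above is often approached with a symmetric contrastive objective that pulls paired image and text embeddings together. The sketch below (assuming PyTorch) shows that generic InfoNCE-style loss; it is one standard technique, not the dissertation's specific model for weakly aligned media:

        # Generic symmetric contrastive (InfoNCE-style) loss over a batch
        # of paired image/text embeddings; matched pairs share an index.
        import torch
        import torch.nn.functional as F

        def contrastive_loss(img_emb: torch.Tensor,
                             txt_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
            img = F.normalize(img_emb, dim=-1)
            txt = F.normalize(txt_emb, dim=-1)
            logits = img @ txt.t() / temperature    # pairwise similarities
            targets = torch.arange(img.size(0))     # i-th image matches i-th text
            return (F.cross_entropy(logits, targets) +
                    F.cross_entropy(logits.t(), targets)) / 2

        # toy usage: a batch of 4 paired 128-d embeddings
        loss = contrastive_loss(torch.randn(4, 128), torch.randn(4, 128))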

    Multi-view representation learning for natural language processing applications

    The pervasion of machine learning in a vast number of applications has given rise to an increasing demand for the effective processing of complex, diverse and variable datasets. One representative case of data diversity can be found in multi-view datasets, which contain input originating from more than one source or having multiple aspects or facets. Examples include, but are not restricted to, multimodal datasets, where data may consist of audio, image and/or text. The nature of multi-view datasets calls for special treatment in terms of representation. A subsequent fundamental problem is that of combining information from potentially incoherent sources; a problem commonly referred to as view fusion. Quite often, the heuristic solution of early fusion is applied to this problem: aggregating representations from different views using a simple function (concatenation, summation or mean pooling). However, early fusion can cause overfitting in the case of small training samples, and it may also result in specific statistical properties of each view being lost in the learning process. Representation learning, the set of ideas and algorithms devised to learn meaningful representations for machine learning problems, has recently grown into a vibrant research field that encompasses multi-view setups. A plethora of multi-view representation learning methods has been proposed in the literature, with a large portion of them being based on the idea of maximising the correlation between available views. Commonly, such techniques are evaluated on synthetic datasets or strictly defined benchmark setups; a role that, within Natural Language Processing, is often assumed by the multimodal sentiment analysis problem. This thesis argues that more complex downstream applications could benefit from such representations, and describes a multi-view contemplation of a range of tasks, from static, two-view, unimodal to dynamic, three-view, trimodal applications, setting out to explore the limits of the applicability of multi-view representation learning. More specifically, we experiment with document summarisation, framing it as a multi-view problem where documents and summaries are considered two separate, textual views. Moreover, we present a multi-view inference algorithm for the bimodal problem of image captioning. Delving more into multimodal setups, we develop a set of multi-view models for applications pertaining to videos, including tagging and text generation tasks. Finally, we introduce narration generation, a new text generation task from movie videos, that requires inference on the storyline level and temporal context-based reasoning. The main argument of the thesis is that, due to their performance, multi-view representation learning tools warrant serious consideration by researchers and practitioners in the Natural Language Processing community. Exploring the limits of multi-view representations, we investigate their fitness for Natural Language Processing tasks and show that they are able to hold the information required for complex problems, while being a good alternative to the early fusion paradigm.
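    To illustrate the contrast drawn above between early fusion and correlation-maximising approaches, the following sketch (using NumPy and scikit-learn's CCA on synthetic toy data) shows concatenation-based early fusion next to a canonical correlation projection into a shared space; the feature dimensions are arbitrary:

        # Early fusion vs. correlation-based fusion (CCA) on toy two-view data.
        import numpy as np
        from sklearn.cross_decomposition import CCA

        rng = np.random.default_rng(0)
        view_a = rng.normal(size=(100, 20))   # e.g. text features
        view_b = rng.normal(size=(100, 30))   # e.g. audio features

        # Early fusion: a simple aggregation of the views.
        fused = np.concatenate([view_a, view_b], axis=1)    # shape (100, 50)

        # Correlation-based fusion: project both views into a shared space
        # where their correlation is maximised.
        cca = CCA(n_components=5)
        proj_a, proj_b = cca.fit_transform(view_a, view_b)  # shape (100, 5) each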

    Deliverable D2.3 Specification of Web mining process for hypervideo concept identification

    This deliverable presents a state-of-the-art and requirements-analysis report for the web mining process as part of WP2 of the LinkedTV project. The deliverable is divided into two subject areas: a) Named Entity Recognition (NER) and b) retrieval of additional content. The introduction gives an outline of the workflow of the work package, with a subsection devoted to relations with other work packages. The state-of-the-art review is focused on prospective techniques for LinkedTV. In the NER domain, the main focus is on knowledge-based approaches, which facilitate disambiguation of identified entities using linked open data. As part of the NER requirements analysis, the first tools developed (NERD, SemiTags and THD) are described and evaluated. The area of linked additional content is broader and requires a more thorough analysis. A balanced overview of techniques for dealing with the various knowledge sources (semantic web resources, web APIs and completely unstructured resources from a white list of web sites) is presented. The requirements analysis comes out of the RBB and Sound and Vision LinkedTV scenarios.
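    As a toy illustration of the knowledge-based disambiguation idea, the sketch below ranks candidate entities for a surface form by the overlap between their knowledge-base description and the mention's context. The candidate table is invented; real tools such as those named above query linked-open-data resources like DBpedia:

        # Toy knowledge-based entity disambiguation: pick the candidate
        # whose description overlaps most with the mention's context words.
        CANDIDATES = {
            "Berlin": [
                ("dbpedia:Berlin", {"capital", "germany", "city"}),
                ("dbpedia:Berlin,_Maryland", {"town", "maryland", "usa"}),
            ],
        }

        def disambiguate(surface: str, context: set[str]) -> str:
            candidates = CANDIDATES.get(surface, [])
            if not candidates:
                return "NIL"    # no entry in the knowledge base
            # score each candidate by context/description overlap
            return max(candidates, key=lambda c: len(c[1] & context))[0]

        print(disambiguate("Berlin", {"germany", "capital", "bundestag"}))
        # -> dbpedia:Berlin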

    Crazy by design : brain research and adolescence : implications for classroom teaching, teacher learning and possibilities of teacher research

    This research aims to influence teacher understandings of brain research and its implications for teaching adolescents by addressing the following questions:
    1. What are the implications of changes in the adolescent brain for teaching, teachers and the adolescent learning environment?
    2. How can teachers better accommodate knowledge of the brain into their understandings and pedagogical practices for adolescents?
    3. What can the use of a teacher-as-researcher model contribute to teacher learning in understanding brain research and the adolescent learning environment?
    To address these questions, this research aimed to:
    1. Design, implement and evaluate a teacher learning package that would fill a gap in teacher knowledge by strengthening teacher knowledge of current brain research and deepening teacher understanding of the connection between this research and the adolescent learning environment.
    2. Support a team of teachers in using an action research methodology to apply brain-research-informed pedagogical practices, learning tools and ‘essential understandings’ of adolescents in mainstream adolescent educational learning environments to improve educational experience and success.
    3. Develop a further teacher learning package that: i) builds the capacity of teachers outside of my research, and leaders of teachers, to implement action research processes in their own context to improve practice; and ii) describes how teachers at Purple High School (PHS) worked as teacher researchers to use brain research to improve the educational experience and success of adolescent learners, and what they learned about action research as teacher learning.
    This research addresses these aims and questions by telling the story of three inter-related projects. It engaged with three areas: neuroscience, the teacher-as-researcher model, and the teacher-learning literature and research, and it built connections to teacher praxis.

    Region-based Multimedia Indexing and Retrieval Framework

    Many systems have been proposed for the automatic description and indexing of digital data for posterior retrieval. One such content-based indexing-and-retrieval system, and the one used as a framework in this thesis, is the MUVIS system, which was developed at Tampere University of Technology in Finland. Content-based Image Retrieval (CBIR) utilising frame-based and region-based features has been a dynamic research area in the past years, and several systems have been developed with their own specific segmentation, feature extraction and retrieval methods. In this thesis, a framework to model a regionalised CBIR system is presented. The framework does not specify or fix the segmentation and local feature extraction methods, which are instead treated as "black boxes" so as to allow the application of any segmentation method and visual descriptor. The proposed framework adopts a grouping approach in order to correct possible over-segmentation faults, and a spatial feature called region proximity is introduced to describe the topology of regions in a visual scene by a block-based approach. Using the MUVIS system, a prototype of the proposed framework is implemented as a region-based feature extraction module, which integrates simple colour segmentation and region-based feature description based on colour and texture. The spatial region proximity feature represents regions and describes their topology by a novel metric proposed in this thesis based on the block-based approach and average distance calculation. After the region-based feature extraction step, a feature vector is formed which holds information about all image regions with their local low-level and spatial properties. During the retrieval process, these feature vectors are used for computing the (dis-)similarity distances between two images, taking into account each of their individual components. Here, a many-to-one matching scheme between regions, characterised by a similarity maximisation approach, is integrated into a query-by-example scheme. Retrieval performance is evaluated between frame-based feature combination and the proposed framework with two different grouping approaches. Experiments are carried out on synthetic and natural image databases, and the results indicate that a promising retrieval performance can be obtained as long as reasonable segmentation quality is achieved. The integration of the region proximity feature further improves retrieval performance, especially for divisible, object-based image content. Finally, frame-based and region-based texture extraction schemes are compared to evaluate the effect of a region on texture description and retrieval performance within the proposed framework. Results show that significant degradation in retrieval performance occurs for region-based texture descriptors compared with frame-based approaches.
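    The many-to-one matching scheme described above can be sketched as follows: each query region is matched to its most similar region in the candidate image, and the per-region distances are averaged into an image-level distance. The descriptor layout here is hypothetical; the thesis combines colour, texture and the proposed region proximity feature:

        # Minimal sketch of many-to-one region matching: every query region
        # may match the same candidate region (similarity maximisation).
        import numpy as np

        def region_distance(r1: np.ndarray, r2: np.ndarray) -> float:
            return float(np.linalg.norm(r1 - r2))   # Euclidean over descriptors

        def image_distance(query_regions: np.ndarray,
                           cand_regions: np.ndarray) -> float:
            # for each query region, take its best (minimum-distance) match
            per_region = [min(region_distance(q, c) for c in cand_regions)
                          for q in query_regions]
            return sum(per_region) / len(per_region)

        rng = np.random.default_rng(1)
        query = rng.random((3, 16))        # 3 regions, 16-d descriptors
        candidate = rng.random((5, 16))    # 5 regions
        print(image_distance(query, candidate))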