108 research outputs found

    Visual Storytelling: Captioning of Image Sequences

    Within automated captioning, visual storytelling is one dimension of the task space. Given sequences of images as inputs, visual storytelling (VIST) is the automatic generation of textual narratives as outputs. Automatically producing stories for an ordered sequence of pictures or video frames has several potential applications in diverse domains ranging from multimedia consumption to autonomous systems. The task has evolved over recent years and is moving into adolescence. The availability of a dedicated VIST dataset has mainstreamed research on visual storytelling and related sub-tasks. This thesis systematically reports the developments of standard captioning as a parent task, with accompanying facets like dense captioning, and gradually delves into the domain of visual storytelling. Existing models proposed for VIST are described by examining their respective characteristics and scope. All the methods for VIST adapt the typical encoder-decoder design, owing to its success in addressing the standard image captioning task. Several subtle differences in the underlying intentions of these methods for approaching VIST are subsequently summarized. Additionally, alternate perspectives on the existing approaches are explored by re-modeling and modifying their learning mechanisms. Experiments with different objective functions are reported with subjective comparisons and relevant results. Finally, the sub-field of character relationships within storytelling is studied, and a novel idea called character-centric storytelling is proposed to account for prospective characters in the extent of data modalities.

    Visual attribute discovery and analyses from Web data

    Visual attributes are important for describing and understanding an object’s appearance. For an object classification or recognition task, an algorithm needs to infer the visual attributes of an object to compare, categorize or recognize the objects. In a zero-shot learning scenario, the algorithm depends on the visual attributes to describe an unknown object since the training samples are not available. Because different object categories usually share some common attributes (e.g. many animals have four legs, a tail and fur), the act of explicitly modeling attributes not only allows previously learnt attributes to be transferred to a novel category but also reduces the number of training samples for the new category which can be important when the number of training samples is limited. Even though larger numbers of visual attributes help the algorithm to better describe an image, they also require a larger set of training data. In the supervised scenario, data collection can be both a costly and time-consuming process. To mitigate the data collection costs, this dissertation exploits the weakly-supervised data from the Web in order to construct computational methodologies for the discovery of visual attributes, as well as an analysis across time and domains. This dissertation first presents an automatic approach to learning hundreds of visual attributes from the open-world vocabulary on the Web using a convolutional neural network. The proposed method tries to understand visual attributes in terms of perception inside deep neural networks. By focusing on the analysis of neural activations, the system can identify the degree to which an attribute can be visually perceptible and can localize the visual attributes in an image. Moreover, the approach exploits the layered structure of the deep model to determine the semantic depth of the attributes. 
Beyond visual attribute discovery, this dissertation explores how visual styles (i.e., attributes that correspond to multiple visual concepts) change across time. These are referred to as visual trends. To this end, this dissertation introduces several deep neural networks for estimating when objects were made, together with analyses of the neural activations and their degree of entropy to gain insights into the deep network. To use the historical object dating frameworks in real-world applications, they are applied to analyze the influence of vintage fashion on runway collections, as well as to analyze the influence of fashion on runway collections and on street fashion. Finally, this dissertation introduces an approach to recognizing and transferring visual attributes across domains in a realistic manner. Given two input images from two different domains, 1) a shopping image and 2) a scene image, this dissertation proposes a generative adversarial network for transferring the product pixels from the shopping image to the scene image such that 1) the output image looks realistic and 2) the visual attributes of the product are preserved. In summary, this dissertation utilizes weakly-supervised data from the Web for visual attribute discovery and for analysis across time and domains. Beyond the novel computational methodology for each problem, this dissertation demonstrates that the proposed approaches can be applied to many real-world applications such as dating historical objects, visual trend prediction and analysis, cross-domain image label transfer, and cross-domain pixel transfer for home decoration, among others. Doctor of Philosoph

    Identifying Restaurants Proposing Novel Kinds of Cuisines: Using Yelp Reviews

    These days, with TV shows and starred chefs, new kinds of cuisines appear in the market. The main cuisines like French, Italian, Japanese, Chinese and Indian are always appreciated, but they are no longer the most popular. The new trend is fusion cuisine, which is obtained by combining different main cuisines. The opening of a new restaurant proposing new kinds of cuisine produces a lot of excitement in people. They feel the need to try it and be part of this new culture. Yelp is a platform which publishes crowd-sourced reviews about different businesses, in particular restaurants. For some restaurants on Yelp, if the kind of cuisine is available, there is usually a tag only for the main cuisines, but no information about fusion cuisine. There is a need to develop a system which is able to identify restaurants proposing fusion cuisine (novel or unknown cuisines). This proposal addresses the novelty detection task using Yelp reviews. The idea is that semi-supervised Machine Learning models trained only on the reviews of restaurants proposing the main cuisines will be able to discriminate between restaurants providing the main cuisines and restaurants providing the novel ones. We propose effective novelty detection approaches for the unknown cuisine type identification problem using Long Short-Term Memory (LSTM), autoencoder and Term Frequency-Inverse Document Frequency (TF-IDF). Our main idea is to obtain features from LSTM, autoencoder and TF-IDF and use these features with standard semi-supervised novelty detection algorithms like Gaussian Mixture Model, Isolation Forest and One-class Support Vector Machines (SVM) to identify the unknown cuisines. We conducted extensive experiments that demonstrate the effectiveness of our approaches. The score that we obtained has a very high discrimination power: the best value of AUROC for the novelty detection problem is 0.85, from LSTM.
LSTM outperforms our baseline model of TF-IDF, and the main reason is its ability to retain only the useful parts of a sentence.

    Deep learning in medical imaging and radiation therapy

    Peer Reviewed
    https://deepblue.lib.umich.edu/bitstream/2027.42/146980/1/mp13264_am.pdf
    https://deepblue.lib.umich.edu/bitstream/2027.42/146980/2/mp13264.pd

    Multimedia Forensics

    This book is open access. Media forensics has never been more relevant to societal life. Not only does media content represent an ever-increasing share of the data traveling on the net and the preferred means of communication for most users, it has also become an integral part of the most innovative applications in the digital information ecosystem that serves various sectors of society, from entertainment to journalism to politics. Undoubtedly, the advances in deep learning and computational imaging contributed significantly to this outcome. The underlying technologies that drive this trend, however, also pose a profound challenge in establishing trust in what we see, hear, and read, and make media content the preferred target of malicious attacks. In this new threat landscape, powered by innovative imaging technologies and sophisticated tools based on autoencoders and generative adversarial networks, this book fills an important gap. It presents a comprehensive review of state-of-the-art forensics capabilities that relate to media attribution, integrity and authenticity verification, and counter forensics. Its content is developed to provide practitioners, researchers, photo and video enthusiasts, and students a holistic view of the field.