3,833 research outputs found

    Unconstrained Scene Text and Video Text Recognition for Arabic Script

    Full text link
    Building robust recognizers for Arabic has always been challenging. We demonstrate the effectiveness of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. We outperform previous state-of-the-art on two publicly available video text datasets - ALIF and ACTIV. For the scene text recognition task, we introduce a new Arabic scene text dataset and establish baseline results. For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantity of annotated data. We overcome this by synthesising millions of Arabic text images from a large vocabulary of Arabic words and phrases. Our implementation is built on top of the model introduced here [37] which is proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence to sequence transcription approach. The network transcribes a sequence of convolutional features from the input image to a sequence of target labels. This does away with the need for segmenting input image into constituent characters/glyphs, which is often difficult for Arabic script. Further, the ability of RNNs to model contextual dependencies yields superior recognition results.Comment: 5 page

    A Deep Siamese Network for Scene Detection in Broadcast Videos

    Get PDF
    We present a model that automatically divides broadcast videos into coherent scenes by learning a distance measure between shots. Experiments are performed to demonstrate the effectiveness of our approach by comparing our algorithm against recent proposals for automatic scene segmentation. We also propose an improved performance measure that aims to reduce the gap between numerical evaluation and expected results, and propose and release a new benchmark dataset.Comment: ACM Multimedia 201

    Hierarchical Recurrent Neural Network for Story Segmentation

    Get PDF

    Multimodal Content Analysis for Effective Advertisements on YouTube

    Full text link
    The rapid advances in e-commerce and Web 2.0 technologies have greatly increased the impact of commercial advertisements on the general public. As a key enabling technology, a multitude of recommender systems exists which analyzes user features and browsing patterns to recommend appealing advertisements to users. In this work, we seek to study the characteristics or attributes that characterize an effective advertisement and recommend a useful set of features to aid the designing and production processes of commercial advertisements. We analyze the temporal patterns from multimedia content of advertisement videos including auditory, visual and textual components, and study their individual roles and synergies in the success of an advertisement. The objective of this work is then to measure the effectiveness of an advertisement, and to recommend a useful set of features to advertisement designers to make it more successful and approachable to users. Our proposed framework employs the signal processing technique of cross modality feature learning where data streams from different components are employed to train separate neural network models and are then fused together to learn a shared representation. Subsequently, a neural network model trained on this joint feature embedding representation is utilized as a classifier to predict advertisement effectiveness. We validate our approach using subjective ratings from a dedicated user study, the sentiment strength of online viewer comments, and a viewer opinion metric of the ratio of the Likes and Views received by each advertisement from an online platform.Comment: 11 pages, 5 figures, ICDM 201

    An Ensemble Model with Ranking for Social Dialogue

    Full text link
    Open-domain social dialogue is one of the long-standing goals of Artificial Intelligence. This year, the Amazon Alexa Prize challenge was announced for the first time, where real customers get to rate systems developed by leading universities worldwide. The aim of the challenge is to converse "coherently and engagingly with humans on popular topics for 20 minutes". We describe our Alexa Prize system (called 'Alana') consisting of an ensemble of bots, combining rule-based and machine learning systems, and using a contextual ranking mechanism to choose a system response. The ranker was trained on real user feedback received during the competition, where we address the problem of how to train on the noisy and sparse feedback obtained during the competition.Comment: NIPS 2017 Workshop on Conversational A

    Deep Learning for Semantic Video Understanding

    Get PDF
    The field of computer vision has long strived to extract understanding from images and videos sequences. The recent flood of video data along with massive increments in computing power have provided the perfect environment to generate advanced research to extract intelligence from video data. Video data is ubiquitous, occurring in numerous everyday activities such as surveillance, traffic, movies, sports, etc. This massive amount of video needs to be analyzed and processed efficiently to extract semantic features towards video understanding. Such capabilities could benefit surveillance, video analytics and visually challenged people. While watching a long video, humans have the uncanny ability to bypass unnecessary information and concentrate on the important events. These key events can be used as a higher-level description or summary of a long video. Inspired by the human visual cortex, this research affords such abilities in computers using neural networks. Useful or interesting events are first extracted from a video and then deep learning methodologies are used to extract natural language summaries for each video sequence. Previous approaches of video description either have been domain specific or use a template based approach to fill detected objects such as verbs or actions to constitute a grammatically correct sentence. This work involves exploiting temporal contextual information for sentence generation while working on wide domain datasets. Current state-of- the-art video description methodologies are well suited for small video clips whereas this research can also be applied to long sequences of video. This work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. End to end video summarization immensely depends on abstractive summarization of video descriptions. State-of- the-art neural language & attention joint models have been used to generate textual summaries. Interesting segments of long video are extracted based on image quality as well as cinematographic and consumer preference. This novel approach will be a stepping stone for a variety of innovative applications such as video retrieval, automatic summarization for visually impaired persons, automatic movie review generation, video question and answering systems
    corecore