963 research outputs found

    Learning Convolutional Text Representations for Visual Question Answering

    Full text link
    Visual question answering is a recently proposed artificial intelligence task that requires a deep understanding of both images and texts. In deep learning, images are typically modeled through convolutional neural networks, and texts are typically modeled through recurrent neural networks. While the requirement for modeling images is similar to traditional computer vision tasks, such as object recognition and image classification, visual question answering raises a different need for textual representation as compared to other natural language processing tasks. In this work, we perform a detailed analysis on natural language questions in visual question answering. Based on the analysis, we propose to rely on convolutional neural networks for learning textual representations. By exploring the various properties of convolutional neural networks specialized for text data, such as width and depth, we present our "CNN Inception + Gate" model. We show that our model improves question representations and thus the overall accuracy of visual question answering models. We also show that the text representation requirement in visual question answering is more complicated and comprehensive than that in conventional natural language processing tasks, making it a better task to evaluate textual representation methods. Shallow models like fastText, which can obtain comparable results with deep learning models in tasks like text classification, are not suitable in visual question answering.Comment: Conference paper at SDM 2018. https://github.com/divelab/sva

    A structural representation for understanding line-drawing images

    Get PDF
    International audienceIn this paper, we are concerned with the problem of finding a good and homogeneous representation to encode line-drawing documents (which may be handwritten). We propose a method in which the problems induced by a first-step skeletonization have been avoided. First, we vectorize the image, to get a fine description of the drawing, using only vectors and quadrilateral primitives. A structural graph is built with the primitives extracted from the initial line-drawing image. The objective is to manage attributes relative to elementary objects so as to provide a description of the spatial relationships (inclusion, junction, intersection, etc.) that exist between the graphics in the images. This is done with a representation that provides a global vision of the drawings. The capacity of the representation to evolve and to carry highly semantic information is also highlighted. Finally, we show how an architecture using this structural representation and a mechanism of perceptive cycles can lead to a high-quality interpretation of line drawings

    Accelerating Pattern Matching in Neuromorphic Text Recognition System Using Intel Xeon Phi Coprocessor

    Get PDF
    Neuromorphic computing systems refer to the computing architecture inspired by the working mechanism of human brains. The rapidly reducing cost and increasing performance of state-of-the-art computing hardware allows large-scale implementation of machine intelligence models with neuromorphic architectures and opens the opportunity for new applications. One such computing hardware is Intel Xeon Phi coprocessor, which delivers over a TeraFLOP of computing power with 61 integrated processing cores. How to efficiently harness such computing power to achieve real time decision and cognition is one of the key design considerations. This work presents an optimized implementation of Brain-State-in-a-Box (BSB) neural network model on the Xeon Phi coprocessor for pattern matching in the context of intelligent text recognition of noisy document images. From a scalability standpoint on a High Performance Computing (HPC) platform we show that efficient workload partitioning and resource management can double the performance of this many-core architecture for neuromorphic applications

    Computing and Displaying Isosurfaces in R

    Get PDF
    This paper presents R utilities for computing and displaying isosurfaces, or three-dimensional contour surfaces, from a three-dimensional array of function values. A version of the marching cubes algorithm that takes into account face and internal ambiguities is used to compute the isosurfaces. Vectorization is used to ensure adequate performance using only R code. Examples are presented showing contours of theoretical densities, density estimates, and medical imaging data. Rendering can use the rgl package or standard or grid graphics, and a set of tools for representing and rendering surfaces using standard or grid graphics is presented.

    One-to-many face recognition with bilinear CNNs

    Full text link
    The recent explosive growth in convolutional neural network (CNN) research has produced a variety of new architectures for deep learning. One intriguing new architecture is the bilinear CNN (B-CNN), which has shown dramatic performance gains on certain fine-grained recognition problems [15]. We apply this new CNN to the challenging new face recognition benchmark, the IARPA Janus Benchmark A (IJB-A) [12]. It features faces from a large number of identities in challenging real-world conditions. Because the face images were not identified automatically using a computerized face detection system, it does not have the bias inherent in such a database. We demonstrate the performance of the B-CNN model beginning from an AlexNet-style network pre-trained on ImageNet. We then show results for fine-tuning using a moderate-sized and public external database, FaceScrub [17]. We also present results with additional fine-tuning on the limited training data provided by the protocol. In each case, the fine-tuned bilinear model shows substantial improvements over the standard CNN. Finally, we demonstrate how a standard CNN pre-trained on a large face database, the recently released VGG-Face model [20], can be converted into a B-CNN without any additional feature training. This B-CNN improves upon the CNN performance on the IJB-A benchmark, achieving 89.5% rank-1 recall.Comment: Published version at WACV 201

    VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models

    Full text link
    Diffusion models have shown impressive results in text-to-image synthesis. Using massive datasets of captioned images, diffusion models learn to generate raster images of highly diverse objects and scenes. However, designers frequently use vector representations of images like Scalable Vector Graphics (SVGs) for digital icons or art. Vector graphics can be scaled to any size, and are compact. We show that a text-conditioned diffusion model trained on pixel representations of images can be used to generate SVG-exportable vector graphics. We do so without access to large datasets of captioned SVGs. By optimizing a differentiable vector graphics rasterizer, our method, VectorFusion, distills abstract semantic knowledge out of a pretrained diffusion model. Inspired by recent text-to-3D work, we learn an SVG consistent with a caption using Score Distillation Sampling. To accelerate generation and improve fidelity, VectorFusion also initializes from an image sample. Experiments show greater quality than prior work, and demonstrate a range of styles including pixel art and sketches. See our project webpage at https://ajayj.com/vectorfusion .Comment: Project webpage: https://ajayj.com/vectorfusio
    • …
    corecore