17,146 research outputs found

    Practical Block-wise Neural Network Architecture Generation

    Full text link
    Convolutional neural networks have gained a remarkable success in computer vision. However, most usable network architectures are hand-crafted and usually require expertise and elaborate design. In this paper, we provide a block-wise network generation pipeline called BlockQNN which automatically builds high-performance networks using the Q-Learning paradigm with epsilon-greedy exploration strategy. The optimal network block is constructed by the learning agent which is trained sequentially to choose component layers. We stack the block to construct the whole auto-generated network. To accelerate the generation process, we also propose a distributed asynchronous framework and an early stop strategy. The block-wise generation brings unique advantages: (1) it performs competitive results in comparison to the hand-crafted state-of-the-art networks on image classification, additionally, the best network generated by BlockQNN achieves 3.54% top-1 error rate on CIFAR-10 which beats all existing auto-generate networks. (2) in the meanwhile, it offers tremendous reduction of the search space in designing networks which only spends 3 days with 32 GPUs, and (3) moreover, it has strong generalizability that the network built on CIFAR also performs well on a larger-scale ImageNet dataset.Comment: Accepted to CVPR 201

    BSML: A Binding Schema Markup Language for Data Interchange in Problem Solving Environments (PSEs)

    Full text link
    We describe a binding schema markup language (BSML) for describing data interchange between scientific codes. Such a facility is an important constituent of scientific problem solving environments (PSEs). BSML is designed to integrate with a PSE or application composition system that views model specification and execution as a problem of managing semistructured data. The data interchange problem is addressed by three techniques for processing semistructured data: validation, binding, and conversion. We present BSML and describe its application to a PSE for wireless communications system design

    PixColor: Pixel Recursive Colorization

    Full text link
    We propose a novel approach to automatically produce multiple colorized versions of a grayscale image. Our method results from the observation that the task of automated colorization is relatively easy given a low-resolution version of the color image. We first train a conditional PixelCNN to generate a low resolution color for a given grayscale image. Then, given the generated low-resolution color image and the original grayscale image as inputs, we train a second CNN to generate a high-resolution colorization of an image. We demonstrate that our approach produces more diverse and plausible colorizations than existing methods, as judged by human raters in a "Visual Turing Test"

    Cache-aware Performance Modeling and Prediction for Dense Linear Algebra

    Full text link
    Countless applications cast their computational core in terms of dense linear algebra operations. These operations can usually be implemented by combining the routines offered by standard linear algebra libraries such as BLAS and LAPACK, and typically each operation can be obtained in many alternative ways. Interestingly, identifying the fastest implementation -- without executing it -- is a challenging task even for experts. An equally challenging task is that of tuning each routine to performance-optimal configurations. Indeed, the problem is so difficult that even the default values provided by the libraries are often considerably suboptimal; as a solution, normally one has to resort to executing and timing the routines, driven by some form of parameter search. In this paper, we discuss a methodology to solve both problems: identifying the best performing algorithm within a family of alternatives, and tuning algorithmic parameters for maximum performance; in both cases, we do not execute the algorithms themselves. Instead, our methodology relies on timing and modeling the computational kernels underlying the algorithms, and on a technique for tracking the contents of the CPU cache. In general, our performance predictions allow us to tune dense linear algebra algorithms within few percents from the best attainable results, thus allowing computational scientists and code developers alike to efficiently optimize their linear algebra routines and codes.Comment: Submitted to PMBS1

    CU2CL: A CUDA-to-OpenCL Translator for Multi- and Many-core Architectures

    Get PDF
    The use of graphics processing units (GPUs) in high-performance parallel computing continues to become more prevalent, often as part of a heterogeneous system. For years, CUDA has been the de facto programming environment for nearly all general-purpose GPU (GPGPU) applications. In spite of this, the framework is available only on NVIDIA GPUs, traditionally requiring reimplementation in other frameworks in order to utilize additional multi- or many-core devices. On the other hand, OpenCL provides an open and vendorneutral programming environment and runtime system. With implementations available for CPUs, GPUs, and other types of accelerators, OpenCL therefore holds the promise of a “write once, run anywhere” ecosystem for heterogeneous computing. Given the many similarities between CUDA and OpenCL, manually porting a CUDA application to OpenCL is typically straightforward, albeit tedious and error-prone. In response to this issue, we created CU2CL, an automated CUDA-to- OpenCL source-to-source translator that possesses a novel design and clever reuse of the Clang compiler framework. Currently, the CU2CL translator covers the primary constructs found in CUDA runtime API, and we have successfully translated many applications from the CUDA SDK and Rodinia benchmark suite. The performance of our automatically translated applications via CU2CL is on par with their manually ported countparts

    A Massive Data Parallel Computational Framework for Petascale/Exascale Hybrid Computer Systems

    Full text link
    Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA and OpenCL it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge (a library based framework for heterogeneous multi-core systems), Zippy (a framework for parallel execution of codes on multiple GPUs), BSGP (a new programming language for general purpose computation on the GPU) and CUDA-lite (an enhancement to CUDA that transforms code based on annotations). In addition, efforts are underway to improve compiler tools for automatic parallelization and optimization of affine loop nests for GPUs and for automatic translation of OpenMP parallelized codes to CUDA. In this paper we present an alternative approach: a new computational framework for the development of massively data parallel scientific codes applications suitable for use on such petascale/exascale hybrid systems built upon the highly scalable Cactus framework. As the first non-trivial demonstration of its usefulness, we successfully developed a new 3D CFD code that achieves improved performance.Comment: Parallel Computing 2011 (ParCo2011), 30 August -- 2 September 2011, Ghent, Belgiu

    Formal Compiler Implementation in a Logical Framework

    Get PDF
    The task of designing and implementing a compiler can be a difficult and error-prone process. In this paper, we present a new approach based on the use of higher-order abstract syntax and term rewriting in a logical framework. All program transformations, from parsing to code generation, are cleanly isolated and specified as term rewrites. This has several advantages. The correctness of the compiler depends solely on a small set of rewrite rules that are written in the language of formal mathematics. In addition, the logical framework guarantees the preservation of scoping, and it automates many frequently-occurring tasks including substitution and rewriting strategies. As we show, compiler development in a logical framework can be easier than in a general-purpose language like ML, in part because of automation, and also because the framework provides extensive support for examination, validation, and debugging of the compiler transformations. The paper is organized around a case study, using the MetaPRL logical framework to compile an ML-like language to Intel x86 assembly. We also present a scoped formalization of x86 assembly in which all registers are immutable
    • …
    corecore