28 research outputs found
Exploring the structure of a real-time, arbitrary neural artistic stylization network
In this paper, we present a method which combines the flexibility of the
neural algorithm of artistic style with the speed of fast style transfer
networks to allow real-time stylization using any content/style image pair. We
build upon recent work leveraging conditional instance normalization for
multi-style transfer networks by learning to predict the conditional instance
normalization parameters directly from a style image. The model is successfully
trained on a corpus of roughly 80,000 paintings and is able to generalize to
paintings previously unobserved. We demonstrate that the learned embedding
space is smooth, contains rich structure, and organizes semantic
information associated with paintings in an entirely unsupervised manner.
Comment: Accepted as an oral presentation at the British Machine Vision Conference (BMVC) 201
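To make the core mechanism concrete, here is a minimal sketch of conditional instance normalization with the scale and offset predicted from a style image, as the abstract describes. The shapes, layers, and names are hypothetical stand-ins, not the paper's actual network.

```python
import tensorflow as tf

def conditional_instance_norm(content_feats, gamma, beta, eps=1e-5):
    # Per-sample, per-channel normalization over the spatial dimensions,
    # followed by the style-predicted scale (gamma) and offset (beta).
    mean, var = tf.nn.moments(content_feats, axes=[1, 2], keepdims=True)
    normalized = (content_feats - mean) / tf.sqrt(var + eps)
    return gamma[:, None, None, :] * normalized + beta[:, None, None, :]

channels = 64
# Toy style-prediction head standing in for the paper's style network:
# embed the style image, then predict one (gamma, beta) pair per channel.
style_net = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2 * channels),
])

style_image = tf.random.normal([1, 256, 256, 3])          # placeholder style input
content_feats = tf.random.normal([1, 64, 64, channels])   # placeholder content features

params = style_net(style_image)              # shape [N, 2*C]
gamma, beta = tf.split(params, 2, axis=-1)   # each of shape [N, C]
stylized_feats = conditional_instance_norm(content_feats, gamma, beta)
```

Because the normalization parameters are a learned function of the style image rather than a fixed per-style table, the same trained network can be applied to styles it has never seen, which is how the model generalizes to previously unobserved paintings.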
Streamroller: A Unified Compilation and Synthesis System for Streaming Applications.
The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a handheld device has prompted the need for low cost, low power, and high performance implementations of these applications in the form of custom
hardware. In a more mainstream domain like gaming consoles, the move towards more realism in physics simulations and graphics has forced the industry towards multicore systems. Many of the applications in these domains are streaming in nature. The key challenge is to get efficient implementations of custom hardware from these applications and map these applications efficiently onto multicore architectures.
This dissertation presents a unified methodology, referred to as Streamroller, that can be applied both to the problem of scheduling stream programs onto multicore architectures and to the problem of automatic synthesis of
custom hardware for stream applications. Firstly, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, such as
limited memory and explicit DMAs, are modeled in the scheduler. The scheduler is evaluated for a set of stream programs on IBM's Cell processor.
Secondly, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. The synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is the use of multifunction accelerators, which reduces cost by
efficiently sharing hardware between multiple kernels.
Finally, a method is presented to improve the integer linear program formulations used in the schedulers by exploiting symmetry in the solution space. Symmetry-breaking constraints are added to the formulation, and the resulting solver performance is evaluated.
Ph.D. Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61662/1/kvman_1.pd
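As a toy illustration of the symmetry-breaking idea in the preceding entry (hypothetical data and constraints, not the dissertation's formulation), the following PuLP sketch assigns actors to identical cores and adds constraints that rule out solutions differing only by a permutation of cores.

```python
import pulp

actors = list(range(6))
cores = list(range(3))
load = {0: 4, 1: 3, 2: 3, 3: 2, 4: 2, 5: 1}   # hypothetical work per actor

prob = pulp.LpProblem("actor_partitioning", pulp.LpMinimize)
# x[a][c] = 1 if actor a is mapped to core c
x = pulp.LpVariable.dicts("x", (actors, cores), cat="Binary")
makespan = pulp.LpVariable("makespan", lowBound=0)

prob += makespan                                        # minimize the most loaded core
for a in actors:
    prob += pulp.lpSum(x[a][c] for c in cores) == 1     # each actor on exactly one core
for c in cores:
    prob += pulp.lpSum(load[a] * x[a][c] for a in actors) <= makespan

# Symmetry breaking: identical cores are interchangeable, so restrict actor a
# to cores 0..a; this discards assignments that differ only by renaming cores.
for a in actors:
    for c in cores:
        if c > a:
            prob += x[a][c] == 0

prob.solve()
assignment = {a: next(c for c in cores if x[a][c].value() > 0.5) for a in actors}
```

Because the cores are interchangeable, every feasible assignment has many mirror images; the extra constraints keep at least one representative of each equivalence class while shrinking the space the solver must explore.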
Dynamic Control Flow in Large-Scale Machine Learning
Many recent machine learning models rely on fine-grained dynamic control flow
for training and inference. In particular, models based on recurrent neural
networks and on reinforcement learning depend on recurrence relations,
data-dependent conditional execution, and other features that call for dynamic
control flow. These applications benefit from the ability to make rapid
control-flow decisions across a set of computing devices in a distributed
system. For performance, scalability, and expressiveness, a machine learning
system must support dynamic control flow in distributed and heterogeneous
environments.
This paper presents a programming model for distributed machine learning that
supports dynamic control flow. We describe the design of the programming model,
and its implementation in TensorFlow, a distributed machine learning system.
Our approach extends the use of dataflow graphs to represent machine learning
models, offering several distinctive features. First, the branches of
conditionals and bodies of loops can be partitioned across many machines to run
on a set of heterogeneous devices, including CPUs, GPUs, and custom ASICs.
Second, programs written in our model support automatic differentiation and
distributed gradient computations, which are necessary for training machine
learning models that use control flow. Third, our choice of non-strict
semantics enables multiple loop iterations to execute in parallel across
machines, and to overlap compute and I/O operations.
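For context, the sketch below (a minimal example assuming TensorFlow 2.x, not code from the paper) shows the kind of data-dependent conditionals and loops this programming model expresses; tf.cond and tf.while_loop become control-flow operators in the dataflow graph rather than Python-level branching.

```python
import tensorflow as tf

@tf.function  # traces the Python code into a dataflow graph with control-flow ops
def iterate(x):
    # Data-dependent conditional: which branch runs depends on a runtime value.
    x = tf.cond(tf.reduce_sum(x) > 0.0,
                lambda: x * 2.0,
                lambda: -x)

    # Data-dependent loop: iterations continue until the norm exceeds a bound.
    def cond(i, v):
        return tf.norm(v) < 100.0

    def body(i, v):
        return i + 1, v + 1.0

    i, x = tf.while_loop(cond, body, loop_vars=(tf.constant(0), x))
    return i, x

steps, result = iterate(tf.constant([1.0, 2.0, 3.0]))
```

When such a graph is partitioned across devices, the branch and loop-termination decisions must be made in a distributed fashion, which is exactly the setting the paper addresses.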
We have done our work in the context of TensorFlow, and it has been used
extensively in research and production. We evaluate it using several real-world
applications, and demonstrate its performance and scalability.
Comment: Appeared in EuroSys 2018. 14 pages, 16 figures
Orchestrating the execution of stream programs on multicore platforms
While multicore hardware has become ubiquitous, explicitly parallel programming models and compiler techniques for exploiting parallelism on these systems have noticeably lagged behind. Stream programming is one model that has wide applicability in the multimedia, graphics, and signal processing domains. Streaming models execute as a set of independent actors that explicitly communicate data through channels. This paper presents a compiler technique for planning and orchestrating the execution of streaming applications on multicore platforms. An integrated unfolding and partitioning step based on integer linear programming is presented that unfolds data parallel actors as needed and maximally packs actors onto cores. Next, the actors are assigned to pipeline stages in such a way that all communication is maximally overlapped with computation on the cores. To facilitate experimentation, a generalized code generation template for mapping the software pipeline onto the Cell architecture is presented. For a range of streaming applications, a geometric mean speedup of 14.7x is achieved on a 16-core Cell platform compared to a single core.
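As a rough illustration of the pipeline-stage assignment step (a hypothetical graph and a deliberately simplified rule, not the paper's exact algorithm), the sketch below places each actor no earlier than its producers and inserts an extra stage on edges that cross cores so the data transfer can overlap with computation.

```python
from collections import defaultdict

# actor -> core chosen by a partitioning step (hypothetical mapping)
core = {"src": 0, "fir1": 0, "fir2": 1, "sum": 1, "sink": 0}
# dataflow edges: producer -> consumer
edges = [("src", "fir1"), ("src", "fir2"), ("fir1", "sum"),
         ("fir2", "sum"), ("sum", "sink")]

producers = defaultdict(list)
for u, v in edges:
    producers[v].append(u)

stage = {}
def assign(actor):
    """Stage = max over producers of producer stage + 1 (same core), or
    producer stage + 2 when the edge crosses cores and needs a DMA stage."""
    if actor in stage:
        return stage[actor]
    s = 0
    for p in producers[actor]:
        gap = 1 if core[p] == core[actor] else 2   # the extra stage hides the DMA
        s = max(s, assign(p) + gap)
    stage[actor] = s
    return s

for a in core:
    assign(a)
print(stage)   # {'src': 0, 'fir1': 1, 'fir2': 2, 'sum': 3, 'sink': 5}
```

In the paper's flow, this kind of stage assignment follows the ILP-based unfolding and partitioning step and feeds the code generation template for the Cell architecture.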
Performance analysis of methods that overcome false sharing effects in software DSMs
Page-based software DSMs experience high degrees of false sharing, especially in irregular applications with fine-grain sharing granularity. The overheads due to false sharing are considered to be a dominant factor limiting the performance of software DSMs. Several approaches have been proposed in the literature to reduce or eliminate false sharing. In this paper, we evaluate two of these approaches, viz., the Multiple Writer approach and the emulated fine grain sharing (EmFiGS) approach. Our evaluation strategy is two-pronged. First, we use an implementation-independent analysis that uses overhead counts to compare the different approaches. Our analysis shows that the benefits gained by eliminating false sharing are far outweighed by the performance penalty incurred due to the reduced exploitation of spatial locality in the EmFiGS approach. As a consequence, any implementation of the EmFiGS approach is likely to perform significantly worse than the Multiple Writer approach. Second, we use experimental evaluation to validate and complement our analysis. The experimental results match well with our analysis, and the execution times of the applications follow the same trend, reinforcing our conclusions. More specifically, the performance of the EmFiGS approach is significantly worse, by a factor of 1.5 to as much as 90 times, compared to the Multiple Writer approach. In many cases, the EmFiGS approach performs worse than even a single-writer lazy release protocol, which experiences very high overheads due to false sharing.
The performance of the EmFiGS approach remains worse than the Multiple Writer approach even after incorporating Tapeworm, a record-and-replay technique that fetches pages ahead of demand in an aggregated fashion, to alleviate the spatial locality effect. We next present the effect of asynchronous message handling on the performance of the different methods. Finally, we investigate the interplay between spatial locality exploitation and false sharing elimination at varying sharing granularities in the EmFiGS approach and report the tradeoffs.
Designing a Unified Programming Model for Heterogeneous Machines
While high-efficiency machines are increasingly embracing heterogeneous architectures and massive multithreading, contemporary mainstream programming languages reflect a mental model in which processing elements are homogeneous, concurrency is limited, and memory is a flat undifferentiated pool of storage. Moreover, the current state of the art in programming heterogeneous machines tends towards using separate programming models, such as OpenMP and CUDA, for different portions of the machine. Both of these factors make programming emerging heterogeneous machines unnecessarily difficult. We describe the design of the Phalanx programming model, which seeks to provide a unified programming model for heterogeneous machines. It provides constructs for bulk parallelism, synchronization, and data placement which operate across the entire machine. Our prototype implementation is able to launch and coordinate work on both CPU and GPU processors within a single node, and by leveraging the GASNet runtime, is able to run across all the nodes of a distributed-memory machine.
Streamroller: Automatic synthesis of prescribed throughput accelerator pipelines
In this paper, we present a methodology for designing a pipeline of accelerators for an application. The application is modeled using sequential C language with simple stylizations. The synthesis of the accelerator pipeline involves designing loop accelerators for individual kernels, instantiating buffers for arrays used in the application, and hooking up these building blocks to form a pipeline. A compiler-based system automatically synthesizes loop accelerators for individual kernels at varying performance levels. An integer linear program formulation which simultaneously optimizes the cost of loop accelerators and the cost of memory buffers is proposed to compose the loop accelerators into an accelerator pipeline for the whole application. Case studies for several applications, including FMRadio and Beamformer, are presented to illustrate our design methodology. Experiments show that significant cost savings are achieved through hardware sharing while meeting the prescribed throughput requirements.
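To make the composition step concrete, here is a simplified sketch (hypothetical kernels, costs, and initiation intervals; buffer sizing is omitted) of an ILP in PuLP that picks one loop-accelerator design point per kernel so every stage sustains the prescribed throughput at minimum total cost.

```python
import pulp

target_ii = 8  # prescribed throughput: one input set every 8 cycles (hypothetical)

# kernel -> candidate loop-accelerator designs as (cost, achieved initiation interval)
designs = {
    "fir":  [(100, 16), (180, 8), (320, 4)],
    "fft":  [(250, 12), (400, 8), (700, 4)],
    "sink": [(60, 8), (90, 4)],
}

prob = pulp.LpProblem("accelerator_pipeline", pulp.LpMinimize)
pick = {
    (k, i): pulp.LpVariable(f"pick_{k}_{i}", cat="Binary")
    for k, cands in designs.items() for i in range(len(cands))
}

# Minimize the total hardware cost of the selected designs.
prob += pulp.lpSum(designs[k][i][0] * pick[(k, i)] for (k, i) in pick)

for k, cands in designs.items():
    # Exactly one design per kernel.
    prob += pulp.lpSum(pick[(k, i)] for i in range(len(cands))) == 1
    # The selected design must sustain the pipeline's prescribed II.
    for i, (cost, ii) in enumerate(cands):
        if ii > target_ii:
            prob += pick[(k, i)] == 0

prob.solve()
chosen = {k: next(i for i in range(len(designs[k])) if pick[(k, i)].value() > 0.5)
          for k in designs}
```

In the paper's formulation, buffer costs couple the kernels' choices, which is why a joint ILP rather than an independent per-kernel selection is needed.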