14,811 research outputs found

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

    Full text link
    From social networks to language modeling, the growing scale and importance of graph data has driven the development of numerous new graph-parallel systems (e.g., Pregel, GraphLab). By restricting the computation that can be expressed and introducing new techniques to partition and distribute the graph, these systems can efficiently execute iterative graph algorithms orders of magnitude faster than more general data-parallel systems. However, the same restrictions that enable the performance gains also make it difficult to express many of the important stages in a typical graph-analytics pipeline: constructing the graph, modifying its structure, or expressing computation that spans multiple graphs. As a consequence, existing graph analytics pipelines compose graph-parallel and data-parallel systems using external storage systems, leading to extensive data movement and complicated programming model. To address these challenges we introduce GraphX, a distributed graph computation framework that unifies graph-parallel and data-parallel computation. GraphX provides a small, core set of graph-parallel operators expressive enough to implement the Pregel and PowerGraph abstractions, yet simple enough to be cast in relational algebra. GraphX uses a collection of query optimization techniques such as automatic join rewrites to efficiently implement these graph-parallel operators. We evaluate GraphX on real-world graphs and workloads and demonstrate that GraphX achieves comparable performance as specialized graph computation systems, while outperforming them in end-to-end graph pipelines. Moreover, GraphX achieves a balance between expressiveness, performance, and ease of use

    Dynamic Programming on Nominal Graphs

    Get PDF
    Many optimization problems can be naturally represented as (hyper) graphs, where vertices correspond to variables and edges to tasks, whose cost depends on the values of the adjacent variables. Capitalizing on the structure of the graph, suitable dynamic programming strategies can select certain orders of evaluation of the variables which guarantee to reach both an optimal solution and a minimal size of the tables computed in the optimization process. In this paper we introduce a simple algebraic specification with parallel composition and restriction whose terms up to structural axioms are the graphs mentioned above. In addition, free (unrestricted) vertices are labelled with variables, and the specification includes operations of name permutation with finite support. We show a correspondence between the well-known tree decompositions of graphs and our terms. If an axiom of scope extension is dropped, several (hierarchical) terms actually correspond to the same graph. A suitable graphical structure can be found, corresponding to every hierarchical term. Evaluating such a graphical structure in some target algebra yields a dynamic programming strategy. If the target algebra satisfies the scope extension axiom, then the result does not depend on the particular structure, but only on the original graph. We apply our approach to the parking optimization problem developed in the ASCENS e-mobility case study, in collaboration with Volkswagen. Dynamic programming evaluations are particularly interesting for autonomic systems, where actual behavior often consists of propagating local knowledge to obtain global knowledge and getting it back for local decisions.Comment: In Proceedings GaM 2015, arXiv:1504.0244

    Choreographies in Practice

    Full text link
    Choreographic Programming is a development methodology for concurrent software that guarantees correctness by construction. The key to this paradigm is to disallow mismatched I/O operations in programs, called choreographies, and then mechanically synthesise distributed implementations in terms of standard process models via a mechanism known as EndPoint Projection (EPP). Despite the promise of choreographic programming, there is still a lack of practical evaluations that illustrate the applicability of choreographies to concrete computational problems with standard concurrent solutions. In this work, we explore the potential of choreographies by using Procedural Choreographies (PC), a model that we recently proposed, to write distributed algorithms for sorting (Quicksort), solving linear equations (Gaussian elimination), and computing Fast Fourier Transform. We discuss the lessons learned from this experiment, giving possible directions for the usage and future improvements of choreography languages

    Machine Learning at Microsoft with ML .NET

    Full text link
    Machine Learning is transitioning from an art and science into a technology available to every developer. In the near future, every application on every platform will incorporate trained models to encode data-based decisions that would be impossible for developers to author. This presents a significant engineering challenge, since currently data science and modeling are largely decoupled from standard software development processes. This separation makes incorporating machine learning capabilities inside applications unnecessarily costly and difficult, and furthermore discourage developers from embracing ML in first place. In this paper we present ML .NET, a framework developed at Microsoft over the last decade in response to the challenge of making it easy to ship machine learning models in large software applications. We present its architecture, and illuminate the application demands that shaped it. Specifically, we introduce DataView, the core data abstraction of ML .NET which allows it to capture full predictive pipelines efficiently and consistently across training and inference lifecycles. We close the paper with a surprisingly favorable performance study of ML .NET compared to more recent entrants, and a discussion of some lessons learned

    Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information.

    Get PDF
    The current ubiquity of multi-core processors has brought renewed interest in program parallelization. Logic programs allow studying the parallelization of programs with complex, dynamic data structures with (declarative) pointers in a comparatively simple semantic setting. In this context, automatic parallelizers which exploit and-parallelism rely on notions of independence in order to ensure certain efficiency properties. “Non-strict” independence is a more relaxed notion than the traditional notion of “strict” independence which still ensures the relevant efficiency properties and can allow considerable more parallelism. Non-strict independence cannot be determined solely at run-time (“a priori”) and thus global analysis is a requirement. However, extracting non-strict independence information from available analyses and domains is non-trivial. This paper provides on one hand an extended presentation of our classic techniques for compile-time detection of non-strict independence based on extracting information from (abstract interpretation-based) analyses using the now well understood and popular Sharing + Freeness domain. This includes algorithms for combined compile-time/run-time detection which involve special run-time checks for this type of parallelism. In addition, we propose herein novel annotation (parallelization) algorithms, URLP and CRLP, which are specially suited to non-strict independence. We also propose new ways of using the Sharing + Freeness information to optimize how the run-time environments of goals are kept apart during parallel execution. Finally, we also describe the implementation of these techniques in our parallelizing compiler and recall some early performance results. We provide as well an extended description of our pictorial representation of sharing and freeness information

    Near real-time stereo vision system

    Get PDF
    The apparatus for a near real-time stereo vision system for use with a robotic vehicle is described. The system is comprised of two cameras mounted on three-axis rotation platforms, image-processing boards, a CPU, and specialized stereo vision algorithms. Bandpass-filtered image pyramids are computed, stereo matching is performed by least-squares correlation, and confidence ranges are estimated by means of Bayes' theorem. In particular, Laplacian image pyramids are built and disparity maps are produced from the 60 x 64 level of the pyramids at rates of up to 2 seconds per image pair. The first autonomous cross-country robotic traverses (of up to 100 meters) have been achieved using the stereo vision system of the present invention with all computing done onboard the vehicle. The overall approach disclosed herein provides a unifying paradigm for practical domain-independent stereo ranging
    corecore