5,745 research outputs found
Microgrid - The microthreaded many-core architecture
Traditional processors use the von Neumann execution model, some other
processors in the past have used the dataflow execution model. A combination of
von Neuman model and dataflow model is also tried in the past and the resultant
model is referred as hybrid dataflow execution model. We describe a hybrid
dataflow model known as the microthreading. It provides constructs for
creation, synchronization and communication between threads in an intermediate
language. The microthreading model is an abstract programming and machine model
for many-core architecture. A particular instance of this model is named as the
microthreaded architecture or the Microgrid. This architecture implements all
the concurrency constructs of the microthreading model in the hardware with the
management of these constructs in the hardware.Comment: 30 pages, 16 figure
Accelerating sequential programs using FastFlow and self-offloading
FastFlow is a programming environment specifically targeting cache-coherent
shared-memory multi-cores. FastFlow is implemented as a stack of C++ template
libraries built on top of lock-free (fence-free) synchronization mechanisms. In
this paper we present a further evolution of FastFlow enabling programmers to
offload part of their workload on a dynamically created software accelerator
running on unused CPUs. The offloaded function can be easily derived from
pre-existing sequential code. We emphasize in particular the effective
trade-off between human productivity and execution efficiency of the approach.Comment: 17 pages + cove
Ringo: Interactive Graph Analytics on Big-Memory Machines
We present Ringo, a system for analysis of large graphs. Graphs provide a way
to represent and analyze systems of interacting objects (people, proteins,
webpages) with edges between the objects denoting interactions (friendships,
physical interactions, links). Mining graphs provides valuable insights about
individual objects as well as the relationships among them.
In building Ringo, we take advantage of the fact that machines with large
memory and many cores are widely available and also relatively affordable. This
allows us to build an easy-to-use interactive high-performance graph analytics
system. Graphs also need to be built from input data, which often resides in
the form of relational tables. Thus, Ringo provides rich functionality for
manipulating raw input data tables into various kinds of graphs. Furthermore,
Ringo also provides over 200 graph analytics functions that can then be applied
to constructed graphs.
We show that a single big-memory machine provides a very attractive platform
for performing analytics on all but the largest graphs as it offers excellent
performance and ease of use as compared to alternative approaches. With Ringo,
we also demonstrate how to integrate graph analytics with an iterative process
of trial-and-error data exploration and rapid experimentation, common in data
mining workloads.Comment: 6 pages, 2 figure
FLICK: developing and running application-specific network services
Data centre networks are increasingly programmable, with application-specific network services proliferating, from custom load-balancers to middleboxes providing caching and aggregation. Developers must currently implement these services using traditional low-level APIs, which neither support natural operations on application data nor provide efficient performance isolation. We describe FLICK, a framework for the programming and execution of application-specific network services on multi-core CPUs. Developers write network services in the FLICK language, which offers high-level processing constructs and application-relevant data types. FLICK programs are translated automatically to efficient, parallel task graphs, implemented in C++ on top of a user-space TCP stack. Task graphs have bounded resource usage at runtime, which means that the graphs of multiple services can execute concurrently without interference using cooperative scheduling. We evaluate FLICK with several services (an HTTP load-balancer, a Memcached router and a Hadoop data aggregator), showing that it achieves good performance while reducing development effort
S-Net for multi-memory multicores
Copyright ACM, 2010. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in Proceedings of the 5th ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming: http://doi.acm.org/10.1145/1708046.1708054S-Net is a declarative coordination language and component technology aimed at modern multi-core/many-core architectures and systems-on-chip. It builds on the concept of stream processing to structure dynamically evolving networks of communicating asynchronous components. Components themselves are implemented using a conventional language suitable for the application domain. This two-level software architecture maintains a familiar sequential development environment for large parts of an application and offers a high-level declarative approach to component coordination. In this paper we present a conservative language extension for the placement of components and component networks in a multi-memory environment, i.e. architectures that associate individual compute cores or groups thereof with private memories. We describe a novel distributed runtime system layer that complements our existing multithreaded runtime system for shared memory multicores. Particular emphasis is put on efficient management of data communication. Last not least, we present preliminary experimental data
HSTREAM: A directive-based language extension for heterogeneous stream computing
Big data streaming applications require utilization of heterogeneous parallel
computing systems, which may comprise multiple multi-core CPUs and many-core
accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such
systems require advanced knowledge of several hardware architectures and
device-specific programming models, including OpenMP and CUDA. In this paper,
we present HSTREAM, a compiler directive-based language extension to support
programming stream computing applications for heterogeneous parallel computing
systems. HSTREAM source-to-source compiler aims to increase the programming
productivity by enabling programmers to annotate the parallel regions for
heterogeneous execution and generate target specific code. The HSTREAM runtime
automatically distributes the workload across CPUs and accelerating devices. We
demonstrate the usefulness of HSTREAM language extension with various
applications from the STREAM benchmark. Experimental evaluation results show
that HSTREAM can keep the same programming simplicity as OpenMP, and the
generated code can deliver performance beyond what CPUs-only and GPUs-only
executions can deliver.Comment: Preprint, 21st IEEE International Conference on Computational Science
and Engineering (CSE 2018
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to in- crease parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.Comment: 11 page
- …