8,943 research outputs found
Experiences with porting and modelling wavefront algorithms on many-core architectures
We are currently investigating the viability of many-core architectures for the acceleration of wavefront applications and this report focuses on graphics processing units (GPUs) in particular. To this end, we have implemented NASA’s LU benchmark – a real world production-grade application – on GPUs employing NVIDIA’s Compute Unified Device Architecture (CUDA).
This GPU implementation of the benchmark has been used to investigate the performance of a selection of GPUs, ranging from workstation-grade commodity GPUs to the HPC "Tesla” and "Fermi” GPUs. We have also compared the performance of the GPU solution at scale to that of traditional high perfor- mance computing (HPC) clusters based on a range of multi- core CPUs from a number of major vendors, including Intel (Nehalem), AMD (Opteron) and IBM (PowerPC).
In previous work we have developed a predictive “plug-and-play” performance model of this class of application running on such clusters, in which CPUs communicate via the Message Passing Interface (MPI). By extending this model to also capture the performance behaviour of GPUs, we are able to: (1) comment on the effects that architectural changes will have on the performance of single-GPU solutions, and (2) make projections regarding the performance of multi-GPU solutions at larger scale
Towards Loosely-Coupled Programming on Petascale Systems
We have extended the Falkon lightweight task execution framework to make
loosely coupled programming on petascale systems a practical and useful
programming model. This work studies and measures the performance factors
involved in applying this approach to enable the use of petascale systems by a
broader user community, and with greater ease. Our work enables the execution
of highly parallel computations composed of loosely coupled serial jobs with no
modifications to the respective applications. This approach allows a new-and
potentially far larger-class of applications to leverage petascale systems,
such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O
performance encountered in making this model practical, and show results using
both microbenchmarks and real applications from two domains: economic energy
modeling and molecular dynamics. Our benchmarks show that we can scale up to
160K processor-cores with high efficiency, and can achieve sustained execution
rates of thousands of tasks per second.Comment: IEEE/ACM International Conference for High Performance Computing,
Networking, Storage and Analysis (SuperComputing/SC) 200
Model-driven Scheduling for Distributed Stream Processing Systems
Distributed Stream Processing frameworks are being commonly used with the
evolution of Internet of Things(IoT). These frameworks are designed to adapt to
the dynamic input message rate by scaling in/out.Apache Storm, originally
developed by Twitter is a widely used stream processing engine while others
includes Flink, Spark streaming. For running the streaming applications
successfully there is need to know the optimal resource requirement, as
over-estimation of resources adds extra cost.So we need some strategy to come
up with the optimal resource requirement for a given streaming application. In
this article, we propose a model-driven approach for scheduling streaming
applications that effectively utilizes a priori knowledge of the applications
to provide predictable scheduling behavior. Specifically, we use application
performance models to offer reliable estimates of the resource allocation
required. Further, this intuition also drives resource mapping, and helps
narrow the estimated and actual dataflow performance and resource utilization.
Together, this model-driven scheduling approach gives a predictable application
performance and resource utilization behavior for executing a given DSPS
application at a target input stream rate on distributed resources.Comment: 54 page
System-Level Modeling, Analysis and Code Generation: Object Recognition Case Study
International audienceOne of the most important challenges in complex embedded systems design is developing methods and tools for modeling and analyzing the behavior of application software running on multi-processor platforms. We propose a tool-supported flow for systematic and compositional construction of mixed software/hardware system models. These models are intended to represent, in an operational way, the set of timed executions of parallel application software statically mapped on a multi-processor platform. As such, system models will be used for performance analysis using simulation-based techniques as well as for code generation on specific platforms. The construction of the system model proceeds in two steps. In the first step, an abstract system model is obtained by composition and specific transformations of (1) the (untimed) model of the application software, (2) the model of the platform and (3) the mapping between them. In the second step, the abstract system model is refined into concrete system model, by including specific timing constraints for execution of the application software, according to chosen mapping on the platform. We illustrate the system model construction method and its use for performance analysis and code generation on an object recognition application provided by Hellenic Airspace Industry. This case study is build upon the HMAX models algorithm [RP99] and is looking at significant speedup factors. This paper reports results obtained on different system model configurations and used to determine the optimal implementation strategy in accordance to hardware resources
Automatic synthesis and optimization of chip multiprocessors
The microprocessor technology has experienced an enormous growth during the last decades. Rapid downscale of the CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.CMP architects are challenged to take many complex design decisions. Only a few of them are:- What should be the ratio between the core and cache areas on a chip?- Which core architectures to select?- How many cache levels should the memory subsystem have?- Which interconnect topologies provide efficient on-chip communication?These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under the hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of the modern architectures, such as availability of enhanced power saving techniques and presence of complex memory hierarchies.This thesis has several objectives. The first objective is to elaborate the methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and replacing the exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.Finally, the third objective of this thesis is to address the issues of the on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of the on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for the future research are proposed
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
Locality has always been a critical factor in on-chip data placement on CMPs as accessing further-away caches has in the past been more costly than accessing nearby ones. Substantial research on locality-aware designs have thus focused on keeping a copy of the data private. However, this complicatesthe problem of data tracking and search/invalidation; tracking the state of a line at all on-chip caches at a directory or performing full-chip broadcasts are both non-scalable and extremely expensive solutions. In this paper, we make the case for Locality-Oblivious Cache Organization (LOCO), a CMP cache organization that leverages the on-chip network to create virtual single-cycle paths between distant caches, thus redefining the notion of locality. LOCO is a clustered cache organization, supporting both homogeneous and heterogeneous cluster sizes, and provides near single-cycle accesses to data anywhere within the cluster, just like a private cache. Globally, LOCO dynamically creates a virtual mesh connecting all the clusters, and performs an efficient global data search and migration over this virtual mesh, without having to resort to full-chip broadcasts or perform expensive directory lookups. Trace-driven and full system simulations running SPLASH-2 and PARSEC benchmarks show that LOCO improves application run time by up to 44.5% over baseline private and shared cache.Semiconductor Research CorporationUnited States. Defense Advanced Research Projects Agency (Semiconductor Technology Advanced Research Network
Assessing load-sharing within optimistic simulation platforms
The advent of multi-core machines has lead to the need for revising the architecture of modern simulation platforms. One recent proposal we made attempted to explore the viability of load-sharing for optimistic simulators run on top of these types of machines. In this article, we provide an extensive experimental study for an assessment of the effects on run-time dynamics by a load-sharing architecture that has been implemented within the ROOT-Sim package, namely an open source simulation platform adhering to the optimistic synchronization paradigm. This experimental study is essentially aimed at evaluating possible sources of overheads when supporting load-sharing. It has been based on differentiated workloads allowing us to generate different execution profiles in terms of, e.g., granularity/locality of the simulation events. © 2012 IEEE
- …