
    Compile-Time Performance Prediction of Scientific Programs

    124 p. Thesis (Ph.D.)--University of Illinois at Urbana-Champaign, 2000. We use stack distances to quantify locality and show that the average locality computed using stack distances is a very reliable metric. A new algorithm for stack processing, which is 30% faster than the best known algorithm on the suite of programs traced, is also presented.
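
    The stack-distance metric the thesis builds on can be sketched in a few lines: each reference's stack distance is the number of distinct addresses touched since the previous reference to the same address. The sketch below is a naive illustration of that definition, not the thesis's faster algorithm, which is not described in this abstract.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ADDRS 1024

/* Naive O(n*m) stack-distance computation over an address trace.
 * A reference's distance is the number of distinct addresses touched
 * since the last reference to the same address (-1 on first use).
 * Real implementations use trees; this only illustrates the metric. */
static long stack[MAX_ADDRS];
static int depth = 0;

int stack_distance(long addr)
{
    for (int i = 0; i < depth; i++) {
        if (stack[i] == addr) {
            /* Found at depth i: move it to the top (LRU order). */
            memmove(&stack[1], &stack[0], i * sizeof(long));
            stack[0] = addr;
            return i;
        }
    }
    /* First reference: push on top; distance is "infinite". */
    memmove(&stack[1], &stack[0], depth * sizeof(long));
    stack[0] = addr;
    depth++;
    return -1;
}

int main(void)
{
    long trace[] = {1, 2, 3, 2, 1, 3, 1};
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        printf("addr %ld -> distance %d\n", trace[i], stack_distance(trace[i]));
    return 0;
}
```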

    Evaluation of OpenMP for the Cyclops multithreaded architecture

    Multithreaded architectures have the potential to tolerate large memory and functional-unit latencies and to increase resource utilization. The Blue Gene/Cyclops architecture, under development at the IBM T. J. Watson Research Center, is one such system, offering massive intra-chip parallelism. Although the BG/C architecture was initially designed to execute specific applications, we believe that it can be effectively used on a broad range of parallel numerical applications. Programming such applications for this unconventional design requires a significant porting effort when using the basic built-in mechanisms for thread management and synchronization. In this paper, we describe an OpenMP environment targeting BG/C, currently under development at the CEPBA-IBM Research Institute, for parallelizing such applications. The environment is evaluated with a set of simple numerical kernels and a subset of the NAS OpenMP benchmarks. We identify issues that were not initially considered in the design of the BG/C architecture to support a programming model such as OpenMP. We also evaluate features currently offered by the BG/C architecture that should be considered in the implementation of an efficient OpenMP layer for massive intra-chip parallel architectures.
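
    For reference, the simple numerical kernels used in evaluations of this kind amount to a few lines of OpenMP. The sketch below is a generic parallel DAXPY, not code from the paper; how well the runtime maps its iterations onto the chip's hardware threads is exactly the sort of question the paper studies.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

/* A minimal OpenMP numerical kernel (parallel DAXPY).  The OpenMP
 * runtime maps loop iterations onto hardware threads; the quality of
 * that mapping is what an OpenMP layer for a massively multithreaded
 * chip must get right. */
int main(void)
{
    static double x[N], y[N];
    double a = 2.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f (threads available: %d)\n", y[0], omp_get_max_threads());
    return 0;
}
```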

    Optimizing NANOS OpenMP for the IBM Cyclops multithreaded architecture

    In this paper, we present two approaches to improve the execution of OpenMP applications on the IBM Cyclops multithreaded architecture. The two solutions are independent, and both aim to improve performance through better management of cache locality. The first is based on software modifications to the OpenMP runtime library to balance stack accesses across all data caches. The second is a small hardware modification that changes the data cache mapping behavior, with the same goal. Both help parallel applications scale and perform better on this kind of architecture; in fact, they could also be applied to future multi-core processors. We evaluate these proposals by executing (in simulation) some of the NAS benchmarks. The results show how, with small changes in both the software and the hardware, we achieve very good scalability in parallel applications. Our results also show that standard execution environments oriented to multiprocessor architectures can be easily adapted to exploit multithreaded processors.
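
    The intuition behind the software fix can be sketched in plain C. If every thread's stack starts at the same alignment, the hot top-of-stack words of all threads map to the same cache region; skewing each stack base by a thread-dependent offset spreads them out. The bank count, line size, and addresses below are hypothetical stand-ins, not Cyclops' actual geometry or the paper's runtime code.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CACHES 16    /* assumed number of data cache banks */
#define LINE_SIZE  64    /* assumed cache line size in bytes   */

/* Which cache bank an address maps to under simple line interleaving. */
static unsigned bank_of(uintptr_t addr)
{
    return (addr / LINE_SIZE) % NUM_CACHES;
}

/* Per-thread stack base: a common region plus a one-line skew per
 * thread, so consecutive threads' stack tops land in different banks. */
static uintptr_t skewed_stack_base(uintptr_t region, unsigned tid)
{
    return region + (uintptr_t)tid * LINE_SIZE;
}

int main(void)
{
    uintptr_t region = 0x100000;    /* hypothetical base address */
    for (unsigned tid = 0; tid < 8; tid++) {
        uintptr_t base = skewed_stack_base(region, tid);
        printf("thread %u: stack base %#lx -> bank %u\n",
               tid, (unsigned long)base, bank_of(base));
    }
    return 0;
}
```

    Without the skew, every thread's base would map to bank 0; with it, the eight stacks spread across eight banks, which is the balancing effect the runtime modification aims for.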

    Estimating Cache Misses and Locality Using Stack Distances

    Cache behavior modeling is an important part of modern optimizing compilers. In this paper we present a method to estimate the number of cache misses, at compile time, using a machine-independent model based on stack algorithms. Our algorithm computes stack histograms symbolically, using data-dependence distance vectors, and is exact when dependence distances are uniformly generated. The stack histogram accurately models fully associative caches with an LRU replacement policy, and provides a very good approximation for set-associative caches and for programs with non-constant dependence distances.
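
    The step from a stack histogram to a miss count is direct: under LRU, a reference to a fully associative cache of C lines misses exactly when its stack distance is at least C, so the miss count is the tail of the histogram, and one histogram yields miss counts for every cache size at once. A minimal sketch of that final step (the histogram values are made up; computing the histogram symbolically is the paper's contribution and is not shown):

```c
#include <stdio.h>

/* Given a stack-distance histogram hist[d] (references with stack
 * distance d) and a count of cold (first-touch) references, the miss
 * count for a fully associative LRU cache of `lines` lines is the
 * histogram tail: every reference with distance >= lines misses. */
long misses_for_cache(const long *hist, int max_dist, long cold, int lines)
{
    long misses = cold;               /* cold references always miss */
    for (int d = lines; d <= max_dist; d++)
        misses += hist[d];
    return misses;
}

int main(void)
{
    /* Hypothetical histogram over distances 0..7. */
    long hist[8] = {40, 25, 15, 8, 5, 4, 2, 1};
    long cold = 10;
    for (int lines = 1; lines <= 8; lines *= 2)
        printf("cache of %d lines: %ld misses\n",
               lines, misses_for_cache(hist, 7, cold, lines));
    return 0;
}
```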

    Analytical Modeling of Pipeline Parallelism

    Parallel programming is a requirement in the multi-core era. One of the most promising techniques for making parallel programming accessible to general users is the use of parallel programming patterns. Functional pipeline parallelism is a pattern that is well suited to many emerging applications, such as streaming and "Recognition, Mining and Synthesis" (RMS) workloads. In this paper we develop an analytical model for pipeline parallelism based on queueing theory. The model is useful both for characterizing the performance and efficiency of existing implementations and for guiding the design of new pipeline algorithms. We demonstrate the usefulness of the model by characterizing and optimizing two of the PARSEC benchmarks, ferret and dedup. We identified two issues with these codes: load imbalance and I/O bottlenecks. We addressed load imbalance using two techniques: i) parallel pipeline stage collapsing; and ii) dynamic scheduling. We implemented these optimizations using Pthreads and the Threading Building Blocks (TBB) libraries. We compare the performance of different alternatives and note that the TBB implementation based on work stealing outperforms all other variants.
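
    The basic queueing argument such a model rests on can be stated simply: if stage i has c_i parallel workers of service rate mu_i, its capacity is c_i * mu_i, sustained pipeline throughput is bounded by the minimum stage capacity, and load imbalance shows up as one stage saturating while the others idle. The sketch below computes that bound under those assumptions with made-up numbers; it is not the paper's full model.

```c
#include <stdio.h>

/* Minimal throughput model for a functional pipeline: stage i has
 * c[i] parallel workers, each with service rate mu[i] items/sec.
 * Stage capacity is c[i]*mu[i]; sustained throughput is the minimum
 * stage capacity, and utilization at that throughput reveals which
 * stages are over- or under-provisioned. */
int main(void)
{
    double mu[] = {500.0, 80.0, 120.0, 400.0};   /* items/sec per worker */
    int    c[]  = {1, 4, 2, 1};                  /* workers per stage    */
    int n = 4;

    double bottleneck = mu[0] * c[0];
    int bstage = 0;
    for (int i = 1; i < n; i++) {
        if (mu[i] * c[i] < bottleneck) {
            bottleneck = mu[i] * c[i];
            bstage = i;
        }
    }
    printf("bottleneck: stage %d, throughput %.1f items/sec\n",
           bstage, bottleneck);
    for (int i = 0; i < n; i++)
        printf("stage %d utilization: %.2f\n", i, bottleneck / (mu[i] * c[i]));
    return 0;
}
```

    Remedies like stage collapsing or dynamic scheduling both amount to raising the capacity of the bottleneck stage in this picture.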

    Compile-time Based Performance Prediction

    In this paper we present results we obtained using a compiler to predict the performance of scientific codes. The compiler, Polaris [3], is both the primary tool for estimating the performance of a range of codes and the beneficiary of the results obtained from predicting program behavior at compile time. We show that a simple compile-time model, augmented with profiling data obtained using very light instrumentation, can be accurate within 20% (on average) of the measured performance for codes using both dense and sparse computational methods.
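
    The general shape of such a model is a sum of products: statically counted operations per iteration, times profiled trip counts, times per-operation costs from a machine model. The sketch below illustrates that shape with hypothetical numbers; the operation classes, counts, and costs are not from the paper.

```c
#include <stdio.h>

/* Sketch of a simple compile-time cost model: predicted time =
 * sum over operation classes of (ops per iteration, counted
 * statically) x (trip count, from light profiling) x (per-op cost,
 * from a machine model).  All numbers are hypothetical. */
struct op_class {
    const char *name;
    long   per_iter;   /* ops per iteration, from the compiler  */
    long   trips;      /* loop trip count, from profiling       */
    double cost_ns;    /* per-operation cost on the target      */
};

int main(void)
{
    struct op_class ops[] = {
        {"fp add/mul", 6, 1000000, 1.0},
        {"load",       4, 1000000, 2.5},
        {"store",      2, 1000000, 2.5},
    };
    double total = 0.0;
    for (int i = 0; i < 3; i++)
        total += (double)ops[i].per_iter * ops[i].trips * ops[i].cost_ns;
    printf("predicted time: %.3f ms\n", total * 1e-6);
    return 0;
}
```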

    Concurrency in Mobile Browser Engines


    Multidimensional Blocking in UPC

    Partitioned Global Address Space (PGAS) languages offer an attractive, high-productivity programming model for programming large-scale parallel machines. PGAS languages, such as Unified Parallel C (UPC), combine the simplicity of shared-memory programming with the efficiency of the message-passing paradigm by giving users control over the data layout. PGAS languages distinguish between private, shared-local, and shared-remote memory, with shared-remote accesses typically much more expensive than shared-local and private accesses, especially on distributed-memory machines where a shared-remote access implies communication over a network. In this paper we present a simple extension to the UPC language that allows the programmer to block shared arrays in multiple dimensions. We claim that this extension allows for better control of locality, and therefore performance, in the language. We describe an analysis that allows the compiler to distinguish between local and remote shared array accesses. Local shared array accesses are then transformed into direct memory accesses by the compiler, saving the overhead of a locality check at runtime. We present results showing that the locality analysis is able to significantly reduce the number of shared accesses.
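
    The locality question the analysis answers can be illustrated in plain C (the abstract does not give the extension's syntax, so no UPC code is shown). With an array blocked into B1 x B2 tiles distributed over the threads, the owner of element (i, j) is a pure function of its tile index; where the compiler can prove owner(i, j) == MYTHREAD, it replaces the runtime locality check with a direct memory access. The tile sizes, thread count, and round-robin placement below are hypothetical stand-ins.

```c
#include <stdio.h>

#define THREADS 4    /* stand-in for UPC's THREADS  */
#define M  16        /* array columns               */
#define B1 4         /* tile rows                   */
#define B2 4         /* tile columns                */

/* Owner of element (i, j) under a hypothetical round-robin placement
 * of B1 x B2 tiles over THREADS threads. */
static int owner(int i, int j)
{
    int tiles_per_row = M / B2;
    int tile = (i / B1) * tiles_per_row + (j / B2);
    return tile % THREADS;
}

int main(void)
{
    int mythread = 2;    /* stand-in for UPC's MYTHREAD */
    for (int i = 0; i < 8; i += B1)
        for (int j = 0; j < M; j += B2)
            printf("tile at (%d,%d): owner %d, %s for thread %d\n",
                   i, j, owner(i, j),
                   owner(i, j) == mythread ? "local" : "remote", mythread);
    return 0;
}
```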