10 research outputs found
Programming Challenges for Many-Core Computing
Many-core architectures face significant hurdles to successful
adoption by ISVs, and ultimately, the marketplace. One of the most
difficult is addressing the programmability problems associated
with parallel computing. For example, it is notoriously difficult
to debug a parallel application, given the potential interleavings
of the various threads of control in that application. Another
problem is that predicting performance, even at coarse accuracy,
is extremely difficult. I will explain why a chip company like
Intel is interested in advanced programming languages research and
believes this is critical to adoption of many-core architectures.
Intel's Programming Research Lab is addressing these issues for
both client and server computing, in particular media and gaming
workloads. We are implementing high-level programming
abstractions based on transactional memory, data parallel
programming models and functional languages. In this talk, I will
briefly discuss a language based on Nested Data Parallelism (NDP)
called Ct. NDP models have the advantage of being deterministic,
meaning that the functional behaviors of sequential and parallel
executions of an NDP program are always the same for the same
input. Data races are not possible in this model. Furthermore,
NDP models have an easy-to-understand coarse performance model,
which can be made more accurate for specific architectural
families. This enables the programmer to comprehend the
performance implications of their code well enough to make
well-informed algorithmic choices.
About the speaker
Anwar Ghuloum earned degrees at the University of California, Los
Angeles (B.S., Computer Science and Engineering) and Carnegie
Mellon University's School of Computer Science (Ph.D., Computer
Science, 1996), where his thesis introduced concepts of Nested
Data Parallel idioms to traditional parallelizing compilers. Anwar
has been a Senior Staff Scientist with Intel's Programming Systems
Lab since joining in early 2002, working on diverse topics such as
optimizing memory system performance, parallel architecture
evaluation, parallel language and compiler design, and multimedia
applications.
Before that, he co-founded and was the CTO of a fab-
less semiconductor startup called Intensys that built
programmable, highly parallel image and video processors for the
consumer electronics market. Prior to that, Anwar developed novel
predictive drug design software for early lead optimization using
3D surface pattern recognition techniques for a biotech startup
called MetaXen (acquired by Exelixis Pharmaceuticals). He has
also served as a post-doctoral research associate at Stanford
University's Computer Science department. A recurring theme in
Anwar's work has been to bridge high-level application knowledge
and low-level parallel architecture constraints with careful
parallel language and compiler design.
Optimizing data parallel operations on many-core platforms
Data parallel operations are widely used in game, multimedia, physics, and data-intensive scientific applications. Unlike control parallelism, data parallelism comes from simultaneous operations across large sets of collection-oriented data such as vectors and matrices. A simple implementation can use OpenMP directives to execute operations on multiple data elements concurrently. However, this implementation introduces many barriers across data parallel operations, and even within a single data parallel operation, to synchronize the concurrent threads. This synchronization cost may overwhelm the benefit of data parallelism. Moreover, barriers preclude many optimization opportunities among parallel regions. In this paper, we describe an approach to optimizing data parallel operations on many-core platforms, called sub-primitive fusion, which reduces expensive barriers by merging code regions of data parallel operations based on data flow information. It also replaces the remaining barriers with lightweight synchronization mechanisms. This approach enables further optimizations such as data reuse across data parallel operations, dynamic partitioning of fused data parallel operations, and semi-asynchronous parallel execution among the threads. We present preliminary experimental results for sparse matrix kernels that demonstrate the benefits of this approach. We observe speedups of up to 5x on an 8-way SMP machine compared against serial execution time.
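The core idea of fusing data parallel operations can be illustrated with a toy example. This is a hedged sketch of the general fusion concept, not the paper's sub-primitive fusion algorithm; the function names are invented for illustration.

```python
# Two elementwise passes separated by a barrier (and an intermediate
# array) versus one fused pass over the same data.

def unfused(a):
    # Pass 1: scale. In a parallel setting, a barrier would separate
    # this pass from the next to make `tmp` fully visible.
    tmp = [x * 2 for x in a]      # intermediate result written to memory
    # Pass 2: offset. Re-reads tmp, costing bandwidth plus the barrier.
    return [x + 1 for x in tmp]

def fused(a):
    # Fused pass: one traversal, no intermediate array, and no barrier
    # between the two operations; each value is reused while still hot.
    return [x * 2 + 1 for x in a]

assert unfused([1, 2, 3]) == fused([1, 2, 3]) == [3, 5, 7]
```

Fusion is legal here because each output element depends only on the corresponding input element; the data flow analysis mentioned in the abstract is what establishes such independence in general.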
Asymmetric Chip Multiprocessors: Balancing Hardware Efficiency and Programmer Efficiency
Chip multiprocessors (CMPs) are becoming common as the cost of increasing chip power begins to limit single-core performance. The most power-efficient CMP consists of low-power in-order cores. However, performance on such a processor is low unless the workload is almost completely parallelized, which, depending on the workload, can be impossible or require significant programmer effort. This paper argues that the programmer effort required to parallelize an application can be reduced if the underlying architecture promises faster execution of the serial portion of the application. In that case, programmers can parallelize only the easier-to-parallelize portions of the application and rely on the hardware to run the serial portion faster. We make a case for an architecture that contains one high-performance out-of-order core and multiple low-performance in-order cores. We call it an Asymmetric Chip Multiprocessor (ACMP). Although the out-of-order core makes the ACMP less power efficient, it enables the ACMP to produce higher performance gains with less programmer effort.
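The ACMP argument can be quantified with an Amdahl's-law style calculation. The numbers below are assumptions for illustration only (a big core 2x faster on serial code than a small core, throughputs normalized to one small core), not figures from the paper.

```python
# Amdahl's-law sketch of the ACMP argument, under assumed relative
# core performances (all speeds normalized to one small in-order core).

def symmetric_speedup(parallel_frac, n_small):
    # n_small identical in-order cores; the serial portion runs on one.
    serial = 1.0 - parallel_frac
    return 1.0 / (serial + parallel_frac / n_small)

def acmp_speedup(parallel_frac, n_small, big_perf=2.0):
    # Serial portion runs on the out-of-order core (big_perf times
    # faster); the parallel portion runs on the small cores.
    serial = 1.0 - parallel_frac
    return 1.0 / (serial / big_perf + parallel_frac / n_small)

# With only 60% of the program parallelized, accelerating the dominant
# serial portion pays off more than adding parallel throughput.
print(symmetric_speedup(0.6, 8))  # ~2.1x
print(acmp_speedup(0.6, 8))       # ~3.6x
```

This is exactly the trade the paper describes: the programmer stops parallelizing earlier (lower `parallel_frac`) and the fast core absorbs the remaining serial work.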
SUIF Explorer: A Programming Assistant for Parallel Machines
This paper presents the preliminary design of SUIF Explorer, a system that uses state-of-the-art parallelization analysis to guide a programmer in developing and optimizing efficient parallel code. It alleviates the tedious process of performance debugging by analyzing the results from both the SUIF compiler [1] and execution analyzers and presenting high-level information to users via visualization tools. It focuses the attention of the user on the critical slice of an application, asking pointed questions that lead even a naive user down the same path a compiler expert might take in optimizing the program. This paper illustrates the effectiveness of SUIF Explorer with a case study based on the MDG application in the PERFECT benchmark [3]. Users need only examine a few lines of code to parallelize a loop that spans several procedures and over a hundred lines of code.
A Flexible Parallel Programming Model for Tera-scale Architectures Table of Contents
1.0 New Opportunities, New Challenges