10 research outputs found
Programming Challenges for Many-Core Computing
Many-core architectures face significant hurdles to successful
adoption by ISVs, and ultimately, the marketplace. One of the most
difficult is addressing the programmability problems associated
with parallel computing. For example, it is notoriously difficult
to debug a parallel application, given the potential interleavings
of the various threads of control in that application. Another
problem is that predicting performance, even at coarse accuracy,
is extremely difficult. I will explain why a chip company like
Intel is interested in advanced programming languages research and
believes this is critical to adoption of many-core architectures.
Intel's Programming Research Lab is addressing these issues for
both client and server computing, in particular media and gaming
workloads. We are implementing high-level programming
abstractions based on transactional memory, data parallel
programming models and functional languages. In this talk, I will
briefly discuss a language based on Nested Data Parallelism (NDP)
called Ct. NDP models have the advantage of being deterministic,
meaning that the functional behaviors of sequential and parallel
executions of an NDP program are always the same for the same
input. Data races are not possible in this model. Furthermore,
NDP models have an easy-to-understand coarse performance model,
which can be made more accurate for specific architectural
families. This enables the programmer to comprehend the
performance implications of their code well enough to make
well-informed algorithmic choices.
About the speaker
Anwar Ghuloum earned degrees at the University of California, Los
Angeles (B.S., Computer Science and Engineering) and Carnegie
Mellon University's School of Computer Science (Ph.D., Computer
Science, 1996), where his thesis introduced concepts of Nested
Data Parallel idioms to traditional parallelizing compilers. Anwar
has been a Senior Staff Scientist with Intel's Programming Systems
Lab since joining in early 2002, working on diverse topics such as
optimizing memory system performance, parallel architecture
evaluation, parallel language and compiler design, and multimedia
applications.
Before that, he co-founded and was the CTO of a fab-
less semiconductor startup called Intensys that built
programmable, highly parallel image and video processors for the
consumer electronics market. Prior to that, Anwar developed novel
predictive drug design software for early lead optimization using
3D surface pattern recognition techniques for a biotech startup
called MetaXen (acquired by Exelixis Pharmaceuticals). He has
also served as a post-doctoral research associate at Stanford
University's Computer Science department. A recurring theme in
Anwar's work has been to bridge high-level application knowledge
and low-level parallel architecture constraints with careful
parallel language and compiler design.
Optimizing data parallel operations on many-core platforms
Data parallel operations are widely used in game, multimedia, physics, and data-intensive scientific applications. Unlike control parallelism, data parallelism comes from simultaneous operations across large sets of collection-oriented data such as vectors and matrices. A simple implementation can use OpenMP directives to execute operations on multiple data elements concurrently. However, this implementation introduces many barriers across data parallel operations, and even within a single data parallel operation, to synchronize the concurrent threads. This synchronization cost may overwhelm the benefit of data parallelism. Moreover, barriers preclude many optimization opportunities among parallel regions. In this paper, we describe an approach to optimizing data parallel operations on many-core platforms, called sub-primitive fusion, which reduces expensive barriers by merging code regions of data parallel operations based on data flow information. It also replaces the remaining barriers with lightweight synchronization mechanisms. This approach enables further optimizations such as data reuse across data parallel operations, dynamic partitioning of fused data parallel operations, and semi-asynchronous parallel execution among the threads. We present preliminary experimental results for sparse matrix kernels that demonstrate the benefits of this approach. We observe speedups of up to 5x on an 8-way SMP machine compared against serial execution time.
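The core idea of fusing data parallel operations can be illustrated with a toy example. This is a hedged sketch of the general fusion concept, not the paper's sub-primitive fusion algorithm; the function names are invented for illustration.

```python
# Two elementwise passes separated by a barrier (and an intermediate
# array) versus one fused pass over the same data.

def unfused(a):
    # Pass 1: scale. In a parallel setting, a barrier would separate
    # this pass from the next to make `tmp` fully visible.
    tmp = [x * 2 for x in a]      # intermediate result written to memory
    # Pass 2: offset. Re-reads tmp, costing bandwidth plus the barrier.
    return [x + 1 for x in tmp]

def fused(a):
    # Fused pass: one traversal, no intermediate array, and no barrier
    # between the two operations; each value is reused while still hot.
    return [x * 2 + 1 for x in a]

assert unfused([1, 2, 3]) == fused([1, 2, 3]) == [3, 5, 7]
```

Fusion is legal here because each output element depends only on the corresponding input element; the data flow analysis mentioned in the abstract is what establishes such independence in general.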
Asymmetric Chip Multiprocessors: Balancing Hardware Efficiency and Programmer Efficiency
Chip multiprocessors (CMPs) are becoming common as the cost of increasing chip power begins to limit single-core performance. The most power-efficient CMP consists of low-power in-order cores. However, performance on such a processor is low unless the workload is almost completely parallelized, which, depending on the workload, can be impossible or require significant programmer effort. This paper argues that the programmer effort required to parallelize an application can be reduced if the underlying architecture promises faster execution of the serial portion of the application. In that case, programmers can parallelize only the easier-to-parallelize portions of the application and rely on the hardware to run the serial portion faster. We make a case for an architecture that contains one high-performance out-of-order core and multiple low-performance in-order cores. We call it an Asymmetric Chip Multiprocessor (ACMP). Although the out-of-order core makes the ACMP less power efficient, it enables the ACMP to produce higher performance gains with less programmer effort.
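The ACMP argument can be quantified with an Amdahl's-law style calculation. The numbers below are assumptions for illustration only (a big core 2x faster on serial code than a small core, throughputs normalized to one small core), not figures from the paper.

```python
# Amdahl's-law sketch of the ACMP argument, under assumed relative
# core performances (all speeds normalized to one small in-order core).

def symmetric_speedup(parallel_frac, n_small):
    # n_small identical in-order cores; the serial portion runs on one.
    serial = 1.0 - parallel_frac
    return 1.0 / (serial + parallel_frac / n_small)

def acmp_speedup(parallel_frac, n_small, big_perf=2.0):
    # Serial portion runs on the out-of-order core (big_perf times
    # faster); the parallel portion runs on the small cores.
    serial = 1.0 - parallel_frac
    return 1.0 / (serial / big_perf + parallel_frac / n_small)

# With only 60% of the program parallelized, accelerating the dominant
# serial portion pays off more than adding parallel throughput.
print(symmetric_speedup(0.6, 8))  # ~2.1x
print(acmp_speedup(0.6, 8))       # ~3.6x
```

This is exactly the trade the paper describes: the programmer stops parallelizing earlier (lower `parallel_frac`) and the fast core absorbs the remaining serial work.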
SUIF Explorer: A Programming Assistant for Parallel Machines
This paper presents the preliminary design of SUIF Explorer, a system that uses state-of-the-art parallelization analysis to guide a programmer in developing and optimizing efficient parallel code. It alleviates the tedious process of performance debugging by analyzing the results from both the SUIF compiler [1] and execution analyzers and presenting high-level information to users via visualization tools. It focuses the attention of the user on the critical slice of an application, asking pointed questions that lead even a naive user down the same path a compiler expert might take in optimizing the program. This paper illustrates the effectiveness of SUIF Explorer with a case study based on the MDG application in the PERFECT benchmark [3]. Users need only examine a few lines of code to parallelize a loop that spans several procedures and over a hundred lines of code.
A Flexible Parallel Programming Model for Tera-scale Architectures Table of Contents
1.0 New Opportunities, New Challenges