7,731 research outputs found
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest in low-power high-performance computing, this type
of architecture is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric-static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, show that our cache-aware versions of gemm with asymmetric
scheduling attain important gains in performance with respect to their
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency.
Microgrid - The microthreaded many-core architecture
Traditional processors use the von Neumann execution model, while some other
processors in the past have used the dataflow execution model. A combination of
the von Neumann and dataflow models has also been tried in the past, and the
resulting model is referred to as the hybrid dataflow execution model. We
describe a hybrid dataflow model known as microthreading. It provides
constructs for the creation of threads and for synchronization and
communication between them in an intermediate language. The microthreading
model is an abstract programming and machine model for many-core architectures.
A particular instance of this model is called the microthreaded architecture,
or the Microgrid. This architecture implements all the concurrency constructs
of the microthreading model, together with their management, in hardware.
Comment: 30 pages, 16 figures
Design and Implementation of a Distributed Middleware for Parallel Execution of Legacy Enterprise Applications
A typical enterprise uses a local area network of computers to perform its
business. During the off-working hours, the computational capacities of these
networked computers are underused or unused. To utilize this computational
capacity, an application has to be recoded to exploit the concurrency inherent
in its computation, which is clearly not possible for legacy applications
without available source code. This thesis presents the design and implementation of a
distributed middleware which can automatically execute a legacy application on
multiple networked computers by parallelizing it. This middleware runs multiple
copies of the binary executable code in parallel on different hosts in the
network. It wraps the binary executable code of the legacy application in
order to capture the kernel-level data access system calls and perform them
in a distributed fashion over multiple computers in a safe and conflict-free manner. The
middleware also incorporates a dynamic scheduling technique to execute the
target application in minimum time by scavenging the available CPU cycles of
the hosts in the network. The dynamic scheduler also tolerates changes in the
CPU availability of the hosts over time and reschedules the replicas
performing the computation accordingly to minimize the execution time. A prototype
implementation of this middleware has been developed as a proof of concept of
the design. This implementation has been evaluated with a few typical case
studies, and the test results confirm that the middleware works as expected.
Efficient resources assignment schemes for clustered multithreaded processors
New feature sizes provide a larger number of transistors per chip, which architects could use to further exploit instruction-level parallelism. However, these technologies also bring new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction-level parallelism is yielding diminishing returns, so other sources of parallelism, such as thread-level parallelism, must be exploited in order to keep raising performance with reasonable hardware complexity. On the other hand, clustered architectures have been widely studied as a way to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that achieves an average speedup of 17.6% versus Icount while improving fairness by 24%. Peer Reviewed. Postprint (published version)
Massively Parallel Computation Using Graphics Processors with Application to Optimal Experimentation in Dynamic Control
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has led to its adoption in many non-graphics applications, including a wide variety of scientific computing fields. At the same time, a number of important dynamic optimal policy problems in economics are hungry for computing power to help overcome the dual curses of complexity and dimensionality. We investigate whether computational economics may benefit from the new tools through a case study of an imperfect-information dynamic programming problem with a learning and experimentation trade-off, that is, a choice between controlling the policy target and learning system parameters. Specifically, we use a model of active learning and control of a linear autoregression with unknown slope that has appeared in a variety of macroeconomic policy and other contexts. The endogeneity of posterior beliefs makes the problem difficult in that the value function need not be convex and the policy function need not be continuous. This complication makes the problem a suitable target for massively parallel computation using graphics processors. Our findings are cautiously optimistic in that the new tools let us easily achieve a factor of 15 performance gain relative to an implementation targeting single-core processors, and thus establish a better reference point on the computational speed vs. coding complexity trade-off frontier. While further gains and wider applicability may lie behind a steep learning barrier, we argue that the future of many computations belongs to parallel algorithms anyway. Keywords: Graphics Processing Units, CUDA programming, Dynamic programming, Learning, Experimentation