Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs) have recently emerged as an appealing
technology for severely energy-constrained environments, especially in mobile
appliances where heterogeneity in applications is mainstream. In addition,
given the growing interest in low-power high-performance computing, this type
of architecture is also being investigated as a means to improve the
throughput-per-Watt of complex scientific applications.
In this paper, we design and embed several architecture-aware optimizations
into a multi-threaded general matrix multiplication (gemm), a key operation of
the BLAS, in order to obtain a high performance implementation for ARM
big.LITTLE AMPs. Our solution is based on the reference implementation of gemm
in the BLIS library, and integrates a cache-aware configuration as well as
asymmetric static and dynamic scheduling strategies that carefully tune and
distribute the operation's micro-kernels among the big and LITTLE cores of the
target processor. The experimental results on a Samsung Exynos 5422, a
system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the
big.LITTLE model, show that our cache-aware versions of gemm with asymmetric
scheduling attain substantial performance gains over their
architecture-oblivious counterparts while exploiting all the resources of the
AMP to deliver considerable energy efficiency.
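As a rough illustration of the static asymmetric scheduling idea described above, the following sketch splits a loop of micro-kernel iterations between the two clusters in proportion to their aggregate throughput. The function name and the `big_to_little_ratio` value are hypothetical, chosen only for illustration; the core counts reflect the four Cortex-A15 and four Cortex-A7 cores of the Exynos 5422. This is not the partitioning code used in the paper or in BLIS.

```python
def split_iterations(n_iters, big_cores=4, little_cores=4, big_to_little_ratio=3.0):
    """Split n_iters loop iterations between the big and LITTLE clusters.

    big_to_little_ratio is an ASSUMED per-core speed ratio (big core
    throughput divided by LITTLE core throughput); it would have to be
    measured on the target machine.
    """
    big_weight = big_cores * big_to_little_ratio
    little_weight = little_cores * 1.0
    # Give the big cluster a share proportional to its aggregate throughput.
    big_share = round(n_iters * big_weight / (big_weight + little_weight))
    return big_share, n_iters - big_share


if __name__ == "__main__":
    big, little = split_iterations(1000)
    print(f"big cluster: {big} iterations, LITTLE cluster: {little} iterations")
```

A dynamic strategy would instead let each cluster claim the next chunk of iterations whenever it becomes idle, which removes the need to fix the ratio in advance.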
Opportunistic Scheduling and Beamforming for MIMO-OFDMA Downlink Systems with Reduced Feedback
Opportunistic scheduling and beamforming schemes with reduced feedback are
proposed for MIMO-OFDMA downlink systems. Unlike the conventional beamforming
schemes in which beamforming is implemented solely by the base station (BS) in
a per-subcarrier fashion, the proposed schemes take advantage of a novel
channel decomposition technique to perform beamforming jointly by the BS and
the mobile terminal (MT). The resulting beamforming schemes allow the BS to
employ only one beamforming matrix (BFM) to form beams for all
subcarriers while each MT completes the beamforming task for each subcarrier
locally. Consequently, for a MIMO-OFDMA system with $Q$ subcarriers, the
proposed opportunistic scheduling and beamforming schemes require only one BFM
index and $Q$ supportable throughputs to be returned from each MT to the BS, in
contrast to $Q$ BFM indices and $Q$ supportable throughputs required by the
conventional schemes. The advantage of the proposed schemes becomes more
evident when a further feedback reduction is achieved by grouping adjacent
subcarriers into exclusive clusters and returning only cluster information from
each MT. Theoretical analysis and computer simulation confirm the effectiveness
of the proposed reduced-feedback schemes.
Comment: Proceedings of the 2008 IEEE International Conference on
Communications, Beijing, May 19-23, 2008
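The feedback saving claimed in this abstract can be made concrete with a small count of the values each MT returns. The sketch below is illustrative only: the function and parameter names are hypothetical, and the clustered case assumes that each cluster of adjacent subcarriers reports a single supportable throughput, which the abstract does not spell out.

```python
import math


def feedback_per_terminal(num_subcarriers, cluster_size=1):
    """Count the feedback values one mobile terminal returns.

    Returns (conventional, proposed).
    ASSUMPTION: with clustering, one throughput is returned per cluster.
    """
    num_clusters = math.ceil(num_subcarriers / cluster_size)
    # Conventional: one BFM index and one throughput per subcarrier.
    conventional = 2 * num_subcarriers
    # Proposed: a single BFM index plus one throughput per cluster.
    proposed = 1 + num_clusters
    return conventional, proposed


if __name__ == "__main__":
    for size in (1, 4, 8):
        conv, prop = feedback_per_terminal(64, cluster_size=size)
        print(f"cluster size {size}: conventional={conv}, proposed={prop}")
```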
Restricted Dynamic Programming Heuristic for Precedence Constrained Bottleneck Generalized TSP
We develop a restricted dynamic programming heuristic for a complicated traveling salesman problem in which: a) cities are grouped into clusters (Generalized TSP); b) precedence constraints are imposed on the order of visiting the clusters (Precedence Constrained TSP); c) the costs of moving to the next cluster and doing the required job inside it are aggregated in a minimax manner (Bottleneck TSP); d) all the costs may depend on the sequence of previously visited clusters (Sequence-Dependent TSP or Time-Dependent TSP). Such a multiplicity of constraints complicates the use of mixed integer-linear programming, while dynamic programming (DP) benefits from them; the latter may be supplemented with a branch-and-bound strategy, which necessitates a “DP-compliant” heuristic. The proposed heuristic always yields a feasible solution, which is not always the case with heuristics, and its precision may be tuned until it becomes the exact DP.
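To illustrate the general idea of a restricted DP with minimax cost aggregation, here is a deliberately simplified sketch. It makes several assumptions that depart from the paper: every cluster is a single city, costs do not depend on the visiting sequence, the route is an open path starting at node 0, and the restriction rule (keep the `width` states with the lowest bottleneck value per layer) is a hypothetical choice rather than the authors' rule. With a large enough `width`, the procedure reduces to the exact DP.

```python
def restricted_dp_bottleneck(cost, precedes, width):
    """Restricted DP for a precedence-constrained bottleneck path.

    cost[i][j]: cost of moving from node i to node j; node 0 is the start.
    precedes:   set of (a, b) pairs meaning a must be visited before b
                (assumed acyclic, so every partial route can be extended).
    width:      maximum number of DP states kept per layer.
    Returns (bottleneck_value, route).
    """
    n = len(cost)
    preds = {j: {a for a, b in precedes if b == j} for j in range(1, n)}
    # A state is (set of visited nodes, last node); it maps to the best
    # bottleneck value found so far and one route that achieves it.
    layer = {(frozenset(), 0): (0, [0])}
    for _ in range(n - 1):
        nxt = {}
        for (visited, last), (value, route) in layer.items():
            for j in range(1, n):
                if j in visited or not preds[j] <= visited:
                    continue  # already visited, or a predecessor is missing
                cand = max(value, cost[last][j])  # minimax aggregation
                key = (visited | {j}, j)
                if key not in nxt or cand < nxt[key][0]:
                    nxt[key] = (cand, route + [j])
        # Restriction step: keep only the `width` most promising states.
        layer = dict(sorted(nxt.items(), key=lambda kv: kv[1][0])[:width])
    return min(layer.values(), key=lambda vr: vr[0])


if __name__ == "__main__":
    cost = [[0, 1, 5, 9], [1, 0, 2, 8], [5, 2, 0, 3], [9, 8, 3, 0]]
    print(restricted_dp_bottleneck(cost, {(2, 1)}, width=1))
```

Because the precedence relation is acyclic, some unvisited node always has all of its predecessors visited, so even a width of 1 yields a complete feasible route in this simplified setting.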
Virtual cluster scheduling through the scheduling graph
This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique makes use of a novel representation named the scheduling graph, which describes all possible schedules. A powerful deduction process is applied to this graph, reducing the set of possible schedules at each step. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment of instructions is performed using another novel concept called virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until the scheduling of the instructions has been completed. The advantages of this approach include: (1) accurate scheduling information when assigning clusters, and (2) accurate information about the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from SpecInt95 and MediaBench. The results show that this approach produces better schedules than the previous state of the art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2 clusters) to 9.5% (4 clusters).
Dependable Distributed Computing for the International Telecommunication Union Regional Radio Conference RRC06
The International Telecommunication Union (ITU) Regional Radio Conference
(RRC06) established in 2006 a new frequency plan for the introduction of
digital broadcasting in European, African, Arab, CIS countries and Iran. The
preparation of the plan involved complex calculations under short deadlines and
required dependable and efficient computing capability. The ITU designed and
deployed in-situ a dedicated PC farm, in parallel to the European Organization
for Nuclear Research (CERN) which provided and supported a system based on the
EGEE Grid. The planning cycle at the RRC06 required the periodic execution of
about 200,000 short jobs, using several hundred CPU hours, in a period
of less than 12 hours. The nature of the problem required dynamic
workload-balancing and low-latency access to the computing resources. We
present the strategy and key technical choices that delivered a reliable
service to the RRC06.
A batch scheduler with high level components
In this article we present the design choices and the evaluation of a batch
scheduler for large clusters, named OAR. This batch scheduler is based upon an
original design that emphasizes low software complexity by using high-level
tools. The global architecture is built upon the scripting language Perl and
the relational database engine MySQL. The goal of the OAR project is to prove
that it is possible today to build a complex system for resource management
using such tools without sacrificing efficiency and scalability. Currently, our
system offers most of the important features implemented by other batch
schedulers such as priority scheduling (by queues), reservations, backfilling
and some global computing support. Despite the use of high-level tools, our
experiments show that our system has performance close to that of other systems.
Furthermore, OAR is currently used for the management of 700 nodes (a
metropolitan GRID) and has shown good efficiency and robustness.