22,914 research outputs found

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Get PDF
    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric--static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency

    Opportunistic Scheduling and Beamforming for MIMO-OFDMA Downlink Systems with Reduced Feedback

    Full text link
    Opportunistic scheduling and beamforming schemes with reduced feedback are proposed for MIMO-OFDMA downlink systems. Unlike the conventional beamforming schemes in which beamforming is implemented solely by the base station (BS) in a per-subcarrier fashion, the proposed schemes take advantages of a novel channel decomposition technique to perform beamforming jointly by the BS and the mobile terminal (MT). The resulting beamforming schemes allow the BS to employ only {\em one} beamforming matrix (BFM) to form beams for {\em all} subcarriers while each MT completes the beamforming task for each subcarrier locally. Consequently, for a MIMO-OFDMA system with QQ subcarriers, the proposed opportunistic scheduling and beamforming schemes require only one BFM index and QQ supportable throughputs to be returned from each MT to the BS, in contrast to QQ BFM indices and QQ supportable throughputs required by the conventional schemes. The advantage of the proposed schemes becomes more evident when a further feedback reduction is achieved by grouping adjacent subcarriers into exclusive clusters and returning only cluster information from each MT. Theoretical analysis and computer simulation confirm the effectiveness of the proposed reduced-feedback schemes.Comment: Proceedings of the 2008 IEEE International Conference on Communications, Beijing, May 19-23, 200

    Restricted Dynamic Programming Heuristic for Precedence Constrained Bottleneck Generalized TSP

    Full text link
    We develop a restricted dynamical programming heuristic for a complicated traveling salesman problem: a) cities are grouped into clusters, resp. Generalized TSP; b) precedence constraints are imposed on the order of visiting the clusters, resp. Precedence Constrained TSP; c) the costs of moving to the next cluster and doing the required job inside one are aggregated in a minimax manner, resp. Bottleneck TSP; d) all the costs may depend on the sequence of previously visited clusters, resp. Sequence-Dependent TSP or Time Dependent TSP. Such multiplicity of constraints complicates the use of mixed integer-linear programming, while dynamic programming (DP) benefits from them; the latter may be supplemented with a branch-and-bound strategy, which necessitates a “DP-compliant” heuristic. The proposed heuristic always yields a feasible solution, which is not always the case with heuristics, and its precision may be tuned until it becomes the exact DP

    Virtual cluster scheduling through the scheduling graph

    Get PDF
    This paper presents an instruction scheduling and cluster assignment approach for clustered processors. The proposed technique makes use of a novel representation named the scheduling graph which describes all possible schedules. A powerful deduction process is applied to this graph, reducing at each step the set of possible schedules. In contrast to traditional list scheduling techniques, the proposed scheme tries to establish relations among instructions rather than assigning each instruction to a particular cycle. The main advantage is that wrong or poor schedules can be anticipated and discarded earlier. In addition, cluster assignment of instructions is performed using another novel concept called virtual clusters, which define sets of instructions that must execute in the same cluster. These clusters are managed during the deduction process to identify incompatibilities among instructions. The mapping of virtual to physical clusters is postponed until the scheduling of the instructions has finalized. The advantages this novel approach features include: (1) accurate scheduling information when assigning, and, (2) accurate information of the cluster assignment constraints imposed by scheduling decisions. We have implemented and evaluated the proposed scheme with superblocks extracted from Speclnt95 and MediaBench. The results show that this approach produces better schedules than the previous state-of-the-art. Speed-ups are up to 15%, with average speed-ups ranging from 2.5% (2-Clusters) to 9.5% (4-Clusters).Peer ReviewedPostprint (published version

    Dependable Distributed Computing for the International Telecommunication Union Regional Radio Conference RRC06

    Full text link
    The International Telecommunication Union (ITU) Regional Radio Conference (RRC06) established in 2006 a new frequency plan for the introduction of digital broadcasting in European, African, Arab, CIS countries and Iran. The preparation of the plan involved complex calculations under short deadline and required dependable and efficient computing capability. The ITU designed and deployed in-situ a dedicated PC farm, in parallel to the European Organization for Nuclear Research (CERN) which provided and supported a system based on the EGEE Grid. The planning cycle at the RRC06 required a periodic execution in the order of 200,000 short jobs, using several hundreds of CPU hours, in a period of less than 12 hours. The nature of the problem required dynamic workload-balancing and low-latency access to the computing resources. We present the strategy and key technical choices that delivered a reliable service to the RRC06

    A batch scheduler with high level components

    Get PDF
    In this article we present the design choices and the evaluation of a batch scheduler for large clusters, named OAR. This batch scheduler is based upon an original design that emphasizes on low software complexity by using high level tools. The global architecture is built upon the scripting language Perl and the relational database engine Mysql. The goal of the project OAR is to prove that it is possible today to build a complex system for ressource management using such tools without sacrificing efficiency and scalability. Currently, our system offers most of the important features implemented by other batch schedulers such as priority scheduling (by queues), reservations, backfilling and some global computing support. Despite the use of high level tools, our experiments show that our system has performances close to other systems. Furthermore, OAR is currently exploited for the management of 700 nodes (a metropolitan GRID) and has shown good efficiency and robustness
    corecore