12 research outputs found

    Dynamic load balancing in parallel processing on non-homogeneous clusters

    This paper analyzes dynamic and static load balancing on non-homogeneous cluster architectures, examining both the theoretical parallel speedup and the speedup obtained experimentally. Three interconnected clusters were used; the machines within each cluster have homogeneous processors, but the processors differ between clusters. The set can therefore be seen as a 25-processor heterogeneous cluster or as a multi-cluster scheme with subsets of homogeneous processors. A classical application (Parallel N-Queens), with a parallel solution algorithm in which processing predominates over communication, was chosen so that the load balancing aspects (dynamic or static) can be studied in depth without communication overhead distorting the results. Three forms of load distribution across the processors (Direct Static, Predictive Static and Dynamic by Demand) were studied, analyzing in each case the parallel speedup and the load imbalance as a function of problem size and the processors used. Facultad de Informátic
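    The three distribution schemes named in the abstract can be compared with a small simulation. The sketch below is only one reading of the scheme names, not the paper's algorithms: the function names, the round-robin and speed-proportional splits, the relative speeds, and the imbalance metric are all assumptions made for illustration.

```python
import heapq

# NOTE: hypothetical illustration. The paper's actual definitions of
# "Direct Static", "Predictive Static" and "Dynamic by Demand" may differ.

def direct_static(costs, speeds):
    """Deal tasks round-robin, ignoring processor speed."""
    finish = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        w = i % len(speeds)
        finish[w] += cost / speeds[w]
    return finish

def predictive_static(costs, speeds):
    """Give each processor a contiguous block sized in proportion to its speed."""
    n, total = len(costs), sum(speeds)
    finish, start, acc = [], 0, 0.0
    for w, speed in enumerate(speeds):
        acc += speed
        # The last processor takes whatever remains, so no task is dropped.
        end = n if w == len(speeds) - 1 else round(n * acc / total)
        finish.append(sum(costs[start:end]) / speed)
        start = end
    return finish

def dynamic_by_demand(costs, speeds):
    """Hand the next task to whichever processor becomes idle first."""
    idle = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle)
    finish = [0.0] * len(speeds)
    for cost in costs:
        t, w = heapq.heappop(idle)      # earliest-idle processor asks for work
        t += cost / speeds[w]
        finish[w] = t
        heapq.heappush(idle, (t, w))
    return finish

def makespan(finish):
    """Parallel run time: the moment the last processor finishes."""
    return max(finish)

def imbalance(finish):
    """Assumed metric: idle fraction of the earliest finisher."""
    return (max(finish) - min(finish)) / max(finish)

if __name__ == "__main__":
    speeds = [1.0, 2.0, 4.0]            # assumed relative processor speeds
    costs = [1.0] * 14                  # assumed uniform task costs
    for scheme in (direct_static, predictive_static, dynamic_by_demand):
        f = scheme(costs, speeds)
        print(scheme.__name__, makespan(f), imbalance(f))
```

    With uniform task costs the predictive split and the on-demand scheme reach the same run time; the on-demand scheme needs no advance knowledge of processor speeds or task costs, at the price of one request per task.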

    Operating system support for overlapping-ISA heterogeneous multi-core architectures

    A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a design provides a cost-effective solution for processor manufacturers to continuously improve both single-thread performance and multi-thread throughput. This design, however, faces significant challenges in the operating system, which traditionally assumes only homogeneous hardware. This paper presents a comprehensive study of OS support for heterogeneous architectures in which cores have asymmetric performance and overlapping, but non-identical instruction sets. Our algorithms allow applications to transparently execute and fairly share different types of cores. We have implemented these algorithms in the Linux 2.6.24 kernel and evaluated them on an actual heterogeneous platform. Evaluation results demonstrate that our designs efficiently manage heterogeneous hardware and enable significant performance improvements for a range of applications.

    Balance dinámico de carga en procesamiento paralelo sobre clusters no-homogéneos (Dynamic load balancing in parallel processing on non-homogeneous clusters)

    This paper discusses static and dynamic load balancing on non-homogeneous cluster architectures, analyzing both the theoretical parallel speedup and the speedup obtained experimentally. A combination of 3 interconnected clusters was used, in which the machines within each cluster have homogeneous processors that differ between clusters. The set can thus be viewed as a 25-processor heterogeneous cluster or as a multi-cluster scheme with subsets of homogeneous processors. A classical application (Parallel N-Queens) was chosen, with a parallel solution algorithm in which processing predominates over communication, so as to examine the load balancing aspects (static or dynamic) in depth without the results being distorted by communication overhead. Three forms of load distribution across the processors are also analyzed (Direct Static, Predictive Static and Dynamic by Demand), studying in each case the parallel speedup and the load imbalance as a function of problem size and the processors used. VI Workshop de Procesamiento Distribuido y Paralelo (WPDP). Red de Universidades con Carreras en Informática (RedUNCI

    Best of both latency and throughput

    Dynamic task distribution in a heterogeneous loosely-coupled distributed computer system

    This thesis studies the problem of dynamic distribution of tasks between hosts in a heterogeneous, loosely-coupled, distributed computing system. The goals of the study are to (a) demonstrate reduced execution time when a program's subroutine calls are executed on a computer (or computers) that yields better performance than the one on which the program was initiated, (b) demonstrate the feasibility of dynamic task-to-host binding, and (c) demonstrate the feasibility of a programmer-transparent methodology of distributed computing using a library approach. These goals are partially realized using the Remote Procedure Call protocol in a programmer-transparent framework of library calls. Examples of a distributed library, libHCS, and an associated daemon, HCSdaemon, implemented in support of these goals, are analyzed for their feasibility and effectiveness in solving this problem. Although the results of the study fail to demonstrate reduced execution time, dynamic task-to-host binding and programmer transparency were achieved. Further study is indicated.

    Energy-Efficient Neural Network Architectures

    Emerging systems for artificial intelligence (AI) are expected to rely on deep neural networks (DNNs) to achieve high accuracy for a broad variety of applications, including computer vision, robotics, and speech recognition. Due to the rapid growth of network size and depth, however, DNNs typically result in high computational costs and introduce considerable power and performance overheads. Dedicated chip architectures that implement DNNs with high energy efficiency are essential for adding intelligence to interactive edge devices, enabling them to complete increasingly sophisticated tasks by extending battery life. They are also vital for improving performance in cloud servers that support demanding AI computations. This dissertation focuses on architectures and circuit technologies for designing energy-efficient neural network accelerators. First, a deep-learning processor is presented for achieving ultra-low power operation. Using a heterogeneous architecture that includes a low-power always-on front-end and a selectively-enabled high-performance back-end, the processor dynamically adjusts computational resources at runtime to support conditional execution in neural networks and meet performance targets with increased energy efficiency. Featuring a reconfigurable datapath and a memory architecture optimized for energy efficiency, the processor supports multilevel dynamic activation of neural network segments, performing object detection tasks with 5.3x lower energy consumption in comparison with a static execution baseline. Fabricated in 40nm CMOS, the processor test-chip dissipates 0.23mW at 5.3 fps. It demonstrates energy scalability up to 28.6 TOPS/W and can be configured to run a variety of workloads, including severely power-constrained ones such as always-on monitoring in mobile applications.
    To further improve the energy efficiency of the proposed heterogeneous architecture, a new charge-recovery logic family, called zero-short-circuit current (ZSCC) logic, is proposed to decrease the power consumption of the always-on front-end. By relying on dedicated circuit topologies and a four-phase clocking scheme, ZSCC operates with significantly reduced short-circuit currents, realizing order-of-magnitude power savings at relatively low clock frequencies (on the order of a few MHz). The efficiency and applicability of ZSCC are demonstrated through an ANSI S1.11 1/3 octave filter bank chip for binaural hearing aids with two microphones per ear. Fabricated in a 65nm CMOS process, this charge-recovery chip consumes 13.8µW with a 1.75MHz clock frequency, achieving 9.7x power reduction per input in comparison with a 40nm monophonic single-input chip that represents the published state of the art. The ability of ZSCC to further increase the energy efficiency of the heterogeneous neural network architecture is demonstrated through the design and evaluation of a ZSCC-based front-end. Simulation results show 17x power reduction compared with a conventional static CMOS implementation of the same architecture. PhD, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/147614/1/hsiwu_1.pd
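    The power figures quoted in this abstract can be turned into per-operation energies with simple division. The worked arithmetic below is a reader's back-of-the-envelope estimate, not a result stated in the dissertation; it assumes the quoted powers are average values at the stated frame rate and clock frequency.

```latex
% Assumption: 0.23 mW is the average power while processing 5.3 frames per second.
\[
E_{\text{frame}} = \frac{P}{f_{\text{frame}}}
                 = \frac{0.23\,\text{mW}}{5.3\,\text{s}^{-1}}
                 \approx 43\,\mu\text{J per frame}
\]
% Assumption: 13.8 uW is the average power at a 1.75 MHz clock.
\[
E_{\text{cycle}} = \frac{P}{f_{\text{clk}}}
                 = \frac{13.8\,\mu\text{W}}{1.75\,\text{MHz}}
                 \approx 7.9\,\text{pJ per clock cycle}
\]
```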

    Parallel For Loops on Heterogeneous Resources

    In recent years, Graphics Processing Units (GPUs) have piqued the interest of researchers in scientific computing. Their immense floating point throughput and massive parallelism make them ideal not just for graphical applications, but for many general algorithms as well. Load balancing applications and taking advantage of all computational resources in a machine is a difficult challenge, especially when the resources are heterogeneous. This dissertation presents the clUtil library, which vastly simplifies developing OpenCL applications for heterogeneous systems. The core focus of this dissertation lies in clUtil's ParallelFor construct and our novel PINA scheduler, which can efficiently load balance work onto multiple GPUs and CPUs simultaneously.