    Comparing barrier algorithms

    A barrier is a method for synchronizing a large number of concurrent computer processes. After considering some basic synchronization mechanisms, a collection of barrier algorithms with either linear or logarithmic depth is presented. A graphical model is described that profiles the execution of the barriers and other parallel programming constructs. This model shows how the interaction between the barrier algorithms and the work that they synchronize can impact their performance. One result is that logarithmic tree-structured barriers show good performance when synchronizing fixed-length work, while linear self-scheduled barriers show better performance when synchronizing fixed-length work with an embedded critical section. The linear barriers are better able to exploit the process skew associated with critical sections. Timing experiments performed on an eighteen-processor Flex/32 shared-memory multiprocessor support these conclusions and are described in detail.
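As a rough illustration of the linear case described above, the following is a minimal sketch of a centralized counter barrier in Python. It is not the paper's implementation: the class `CounterBarrier` and the helper `run_phases` are hypothetical names, and the sketch relies on a single shared counter guarded by a condition variable, so arrivals are serialized (linear depth), whereas a tree-structured barrier would combine arrivals in roughly log2(n) levels.

```python
import threading


class CounterBarrier:
    """Centralized (linear) barrier: every arrival updates one shared counter."""

    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0  # incremented each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival: reset the counter for reuse and release everyone.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Earlier arrivals sleep until the generation changes.
                while gen == self.generation:
                    self.cond.wait()


def run_phases(n_threads=4, n_phases=3):
    """Run n_threads workers through n_phases, separated by the barrier."""
    barrier = CounterBarrier(n_threads)
    log = []
    log_lock = threading.Lock()

    def worker(tid):
        for phase in range(n_phases):
            with log_lock:
                log.append((phase, tid))  # stand-in for the synchronized work
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because no worker can log phase k+1 until every worker has reached the barrier after phase k, the phase numbers in the returned log never decrease.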

    An Efficient Online Benefit-aware Multiprocessor Scheduling Technique for Soft Real-Time Tasks Using Online Choice of Approximation Algorithms

    Maximizing the benefit gained by soft real-time tasks is essential to providing an acceptable QoS (Quality of Service) in many applications and embedded systems. Examples of such applications and embedded systems include real-time medical monitoring systems, video-streaming servers, multiplayer video games, and mobile multimedia devices. In these systems, tasks are not equally critical (or beneficial). Each task comes with its own benefit-density function, which can differ from those of the other tasks. The sooner a task completes, the more benefit it gains. In this work, a novel online benefit-aware preemptive approach is presented to enhance the scheduling of soft real-time aperiodic and periodic tasks in multiprocessor systems. The objective of this work is to enhance the QoS by increasing the total benefit while reducing flow times and deadline misses. This method prioritizes the tasks using their benefit-density functions, which imply their importance to the system, and schedules them on a real-time basis. The first model I propose is for scheduling soft real-time aperiodic tasks. An online choice of two approximation algorithms, greedy and load-balancing, is used to distribute the low-priority tasks among identical processors at the time of their arrival without using any statistics. The results of theoretical analysis and simulation experiments show that this method is able to maximize the gained benefit and decrease the computational complexity (compared to existing algorithms) while minimizing makespan with fewer missed deadlines and more balanced usage of processors. I also propose two more versions of this algorithm for scheduling soft real-time (SRT) periodic tasks, with implicit and non-implicit deadlines, in addition to another version with a modified load-balancing factor. 
Extensive simulation experiments and an empirical comparison of these algorithms with the state of the art, using different utilization levels and various benefit-density functions, show that these new techniques outperform the existing ones. A general framework for benefit-aware multiprocessor scheduling in applications with periodic, aperiodic, or mixed real-time tasks is also provided in this work.
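The abstract does not spell out its greedy and load-balancing algorithms, so the sketch below swaps in a well-known stand-in: classic greedy list scheduling, which places each arriving task on the currently least-loaded of several identical processors. The function name `assign_least_loaded` and the use of task lengths as the only input are assumptions made for illustration; benefit-density functions and preemption are not modeled.

```python
import heapq


def assign_least_loaded(task_lengths, n_procs):
    """Assign tasks, in arrival order, to the least-loaded identical processor.

    Returns (assignment, loads), where assignment[i] is the processor chosen
    for task i and loads[p] is the total work placed on processor p.
    """
    # Min-heap of (current load, processor id); ties go to the lower id.
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    assignment = []
    for length in task_lengths:
        load, proc = heapq.heappop(heap)
        assignment.append(proc)
        heapq.heappush(heap, (load + length, proc))
    loads = [0.0] * n_procs
    for load, proc in heap:
        loads[proc] = load
    return assignment, loads
```

For example, `assign_least_loaded([3, 1, 2, 2], 2)` places the tasks on processors `[0, 1, 1, 0]`, giving loads of 5 and 3 and therefore a makespan of 5.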

    Parallel architectures and runtime systems co-design for task-based programming models

    The increasing level of parallelism in modern computing systems has heightened the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other, from the architecture up to the application software. Although this design allows a separation of concerns, in which layers can be changed independently thanks to a well-known interface between them, it is hampering future systems design as Moore's Law comes to an end. Current performance improvements in computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power-efficient chips to be made. However, technology is reaching physical limits where the transistor size cannot be reduced any further, which requires a change of paradigm in systems design. This thesis proposes to break this layered design and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and make possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform more efficient communication scheduling, which also creates new opportunities for computation and communication overlap that were not possible before. 
An evaluation of the proposals introduced in this thesis is provided, and a methodology to simulate and characterize application behavior is also presented. The increased parallelism provided by modern computing systems has created the need for a holistic vision in the design of multiprocessor architectures that takes into account the needs of programming models and applications. Today, computer design consists of different abstraction layers with a well-defined interface between them. The limitations of this approach, together with the end of Moore's Law, limit the potential of future computers. Most current improvements in computer design come fundamentally from reducing the transistor channel size, which allows faster and more power-efficient chips with hardly any fundamental changes to the architecture design. However, current technology is reaching physical limits where it will no longer be possible to reduce transistor size, motivating a paradigm shift in how computers are built. This thesis proposes breaking this layered design and advocates a system in which the architecture and the programming model's runtime system can exchange information to reach a common goal: improving performance and reducing energy consumption. By making the architecture aware of information available in the programming model, such as the task dependency graph in dataflow programming models, it is possible to reduce energy consumption by exploiting the critical path of the graph. In addition, the architecture can provide hardware support for creating this graph, with the aim of reducing the overhead of building it when task granularity is too fine. 
Finally, the state of inter-node communication can be exposed to the runtime system to achieve better communication scheduling, creating new opportunities for overlapping computation and communication that were not possible before. This thesis provides an evaluation of all these proposals, as well as a methodology for simulating and characterizing application behavior.
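To make the idea of exploiting the critical path of a task dependency graph more concrete, here is a minimal sketch that computes the longest (critical) path through a small DAG of tasks. It only illustrates the graph computation, not the thesis's runtime or hardware mechanism; the function name `critical_path` and the input layout (a duration per task plus a mapping from each task to its predecessors) are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest path through a task graph.

    durations: dict mapping task -> execution time
    deps:      dict mapping task -> iterable of predecessor tasks
    Every task named in deps is assumed to appear in durations.
    """
    graph = {task: set(deps.get(task, ())) for task in durations}
    finish = {}  # earliest finish time of each task
    prev = {}    # predecessor that determines each task's start time
    for task in TopologicalSorter(graph).static_order():
        # A task starts when its latest-finishing predecessor is done.
        latest = max(graph[task], key=lambda p: finish[p], default=None)
        start = finish[latest] if latest is not None else 0
        finish[task] = start + durations[task]
        prev[task] = latest
    # Walk back from the task that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]
```

With durations `{"a": 2, "b": 3, "c": 1, "d": 4}` and dependencies `{"c": ["a", "b"], "d": ["c"]}`, the critical path is `b`, `c`, `d` with length 8. Tasks off that path (here `a`) have slack, which is the kind of information a power-aware system could use to run them more slowly.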

    Energy-efficient thermal-aware multiprocessor scheduling for real-time tasks using TCPNs

    We present an energy-efficient thermal-aware real-time global scheduler for a set of hard real-time (HRT) tasks running on a multiprocessor system. This global scheduler fulfills the thermal and temporal constraints by handling two independent variables, the task allocation time and the selection of clock frequency. To achieve its goal, the proposed scheduler is split into two stages. An off-line stage, based on a deadline partitioning scheme, computes the cycles that the HRT tasks must run per deadline interval at the minimum clock frequency to save energy while honoring the temporal and thermal constraints, and computes the maximum frequency at which the system can run below the maximum temperature. Then, an on-line, event-driven stage performs global task allocation applying a Fixed-Priority Zero-Laxity policy, reducing the overhead of quantum-based or interval-based global schedulers. The on-line stage embodies an adaptive scheduler that accepts or rejects soft RT aperiodic tasks, throttling CPU frequency to the upper lowest available one to minimize power consumption while meeting time and thermal constraints. This approach leverages the best of two worlds: the off-line stage computes an ideal discrete HRT multiprocessor schedule, while the on-line stage manages soft real-time aperiodic tasks with minimum power consumption and maximum CPU utilization.
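The frequency-selection idea in this abstract can be sketched with simple arithmetic: given the cycles a task set must execute within a deadline interval, pick the lowest available clock frequency that finishes in time without exceeding a thermally safe maximum. This is a simplified illustration only, not the TCPN-based scheduler; the function `lowest_feasible_frequency`, its parameters, and the example frequency levels are all hypothetical.

```python
def lowest_feasible_frequency(cycles, interval_s, available_hz, f_max_thermal_hz):
    """Pick the lowest frequency that meets the deadline and the thermal cap.

    cycles:           cycles that must execute within the interval
    interval_s:       length of the deadline interval in seconds
    available_hz:     discrete frequency levels the processor offers (assumed)
    f_max_thermal_hz: highest frequency that stays below the maximum temperature
    Returns the chosen frequency in Hz, or None if no level is feasible.
    """
    for freq in sorted(available_hz):
        meets_deadline = cycles / freq <= interval_s
        if freq <= f_max_thermal_hz and meets_deadline:
            return freq
    return None
```

For instance, 3 million cycles in a 4 ms interval need at least 750 MHz, so with levels of 500 MHz, 800 MHz, 1 GHz, and 1.2 GHz and a thermal cap of 1 GHz the sketch selects 800 MHz. If the thermal cap were below 750 MHz, no level would qualify and the function would return `None`.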

    Communication Awareness

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (graphics processing units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. The articles included in this book thus constitute an excellent reference for engineers and researchers with particular interests in any of these topics in parallel and distributed computing.

    A CPU-GPU Hybrid Approach for Accelerating Cross-correlation Based Strain Elastography

    Elastography is a non-invasive imaging modality that uses ultrasound to estimate the elasticity of soft tissues. The resulting images are called 'elastograms'. Elastography techniques are promising as cost-effective tools in the early detection of pathological changes in soft tissues. The quality of elastographic images depends on the accuracy of the local displacement estimates. Cross-correlation based displacement estimators are precise and sensitive. However, cross-correlation based techniques are computationally intensive and may limit the use of elastography as a real-time diagnostic tool. This study investigates the use of parallel general purpose graphics processing unit (GPGPU) engines for speeding up generation of elastograms at real-time frame rates while preserving elastographic image quality. To achieve this goal, a cross-correlation based time-delay estimation algorithm was developed in the C programming language and was profiled to locate performance bottlenecks. The hotspots were addressed by employing software pipelining and read-ahead and by eliminating redundant computations. The algorithm was then analyzed for parallelization on GPGPU, and the stages that would map well to the GPGPU hardware were identified. By employing optimization principles for efficient memory access and efficient execution, a net improvement of 67x with respect to the original optimized C version of the estimator was achieved. For typical diagnostic depths of 3-4 cm and elastographic processing parameters, this implementation can yield elastographic frame rates on the order of 50 fps. It was also observed that not all of the stages in elastography can be offloaded to the GPGPU for computation, because some stages have sub-optimal memory access patterns. Additionally, data transfer from graphics card memory to system memory can be efficiently overlapped with concurrent CPU execution. 
Therefore, a hybrid model of computation, in which the computational load is optimally distributed between CPU and GPGPU, was identified as an optimal approach to adequately tackle the speed-quality problem in real-time imaging. The results of this research suggest that use of GPGPU as a co-processor to the CPU may allow generation of elastograms at real-time frame rates without significant compromise in image quality, a scenario that could be very favorable in real-time clinical elastography.
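The core operation this abstract accelerates, cross-correlation based time-delay estimation, can be sketched in a few lines. The version below is a plain, unoptimized Python illustration rather than the study's C or GPGPU code: it searches a range of integer lags and returns the one that maximizes the normalized cross-correlation between a pre-compression and a post-compression signal window. The function name `estimate_delay` and its parameters are assumptions.

```python
import math


def estimate_delay(pre, post, max_lag):
    """Return the integer lag (in samples) that best aligns post with pre.

    A positive result means features in post appear later than in pre.
    """

    def ncc(lag):
        # Overlapping sample pairs for this lag.
        pairs = [
            (pre[i], post[i + lag])
            for i in range(len(pre))
            if 0 <= i + lag < len(post)
        ]
        num = sum(a * b for a, b in pairs)
        den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
        # Windows with no signal energy carry no alignment information.
        return num / den if den > 0 else 0.0

    return max(range(-max_lag, max_lag + 1), key=ncc)
```

The search is independent for every window and every lag, which is one reason this kind of estimator tends to map well onto data-parallel hardware. A real estimator would typically also refine the peak to sub-sample precision, which this sketch omits.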