254 research outputs found

    Parallel and Distributed Computing

    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. In particular, the topics addressed are programmable and reconfigurable devices and systems, dependability of GPUs (graphics processing units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peer-to-peer networks, large-scale network simulation, and parallel routines and algorithms. The articles included in this book thus constitute an excellent reference for engineers and researchers who have particular interests in any of these topics in parallel and distributed computing.

    Self-adaptivity of applications on network on chip multiprocessors: the case of fault-tolerant Kahn process networks

    Technology scaling, accompanied by higher operating frequencies and the ability to integrate more functionality in the same chip, has been the driving force behind delivering higher performance computing systems at lower costs. Embedded computing systems, which have been riding the same wave of success, have evolved into complex architectures encompassing a high number of cores interconnected by an on-chip network (usually identified as Multiprocessor System-on-Chip). However, these trends are hindered by issues that arise as technology scaling continues towards deep submicron scales. Firstly, the growing complexity of these systems and the variability introduced by process technologies make it ever harder to perform a thorough optimization of the system at design time. Secondly, designers are faced with a reliability wall that emerges as age-related degradation reduces the lifetime of transistors, and as the probability of defects escaping post-manufacturing testing increases. In this thesis, we take on these challenges within the context of streaming applications running in network-on-chip based parallel (not necessarily homogeneous) systems-on-chip that adopt the no-remote memory access model. In particular, this thesis tackles two main problems: (1) fault-aware online task remapping, and (2) application-level self-adaptation for quality management. For the former, by viewing fault tolerance as a self-adaptation aspect, we adopt a cross-layer approach that aims at graceful performance degradation by addressing permanent faults in processing elements mostly at system-level, in particular by exploiting redundancy available in multi-core platforms. We propose an optimal solution based on an integer linear programming formulation (suitable for design time adoption) as well as heuristic-based solutions to be used at run-time. We assess the impact of our approach on the lifetime reliability.
We propose two recovery schemes based on a checkpoint-and-rollback technique and a rollforward technique. For the latter problem, we propose two variants of a monitor-controller-adapter loop that adapts application-level parameters to meet performance goals. We demonstrate not only that fault tolerance and self-adaptivity can be achieved in embedded platforms, but also that this can be done without incurring large overheads. In addressing these problems, we present techniques which have been realized (depending on their characteristics) in the form of a design tool, a run-time library, or a hardware core to be added to the basic architecture.

    Exploiting Multi-Level Parallelism in Streaming Applications for Heterogeneous Platforms with GPUs

    Heterogeneous computing platforms support the traditional types of parallelism, such as instruction-level, data, task, and pipeline parallelism, and provide the opportunity to exploit a combination of different types of parallelism at different platform levels. The architectural diversity of platform components makes tapping into the platform potential a challenging programming task. This thesis makes an important step in this direction by introducing a novel methodology for automatic generation of structured, multi-level parallel programs from sequential applications. We introduce a novel hierarchical intermediate program representation (HiPRDG) that captures the notions of structure and hierarchy in the polyhedral model used for compile-time program transformation and code generation. Using the HiPRDG as the starting point, we present a novel method for generation of multi-level programs (MLPs) featuring different types of parallelism, such as task, data, and pipeline parallelism. Moreover, we introduce concepts and techniques for data parallelism identification, GPU code generation, and asynchronous data-driven execution on heterogeneous platforms with efficient overlapping of host-accelerator communication and computation. By enabling the modular, hybrid parallelization of program model components via HiPRDG, this thesis opens the door for highly efficient tailor-made parallel program generation and auto-tuning for next generations of multi-level heterogeneous platforms with diverse accelerators.

    Estimation and Optimization of the Performance of Polyhedral Process Networks

    A system-level design methodology such as Daedalus provides designers with a forward synthesis flow for automated design, programming, and implementation of multiprocessor systems-on-chip. Daedalus employs the polyhedral process network model of computation to represent applications. These networks are automatically derived from sequential C code. A forward synthesis flow greatly increases designer productivity. Still, the designer needs to perform a time-consuming forward synthesis step to learn if a network satisfies his performance constraints. Furthermore, it is not trivial to select a set of transformations and transformation parameters for a network such that performance requirements are met. A forward synthesis flow thus solves only part of a design problem, as it does not provide fast feedback on the transformations a designer should apply to meet his performance constraints. This dissertation introduces different performance estimation techniques for polyhedral process networks. The most promising technique is the profiling-based cprof technique that works directly on the sequential application code. This makes cprof an easy-to-use, robust, and fast technique, without the need to derive a polyhedral process network. This dissertation then discusses four transformations and analyzes factors that affect the efficacy of each transformation.

    Enabling 5G Technologies

    The increasing demand for connectivity and broadband wireless access is leading to the fifth generation (5G) of cellular networks. The overall scope of 5G is greater in client width and diversity than in previous generations, requiring substantial changes to network topologies and air interfaces. This divergence from existing network designs is prompting a massive growth in research, with the U.S. government alone investing $400 million in advanced wireless technologies. 5G is projected to enable the connectivity of 20 billion devices by 2020, and to dominate such areas as vehicular networking and the Internet of Things. However, many challenges must be overcome to enable large-scale deployment and general adoption by the cellular industry. In this dissertation, we propose three new additions to the literature to further the progression of 5G development. These additions approach 5G from top-down and bottom-up perspectives, considering interference modeling and physical layer prototyping. Heterogeneous deployments are considered from a purely analytical perspective, modeling co-channel interference between and among both macrocell and femtocell tiers. We further enhance these models with parameterized directional antennas and integrate them into a novel mixed point process study of the network. At the air interface, we examine Software-Defined Radio (SDR) development of physical link level simulations. First, we introduce a new algorithm acceleration framework for MATLAB, enabling real-time and concurrent applications. Extensible beyond SDR alone, this dataflow framework can provide application speedup for stream-based or data-dependent processing. Furthermore, using SDRs we develop a localization testbed for dense deployments of 5G smallcells. The testbed provides real-time tracking of targets using foundational direction-of-arrival estimation techniques, including a new OFDM-based correlation implementation.

    MULTI-SCALE SCHEDULING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS

    A variety of hardware platforms for signal processing has emerged, from distributed systems such as Wireless Sensor Networks (WSNs) to parallel systems such as Multicore Programmable Digital Signal Processors (PDSPs), Multicore General Purpose Processors (GPPs), and Graphics Processing Units (GPUs) to heterogeneous combinations of parallel and distributed devices. When a signal processing application is implemented on one of those platforms, the performance critically depends on the scheduling techniques, which in general allocate computation and communication resources for competing processing tasks in the application to optimize performance metrics such as power consumption, throughput, latency, and accuracy. Signal processing systems implemented on such platforms typically involve multiple levels of processing and communication hierarchy, such as network-level, chip-level, and processor-level in a structural context, and application-level, subsystem-level, component-level, and operation- or instruction-level in a behavioral context. In this thesis, we target scheduling issues that carefully address and integrate scheduling considerations at different levels of these structural and behavioral hierarchies. The core contributions of the thesis include the following. Considering both the network-level and chip-level, we have proposed an adaptive scheduling algorithm for wireless sensor networks (WSNs) designed for event detection. Our algorithm exploits discrepancies among the detection accuracy of individual sensors, which are derived from a collaborative training process, to allow each sensor to operate in a more energy efficient manner while the network satisfies given constraints on overall detection accuracy. Considering the chip-level and processor-level, we incorporated both temperature and process variations to develop new scheduling methods for throughput maximization on multicore processors. 
In particular, we studied how to process a large number of threads with high speed and without violating a given maximum temperature constraint. We targeted our methods to multicore processors in which the cores may operate at different frequencies and different levels of leakage. We develop speed selection and thread assignment schedulers based on the notion of a core's steady state temperature. Considering the application-level, component-level and operation-level, we developed a new dataflow based design flow within the targeted dataflow interchange format (TDIF) design tool. Our new multiprocessor system-on-chip (MPSoC)-oriented design flow, called TDIF-PPG, is geared towards analysis and mapping of embedded DSP applications on MPSoCs. An important feature of TDIF-PPG is its capability to integrate graph level parallelism and actor level parallelism into the application mapping process. Here, graph level parallelism is exposed by the dataflow graph application representation in TDIF, and actor level parallelism is modeled by a novel model for multiprocessor dataflow graph implementation that we call the Parallel Processing Group (PPG) model. Building on the contribution above, we formulated a new type of parallel task scheduling problem called Parallel Actor Scheduling (PAS) for chip-level MPSoC mapping of DSP systems that are represented as synchronous dataflow (SDF) graphs. In contrast to traditional SDF-based scheduling techniques, which focus on exploiting graph level (inter-actor) parallelism, the PAS problem targets the integrated exploitation of both intra- and inter-actor parallelism for platforms in which individual actors can be parallelized across multiple processing units. We address a special case of the PAS problem in which all of the actors in the DSP application or subsystem being optimized can be parallelized. 
For this special case, we develop and experimentally evaluate a two-phase scheduling framework with three work flows: particle swarm optimization with a mixed integer programming formulation, particle swarm optimization with a simulated annealing engine, and particle swarm optimization with a fast heuristic based on list scheduling. Then, we extend our scheduling framework to support the general PAS problem, which also considers actors that cannot be parallelized.
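
For readers unfamiliar with the list scheduling heuristic mentioned in the abstract above, the following is a minimal illustrative sketch in Python. It is not the thesis's PAS framework and does not include the particle swarm optimization phase or intra-actor parallelism; it only shows classic list scheduling of a task graph on identical processors. The function name `list_schedule`, the priority rule (longest remaining path), and the example task graph are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def list_schedule(durations, deps, num_procs):
    """Greedy list scheduling of a task DAG on identical processors.

    durations: dict mapping task -> execution time (hypothetical input format)
    deps:      dict mapping task -> list of predecessor tasks
    Returns a dict mapping task -> (processor, start, finish).
    """
    succs = defaultdict(list)
    remaining = {t: 0 for t in durations}  # unscheduled predecessors per task
    for task, preds in deps.items():
        for p in preds:
            succs[p].append(task)
            remaining[task] += 1

    # Priority: length of the longest path from the task to a sink.
    level = {}

    def bottom_level(t):
        if t not in level:
            level[t] = durations[t] + max(
                (bottom_level(s) for s in succs[t]), default=0
            )
        return level[t]

    for t in durations:
        bottom_level(t)

    proc_free = [0] * num_procs            # time at which each processor is idle
    ready_time = {t: 0 for t in durations}  # earliest start allowed by predecessors
    ready = [(-level[t], t) for t in durations if remaining[t] == 0]
    heapq.heapify(ready)

    schedule = {}
    while ready:
        _, t = heapq.heappop(ready)  # highest-priority ready task
        # Choose the processor on which the task can start earliest.
        proc = min(range(num_procs),
                   key=lambda p: max(proc_free[p], ready_time[t]))
        start = max(proc_free[proc], ready_time[t])
        finish = start + durations[t]
        proc_free[proc] = finish
        schedule[t] = (proc, start, finish)
        for s in succs[t]:
            ready_time[s] = max(ready_time[s], finish)
            remaining[s] -= 1
            if remaining[s] == 0:
                heapq.heappush(ready, (-level[s], s))
    return schedule


if __name__ == "__main__":
    # Hypothetical four-task graph: c depends on a; d depends on a and b.
    durations = {"a": 2, "b": 3, "c": 1, "d": 2}
    deps = {"c": ["a"], "d": ["a", "b"]}
    result = list_schedule(durations, deps, num_procs=2)
    for task, (proc, start, finish) in sorted(result.items()):
        print(task, proc, start, finish)
```

In this sketch a task becomes ready once all of its predecessors have been placed, and it is assigned to whichever processor lets it start earliest. On the example graph with two processors the resulting makespan is 5 time units.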