1,161 research outputs found

    Dynamic Scheduling, Allocation, and Compaction Scheme for Real-Time Tasks on FPGAs

    Get PDF
    Run-time reconfiguration (RTR) is a method of computing on reconfigurable logic, typically FPGAs, changing hardware configurations from phase to phase of a computation at run-time. Recent research has expanded from a focus on a single application at a time to encompass a view of the reconfigurable logic as a resource shared among multiple applications or users. In real-time system design, task deadlines play an important role. Real-time multi-tasking systems not only need to support sharing of the resources in space, but also need to guarantee execution of the tasks. At the operating system level, sharing logic gates, wires, and I/O pins among multiple tasks needs to be managed. From the high level standpoint, access to the resources needs to be scheduled according to task deadlines. This thesis describes a task allocator for scheduling, placing, and compacting tasks on a shared FPGA under real-time constraints. Our consideration of task deadlines is novel in the setting of handling multiple simultaneous tasks in RTR. Software simulations have been conducted to evaluate the performance of the proposed scheme. The results indicate significant improvement by decreasing the number of tasks rejected

    Improving performance guarantees in wormhole mesh NoC designs

    Get PDF
    Wormhole-based mesh Networks-on-Chip (wNoC) are deployed in high-performance many-core processors due to their physical scalability and low-cost. Delivering tight and time composable Worst-Case Execution Time (WCET) estimates for applications as needed in safety-critical real-time embedded systems is challenged by wNoCs due to their distributed nature. We propose a bandwidth control mechanism for wNoCs that enables the computation of tight time-composable WCET estimates with low average performance degradation and high scalability. Our evaluation with the EEMBC automotive suite and an industrial real-time parallel avionics application confirms so.The research leading to these results is funded by the European Union Seventh Framework Programme under grant agreement no. 287519 (parMERASA) and by the Ministry of Science and Technology of Spain under contract TIN2012-34557. Milos Panic is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Carles Hernández is jointly funded by the Spanish Ministry of Economy and Competitiveness and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella is partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    Scalable Parallel Delaunay Image-to-Mesh Conversion for Shared and Distributed Memory Architectures

    Get PDF
    Mesh generation is an essential component for many engineering applications. The ability to generate meshes in parallel is critical for the scalability of the entire Finite Element Method (FEM) pipeline. However, parallel mesh generation applications belong to the broader class of adaptive and irregular problems, and are among the most complex, challenging, and labor intensive to develop and maintain. In this thesis, we summarize several years of the progress that we made in a novel framework for highly scalable and guaranteed quality mesh generation for finite element analysis in three dimensions. We studied and developed parallel mesh generation algorithms on both shared and distributed memory architectures. In this thesis we present a novel two-level parallel tetrahedral mesh generation framework capable of delivering and sustaining close to 6000 of concurrent work units (cores). We achieve this by leveraging concurrency at two different granularity levels by using a hybrid message passing and multi-threaded execution model which is suitable to the hierarchy of the hardware architecture of the distributed memory clusters. An end-user productivity and scalability study was performed on up to 6000 cores, and indicated very good end-user productivity with about 300 million tets per second and about 3600 weak scaling speedup. Both of these results suggest that: compared to the best previous algorithm, we have seen an improvement of more than 7000 times in performance, measured in terms of speed (elements per second) by using about 180 times more CPUs, for geometries that are by many orders of magnitude more complex

    Efficient processor allocation strategies for mesh-connected multicomputers

    Get PDF
    Abstract Efficient processor allocation and job scheduling algorithms are critical if the full computational power of large-scale multicomputers is to be harnessed effectively. Processor allocation is responsible for selecting the set of processors on which parallel jobs are executed, whereas job scheduling is responsible for determining the order in which the jobs are executed. Many processor allocation strategies have been devised for mesh-connected multicomputers and these can be divided into two main categories: contiguous and non-contiguous. In contiguous allocation, jobs are allocated distinct contiguous processor sub-meshes for the duration of their execution. Such a strategy could lead to high processor fragmentation which degrades system performance in terms of, for example, the turnaround time and system utilisation. In non-contiguous allocation, a job can execute on multiple disjoint smaller sub-meshes rather than waiting until a single sub-mesh of the requested size and shape is available. Although non-contiguous allocation increases message contention inside the network, lifting the contiguity condition can reduce processor fragmentation and increase system utilisation. Processor fragmentation can be of two types: internal and external. The former occurs when more processors are allocated to a job than it requires while the latter occurs when there are free processors enough in number to satisfy another job request, but they are not allocated to it because they are not contiguous. A lot of efforts have been devoted to reducing fragmentation, and a number of contiguous allocation strategies have been devised to recognize complete sub-meshes during allocation. Most of these strategies have been suggested for 2D mesh-connected multicomputers. However, although the 3D mesh has been the underlying network topology for a number of important multicomputers, there has been relatively little activity with regard to designing similar strategies for such a network. The very few contiguous allocation strategies suggested for the 3D mesh achieve complete sub-mesh recognition ability only at the expense of a high allocation overhead (i.e., allocation and de-allocation time). Furthermore, the allocation overhead in the existing contiguous strategies often grows with system size. The main challenge is therefore to devise an efficient contiguous allocation strategy that can exhibit good performance (e.g., a low job turnaround time and high system utilisation) with a low allocation overhead. The first part of the research presents a new contiguous allocation strategy, referred to as Turning Busy List (TBL), for 3D mesh-connected multicomputers. The TBL strategy considers only those available free sub-meshes which border from the left of those already allocated sub-meshes or which have their left boundaries aligned with that of the whole mesh network. Moreover TBL uses an efficient scheme to facilitate the detection of such available sub-meshes while maintaining a low allocation overhead. This is achieved through maintaining a list of allocated sub-meshes in order to efficiently determine the processors that can form an allocation sub-mesh for a new allocation request. The new strategy is able to identify a free sub-mesh of the requested size as long as it exists in the mesh. Results from extensive simulations under various operating loads reveal that TBL manages to deliver competitive performance (i.e., low turnaround times and high system utilisation) with a much lower allocation overhead compared to other well-known existing strategies. Most existing non-contiguous allocation strategies that have been suggested for the mesh suffer from several problems that include internal fragmentation, external fragmentation, and message contention inside the network. Furthermore, the allocation of processors to job requests is not based on free contiguous sub-meshes in these existing strategies. The second part of this research proposes a new non-contiguous allocation strategy, referred to as Greedy Available Busy List (GABL) strategy that eliminates both internal and external fragmentation and alleviates the contention in the network. GABL combines the desirable features of both contiguous and non-contiguous allocation strategies as it adopts the contiguous allocation used in our TBL strategy. Moreover, GABL is flexible enough in that it could be applied to either the 2D or 3D mesh. However, for the sake of the present study, the new non-contiguous allocation strategy is discussed for the 2D mesh and compares its performance against that of well-known non-contiguous allocation strategies suggested for this network. One of the desirable features of GABL is that it can maintain a high degree of contiguity between processors compared to the previous allocation strategies. This, in turn, decreases the number of sub-meshes allocated to a job, and thus decreases message distances, resulting in a low inter-processor communication overhead. The performance analysis here indicates that the new proposed strategy has lower turnaround time than the previous non-contiguous allocation strategies for most considered cases. Moreover, in the presence of high message contention due to heavy network traffic, GABL exhibits superior performance in terms of the turnaround time over the previous contiguous and non-contiguous allocation strategies. Furthermore, GABL exhibits a high system utilisation as it manages to eliminate both internal and external fragmentation. The performance of many allocation strategies including the ones suggested above, has been evaluated under the assumption that job execution times follow an exponential distribution. However, many measurement studies have convincingly demonstrated that the execution times of certain computational applications are best characterized by heavy-tailed job execution times; that is, many jobs have short execution times and comparatively few have very long execution times. Motivated by this observation, the final part of this thesis reviews the performance of several contiguous allocation strategies, including TBL, in the context of heavy-tailed distributions. This research is the first to analyze the performance impact of heavy-tailed job execution times on the allocation strategies suggested for mesh-connected multicomputers. The results show that the performance of the contiguous allocation strategies degrades sharply when the distribution of job execution times is heavy-tailed. Further, adopting an appropriate scheduling strategy, such as Shortest-Service-Demand (SSD) as opposed to First-Come-First-Served (FCFS), can significantly reduce the detrimental effects of heavy-tailed distributions. Finally, while the new contiguous allocation strategy (TBL) is as good as the best competitor of the previous contiguous allocation strategies in terms of job turnaround time and system utilisation, it is substantially more efficient in terms of allocation overhead

    A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation

    Full text link
    We implemented a fast Reciprocal Monte Carlo algorithm, to accurately solve radiative heat transfer in turbulent flows of non-grey participating media that can be coupled to fully resolved turbulent flows, namely to Direct Numerical Simulation (DNS). The spectrally varying absorption coefficient is treated in a narrow-band fashion with a correlated-k distribution. The implementation is verified with analytical solutions and validated with results from literature and line-by-line Monte Carlo computations. The method is implemented on GPU with a thorough attention to memory transfer and computational efficiency. The bottlenecks that dominate the computational expenses are addressed and several techniques are proposed to optimize the GPU execution. By implementing the proposed algorithmic accelerations, a speed-up of up to 3 orders of magnitude can be achieved, while maintaining the same accuracy

    pTNoC: Probabilistically time-analyzable tree-based NoC for mixed-criticality systems

    Get PDF
    The use of networks-on-chip (NoC) in real-time safety-critical multicore systems challenges deriving tight worst-case execution time (WCET) estimates. This is due to the complexities in tightly upper-bounding the contention in the access to the NoC among running tasks. Probabilistic Timing Analysis (PTA) is a powerful approach to derive WCET estimates on relatively complex processors. However, so far it has only been tested on small multicores comprising an on-chip bus as communication means, which intrinsically does not scale to high core counts. In this paper we propose pTNoC, a new tree-based NoC design compatible with PTA requirements and delivering scalability towards medium/large core counts. pTNoC provides tight WCET estimates by means of asymmetric bandwidth guarantees for mixed-criticality systems with negligible impact on average performance. Finally, our implementation results show the reduced area and power costs of the pTNoC.The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] under the PROXIMA Project (www.proxima-project.eu), grant agreement no 611085. This work has also been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Mladen Slijepcevic is funded by the Obra Social Fundación la Caixa under grant Doctorado “la Caixa” - Severo Ochoa. Carles Hern´andez is jointly funded by the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds through grant TIN2014-60404-JIN. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    Automatic synthesis and optimization of chip multiprocessors

    Get PDF
    The microprocessor technology has experienced an enormous growth during the last decades. Rapid downscale of the CMOS technology has led to higher operating frequencies and performance densities, facing the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.CMP architects are challenged to take many complex design decisions. Only a few of them are:- What should be the ratio between the core and cache areas on a chip?- Which core architectures to select?- How many cache levels should the memory subsystem have?- Which interconnect topologies provide efficient on-chip communication?These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under the hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by the many-core technology. These techniques have to consider the peculiarities of the modern architectures, such as availability of enhanced power saving techniques and presence of complex memory hierarchies.This thesis has several objectives. The first objective is to elaborate the methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and replacing the exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.Finally, the third objective of this thesis is to address the issues of the on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of the on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for the future research are proposed

    Parallel Architectures and Parallel Algorithms for Integrated Vision Systems

    Get PDF
    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to perform for a high level application (e.g., object recognition). An IVS normally involves algorithms from low level, intermediate level, and high level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues are addressed in parallel architectures and parallel algorithms for integrated vision systems

    Analysis of Various Decentralized Load Balancing Techniques with Node Duplication

    Get PDF
    Experience in parallel computing is an increasingly necessary skill for today’s upcoming computer scientists as processors are hitting a serial execution performance barrier and turning to parallel execution for continued gains. The uniprocessor system has now reached its maximum speed limit and, there is very less scope to improve the speed of such type of system. To solve this problem multiprocessor system is used, which have more than one processor. Multiprocessor system improves the speed of the system but it again faces some problems like data dependency, control dependency, resource dependency and improper load balancing. So this paper presents a detailed analysis of various decentralized load balancing techniques with node duplication to reduce the proper execution time

    Design Space Exploration for MPSoC Architectures

    Get PDF
    Multiprocessor system-on-chip (MPSoC) designs utilize the available technology and communication architectures to meet the requirements of the upcoming applications. In MPSoC, the communication platform is both the key enabler, as well as the key differentiator for realizing efficient MPSoCs. It provides product differentiation to meet a diverse, multi-dimensional set of design constraints, including performance, power, energy, reconfigurability, scalability, cost, reliability and time-to-market. The communication resources of a single interconnection platform cannot be fully utilized by all kind of applications, such as the availability of higher communication bandwidth for computation but not data intensive applications is often unfeasible in the practical implementation. This thesis aims to perform the architecture-level design space exploration towards efficient and scalable resource utilization for MPSoC communication architecture. In order to meet the performance requirements within the design constraints, careful selection of MPSoC communication platform, resource aware partitioning and mapping of the application play important role. To enhance the utilization of communication resources, variety of techniques such as resource sharing, multicast to avoid re-transmission of identical data, and adaptive routing can be used. For implementation, these techniques should be customized according to the platform architecture. To address the resource utilization of MPSoC communication platforms, variety of architectures with different design parameters and performance levels, namely Segmented bus (SegBus), Network-on-Chip (NoC) and Three-Dimensional NoC (3D-NoC), are selected. Average packet latency and power consumption are the evaluation parameters for the proposed techniques. In conventional computing architectures, fault on a component makes the connected fault-free components inoperative. Resource sharing approach can utilize the fault-free components to retain the system performance by reducing the impact of faults. Design space exploration also guides to narrow down the selection of MPSoC architecture, which can meet the performance requirements with design constraints.Siirretty Doriast
    corecore