
    Using Reduced Graphs for Efficient HLS Scheduling

    High-Level Synthesis (HLS) is the process of inferring a digital circuit from a high-level algorithmic description provided as a software implementation, usually in C/C++. HLS tools parse the input code and then perform three main steps: allocation, scheduling, and binding. The result is a hardware architecture that can be represented as a Register-Transfer Level (RTL) model in a Hardware Description Language (HDL) such as VHDL or Verilog. Allocation determines the amount of resources needed, scheduling finds the order in which operations should occur, and binding maps operations onto the allocated hardware resources. The two main challenges of scheduling are its computational complexity and its memory requirements. Finding an optimal schedule is an NP-hard problem, so many tools use elaborate heuristics to find a solution that satisfies the prescribed implementation constraints. These heuristics require the Control/Data Flow Graph (CDFG), a representation of all operations and their dependencies, which must be stored in its entirety and therefore uses large amounts of memory. This thesis presents a new scheduling approach for use in the HLS tool chain. The new technique schedules operations with an algorithm that operates on a reduced representation of the graph and does not need to retain individual dependency information in order to generate a schedule. Using the simplified graph significantly reduces the complexity of scheduling, resulting in lower memory usage and lower computational effort. The new scheduler is implemented and compared to the existing scheduler in the open source version of the LegUp HLS tool. The results demonstrate an average 16-fold speedup in the time required to determine the schedule, with a fraction of the memory usage (1/5 on average). This is achieved at a cost of 0 to 6% in added execution time on the final hardware.
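
To make the scheduling step concrete, the following is a minimal sketch of as-soon-as-possible (ASAP) scheduling on a full dependency graph. It is not the reduced-graph algorithm of the thesis, which the abstract does not describe, and it ignores resource constraints. The function name `asap_schedule`, the operation names, and the latencies are hypothetical and chosen only for illustration.

```python
from collections import deque


def asap_schedule(deps, latency):
    """Assign each operation the earliest start cycle its dependencies allow.

    deps:    maps each operation to the list of operations it depends on
    latency: maps each operation to its duration in cycles
    (Both structures are illustrative assumptions, not taken from LegUp.)
    """
    # Build successor lists and count unresolved predecessors per operation.
    succs = {op: [] for op in deps}
    pending = {op: len(preds) for op, preds in deps.items()}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    start = {op: 0 for op in deps}
    ready = deque(op for op, n in pending.items() if n == 0)

    # Process operations in topological order (Kahn's algorithm).
    while ready:
        op = ready.popleft()
        finish = start[op] + latency[op]
        for s in succs[op]:
            # A successor cannot start before all of its inputs have finished.
            start[s] = max(start[s], finish)
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)

    if any(n > 0 for n in pending.values()):
        raise ValueError("dependency graph contains a cycle")
    return start


# Hypothetical example: add = (a * b) + a
deps = {"a": [], "b": [], "mul": ["a", "b"], "add": ["mul", "a"]}
latency = {"a": 1, "b": 1, "mul": 2, "add": 1}
print(asap_schedule(deps, latency))  # {'a': 0, 'b': 0, 'mul': 1, 'add': 3}
```

Note that this sketch keeps every dependency edge in memory, which is the storage cost the abstract identifies as a motivation for a reduced representation.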

    Solving the Uncapacitated Single Allocation p-Hub Median Problem on GPU

    A parallel genetic algorithm (GA) implemented on GPU clusters is proposed to solve the Uncapacitated Single Allocation p-Hub Median problem. The GA uses binary and integer encoding and genetic operators adapted to this problem. The GA is improved by generating initial solutions with hubs located at middle nodes. The experimental results are compared with the best known solutions on all benchmark instances of up to 1000 nodes. Furthermore, we solve our own randomly generated instances of up to 6000 nodes. Our approach outperforms most well-known heuristics in terms of solution quality and execution time, and it allows hitherto unsolved problems to be solved.
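
As a rough illustration of what such a GA would evaluate for each candidate, the sketch below computes the usual p-hub median objective for a single-allocation solution: every origin-destination flow travels from its origin to the origin's hub, between hubs at a discounted rate, and from the destination's hub to the destination. The abstract does not give the formulation or the encoding, so the function name `usaphmp_cost`, the `assign` list (node index to hub index), and the default cost factors `chi`, `alpha`, and `delta` are all assumptions.

```python
def usaphmp_cost(flow, dist, assign, chi=1.0, alpha=0.75, delta=1.0):
    """Total transport cost of a single-allocation hub assignment.

    flow[i][j]: demand from node i to node j
    dist[i][j]: distance between nodes i and j
    assign[i]:  hub that node i is allocated to (a hub is allocated to itself)
    chi, alpha, delta: collection, inter-hub transfer, and distribution
                       cost factors (default values are assumed, not sourced)
    """
    n = len(flow)
    total = 0.0
    for i in range(n):
        for j in range(n):
            k, m = assign[i], assign[j]
            # Path: i -> hub k -> hub m -> j
            path = chi * dist[i][k] + alpha * dist[k][m] + delta * dist[m][j]
            total += flow[i][j] * path
    return total


# Hypothetical instance: three nodes on a line, unit flow between distinct nodes.
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
flow = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(usaphmp_cost(flow, dist, assign=[1, 1, 1]))  # single hub at node 1 -> 8.0
```

Each evaluation is independent of the others, which is one reason population-based methods of this kind map naturally onto GPUs.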

    The continuous p-centre problem: An investigation into variable neighbourhood search with memory

    A VNS-based heuristic using both a facility-type and a customer-type neighbourhood structure is proposed to solve the p-centre problem in continuous space. Simple but effective enhancements to the original Elzinga-Hearn algorithm, as well as a powerful ‘locate-allocate’ local search used within VNS, are proposed. In addition, efficient implementations of both neighbourhood structures are presented. A learning scheme is also embedded into the search to produce a new variant of VNS that uses memory. The effect of incorporating strong intensification within the local search via a VND-type structure is also explored, with interesting results. Empirical results, based on several existing data sets (TSP-Lib) with various values of p, show that the proposed VNS implementations outperform both a multi-start heuristic and the discrete-based optimal approach that use the same local search.

    Analysis of Parallel Montgomery Multiplication in CUDA

    For a given level of security, elliptic curve cryptography (ECC) offers improved efficiency over classic public key implementations. Point multiplication is the most common operation in ECC and, consequently, any significant improvement in performance will likely require accelerating point multiplication. In ECC, the Montgomery algorithm is widely used for point multiplication. The primary purpose of this project is to implement and analyze a parallel implementation of the Montgomery algorithm as it is used in ECC. Specifically, the performance of CPU-based Montgomery multiplication and a GPU-based implementation in CUDA are compared.
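
For orientation, here is a minimal sequential sketch of Montgomery modular multiplication, the operation named in the title. It is not the project's CUDA implementation, and the abstract does not say how the work is split across threads. The function names and the small modulus are hypothetical; the code assumes an odd modulus `n` smaller than `2**bits` and Python 3.8 or later for `pow(n, -1, r)`.

```python
def montgomery_setup(n, bits):
    """Return R = 2**bits and n' with n * n' == -1 (mod R). Requires odd n < R."""
    r = 1 << bits
    n_prime = (-pow(n, -1, r)) % r  # modular inverse via pow (Python 3.8+)
    return r, n_prime


def redc(t, n, bits, n_prime):
    """Montgomery reduction: return t * R^-1 mod n for 0 <= t < R * n."""
    mask = (1 << bits) - 1
    m = ((t & mask) * n_prime) & mask  # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> bits            # exact division by R is a shift
    return u - n if u >= n else u


def mont_mul(a, b, n, bits):
    """Compute a * b mod n through the Montgomery domain."""
    r, n_prime = montgomery_setup(n, bits)
    a_bar = (a * r) % n                # convert operands to Montgomery form
    b_bar = (b * r) % n
    prod_bar = redc(a_bar * b_bar, n, bits, n_prime)  # equals a*b*R mod n
    return redc(prod_bar, n, bits, n_prime)           # convert back to a*b mod n


print(mont_mul(7, 15, 97, 8))  # 8, since 7 * 15 = 105 = 97 + 8
```

The conversions into and out of Montgomery form cost ordinary reductions, so the method pays off when many multiplications share one modulus, as in repeated field operations during point multiplication.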

    A unified modulo scheduling and register allocation technique for clustered processors

    This work presents a modulo scheduling framework for clustered ILP processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. This unified approach is more effective than traditional approaches based on sequentially performing some (or all) of the three steps, since it optimizes the global code generation problem instead of searching for optimal solutions to each individual step. It also avoids the iterative nature of traditional approaches, which require repeated applications of the three steps until a valid solution is found. The proposed framework includes a mechanism to insert spill code on-the-fly and heuristics to evaluate the quality of partial schedules, simultaneously considering inter-cluster communications, memory pressure and register pressure. Transformations that allow trading pressure on one type of resource for pressure on another are also included. We show that the proposed technique outperforms previously proposed techniques. For instance, the average speed-up for SPECfp95 is 36% for a 4-cluster configuration.