
    Flexible compiler-managed L0 buffers for clustered VLIW processors

    Wire delays are a major concern for current and forthcoming processors. One approach to attacking this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache increases, leading to an important performance impact. In this paper, we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible for deciding which memory instructions make use of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.
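
    To make the compiler's role concrete, the sketch below shows one plausible, purely hypothetical heuristic for deciding which memory instructions of a cluster should use its small L0 buffer: rank them by execution frequency and give the hottest ones the low-latency buffer. The latencies, buffer size, and the frequency heuristic are illustrative assumptions, not the paper's actual scheduling techniques.

```python
# Hypothetical compiler-pass sketch: greedily map the most frequently executed
# memory instructions of a cluster onto its small L0 buffer and leave the rest
# to the centralized L1. Latencies, buffer size, and the frequency heuristic
# are illustrative assumptions, not the paper's scheduling algorithm.

L0_LATENCY = 1    # assumed latency of the per-cluster L0 buffer (cycles)
L1_LATENCY = 6    # assumed latency of the centralized L1 data cache (cycles)
L0_ENTRIES = 8    # assumed number of entries in each L0 buffer

def assign_to_l0(mem_instrs):
    """mem_instrs: list of (name, exec_count) pairs for one cluster.
    Returns a dict mapping each instruction name to 'L0' or 'L1'."""
    ranked = sorted(mem_instrs, key=lambda m: m[1], reverse=True)
    return {name: ('L0' if rank < L0_ENTRIES else 'L1')
            for rank, (name, _count) in enumerate(ranked)}

def expected_access_cycles(mem_instrs, placement):
    return sum(count * (L0_LATENCY if placement[name] == 'L0' else L1_LATENCY)
               for name, count in mem_instrs)

if __name__ == '__main__':
    loads = [('load_%d' % i, 1000 // (i + 1)) for i in range(16)]
    plan = assign_to_l0(loads)
    all_l1 = {name: 'L1' for name, _ in loads}
    print('cycles, L0 + L1:', expected_access_cycles(loads, plan))
    print('cycles, L1 only:', expected_access_cycles(loads, all_l1))
```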

    Pushing towards the Limit of Sampling Rate: Adaptive Chasing Sampling

    Measurement samples are often taken in various monitoring applications. To reduce the sensing cost, it is desirable to achieve better sensing quality while using fewer samples. The Compressive Sensing (CS) technique finds its role when the signal to be sampled meets certain sparsity requirements. In this paper we investigate the possibility and basic techniques that could further reduce the number of samples required by conventional CS theory by exploiting learning-based non-uniform adaptive sampling. Based on a typical signal sensing application, we illustrate and evaluate the performance of two of our algorithms, Individual Chasing and Centroid Chasing, for signals with different distribution features. Our proposed learning-based adaptive sampling schemes complement existing efforts in the CS field and do not depend on any specific signal reconstruction technique. Compared to conventional sparse sampling methods, the simulation results demonstrate that our algorithms require 46% fewer samples for accurate signal reconstruction and achieve up to 57% smaller signal reconstruction error under the same noise conditions.
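
    As a rough illustration of learning-based non-uniform adaptive sampling (not the paper's Individual Chasing or Centroid Chasing algorithms), the sketch below takes a coarse uniform pass, spends the remaining sample budget near the largest coarse responses, and reconstructs by simple interpolation, so it stays independent of any particular reconstruction technique. The test signal, budget, and refinement rule are assumptions made for the demo.

```python
import numpy as np

# Toy adaptive sampling sketch: coarse uniform pass, then extra samples
# clustered around the largest coarse responses, reconstruction by
# piecewise-linear interpolation. Illustrative only.

n = 1024
t = np.linspace(0, 1, n)
# A spatially sparse test signal: two narrow bumps on a flat background.
signal = np.exp(-((t - 0.3) / 0.02) ** 2) + 0.5 * np.exp(-((t - 0.7) / 0.03) ** 2)

budget = 64
coarse_idx = np.linspace(0, n - 1, budget // 2, dtype=int)

# Adaptive refinement: place extra samples around the largest coarse values.
hot = coarse_idx[np.argsort(signal[coarse_idx])[-4:]]
extra = np.clip(np.concatenate([h + np.arange(-8, 9, 2) for h in hot]), 0, n - 1)
adaptive_idx = np.unique(np.concatenate([coarse_idx, extra]))

def reconstruct(idx):
    """Piecewise-linear reconstruction from samples taken at `idx`."""
    return np.interp(np.arange(n), idx, signal[idx])

# Compare against uniform sampling with the same total number of samples.
uniform_idx = np.linspace(0, n - 1, len(adaptive_idx), dtype=int)
for name, idx in [('uniform ', uniform_idx), ('adaptive', adaptive_idx)]:
    rmse = np.sqrt(np.mean((reconstruct(idx) - signal) ** 2))
    print(name, 'samples:', len(idx), 'RMSE: %.4f' % rmse)
```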

    Clustering Algorithms for Scale-free Networks and Applications to Cloud Resource Management

    In this paper we introduce algorithms for the construction of scale-free networks and for clustering around the nerve centers, nodes with high connectivity in a scale-free network. We argue that such overlay networks could support self-organization in a complex system like a cloud computing infrastructure and allow the implementation of optimal resource management policies.
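
    For context, the sketch below grows a scale-free graph by preferential attachment, a standard construction that is not necessarily the authors' exact algorithm, and then assigns every node to the highest-degree hub it is best connected to, mirroring the "nerve center" idea; the hub count and assignment rule are assumptions for illustration.

```python
import random
from collections import defaultdict

# Sketch: scale-free graph via preferential attachment, then cluster nodes
# around the highest-degree hubs. Illustrative, not the paper's algorithms.

def preferential_attachment(n, m=2, seed=1):
    random.seed(seed)
    edges = set()
    targets = list(range(m))   # small seed of initial nodes
    repeated = []              # node list weighted by degree
    for v in range(m, n):
        for u in set(targets):
            edges.add((u, v))
            repeated += [u, v]
        targets = random.sample(repeated, m)   # degree-biased choice
    return edges

def cluster_around_hubs(edges, n_hubs=4):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    hubs = sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:n_hubs]
    # Assign each node to the hub it is adjacent to or shares most neighbours with.
    assignment = {h: h for h in hubs}
    for u in adj:
        if u not in assignment:
            assignment[u] = max(hubs, key=lambda h: (u in adj[h]) * 2 + len(adj[u] & adj[h]))
    return hubs, assignment

if __name__ == '__main__':
    hubs, clusters = cluster_around_hubs(preferential_attachment(500))
    sizes = {h: sum(1 for v in clusters.values() if v == h) for h in hubs}
    print('hub cluster sizes:', sizes)
```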

    Smoothed Efficient Algorithms and Reductions for Network Coordination Games

    Worst-case hardness results for most equilibrium computation problems have raised the need for beyond-worst-case analysis. To this end, we study the smoothed complexity of finding pure Nash equilibria in Network Coordination Games, a PLS-complete problem in the worst case. This is a potential game where the sequential-better-response algorithm is known to converge to a pure NE, albeit in exponential time. First, we prove polynomial (resp. quasi-polynomial) smoothed complexity when the underlying game graph is a complete (resp. arbitrary) graph, and every player has constantly many strategies. We note that the complete graph case is reminiscent of perturbing all parameters, a common assumption in most known smoothed analysis results. Second, we define a notion of smoothness-preserving reduction among search problems, and obtain reductions from 2-strategy network coordination games to local-max-cut, and from k-strategy games (with arbitrary k) to local-max-cut up to two flips. The former together with the recent result of [BCC18] gives an alternate O(n^8)-time smoothed algorithm for the 2-strategy case. This notion of reduction allows for the extension of smoothed efficient algorithms from one problem to another. For the first set of results, we develop techniques to bound the probability that an (adversarial) better-response sequence makes slow improvements on the potential. Our approach combines and generalizes the local-max-cut approaches of [ER14,ABPW17] to handle the multi-strategy case: it requires a careful definition of the matrix which captures the increase in potential, a tighter union bound on adversarial sequences, and balancing it with good enough rank bounds. We believe that the approach and notions developed herein could be of interest in addressing the smoothed complexity of other potential and/or congestion games.
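
    As a concrete reference point for the dynamics being analysed, the sketch below runs sequential better-response on a small random 2-strategy network coordination game: both endpoints of an edge receive the same payoff, so the sum of edge payoffs is an exact potential, every strict better response increases it, and the procedure must terminate at a pure NE. The game size and random payoffs are assumptions for the demo; none of the paper's smoothed-analysis machinery appears here.

```python
import random

# Sequential better-response dynamics on a 2-strategy network coordination
# game. The potential (sum of edge payoffs) strictly increases with every
# improving move, so the loop terminates at a pure Nash equilibrium.

def better_response_dynamics(n_players, edges, payoffs, seed=0):
    random.seed(seed)
    strategy = [random.randint(0, 1) for _ in range(n_players)]
    neighbours = {u: [] for u in range(n_players)}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    def utility(u, s_u):
        total = 0.0
        for v in neighbours[u]:
            if (u, v) in payoffs:
                total += payoffs[(u, v)][s_u][strategy[v]]
            else:
                total += payoffs[(v, u)][strategy[v]][s_u]
        return total

    moves, improved = 0, True
    while improved:
        improved = False
        for u in range(n_players):
            other = 1 - strategy[u]
            if utility(u, other) > utility(u, strategy[u]):
                strategy[u] = other   # strict better response: potential rises
                improved, moves = True, moves + 1
    return strategy, moves

if __name__ == '__main__':
    random.seed(1)
    n = 20
    edges = [(u, v) for u in range(n) for v in range(u + 1, n) if random.random() < 0.2]
    payoffs = {e: [[random.random(), random.random()],
                   [random.random(), random.random()]] for e in edges}
    nash, moves = better_response_dynamics(n, edges, payoffs)
    print('pure NE after', moves, 'improving moves:', nash)
```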

    Communication-optimal Parallel and Sequential Cholesky Decomposition

    Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional O(n^3) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR, and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
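
    To illustrate the kind of algorithmic structure that gets compared against such lower bounds, the sketch below is a standard blocked (right-looking) Cholesky factorization: processing b-by-b tiles keeps the active blocks in fast memory, which is the communication-reducing idea in its simplest form. It is a generic tiling with an assumed block size, not the paper's communication-optimal algorithms.

```python
import numpy as np

# Blocked right-looking Cholesky sketch. Operating on b-by-b tiles keeps the
# working set in fast memory; the block size b is an assumed tuning parameter.

def blocked_cholesky(A, b=64):
    """Return the lower-triangular L with A = L @ L.T, computed block by block."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Factor the diagonal block: A_kk = L_kk L_kk^T.
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            L_kk = A[k:e, k:e]
            # Panel update: L_ik = A_ik L_kk^{-T}, computed via a solve with L_kk.
            A[e:, k:e] = np.linalg.solve(L_kk, A[e:, k:e].T).T
            # Trailing-matrix update (the bulk of the arithmetic).
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

if __name__ == '__main__':
    rng = np.random.default_rng(0)
    M = rng.standard_normal((300, 300))
    A = M @ M.T + 300 * np.eye(300)   # symmetric positive definite test matrix
    L = blocked_cholesky(A, b=64)
    print('max |L L^T - A| =', np.max(np.abs(L @ L.T - A)))
```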