80,863 research outputs found
Flexible compiler-managed L0 buffers for clustered VLIW processors
Wire delays are a major concern for current and forthcoming processors. One approach to attack this problem is to divide the processor into semi-independent units referred to as clusters. A cluster usually consists of a local register file and a subset of the functional units, while the data cache remains centralized. However, as technology evolves, the latency of such a centralized cache increase leading to an important performance impact. In this paper, we propose to include flexible low-latency buffers in each cluster in order to reduce the performance impact of higher cache latencies. The reduced number of entries in each buffer permits the design of flexible ways to map data from L1 to these buffers. The proposed L0 buffers are managed by the compiler, which is responsible to decide which memory instructions make us of them. Effective instruction scheduling techniques are proposed to generate code that exploits these buffers. Results for the Mediabench benchmark suite show that the performance of a clustered VLIW processor with a unified L1 data cache is improved by 16% when such buffers are used. In addition, the proposed architecture also shows significant advantages over both MultiVLIW processors and clustered processors with a word-interleaved cache, two state-of-the-art designs with a distributed L1 data cache.Peer ReviewedPostprint (published version
Pushing towards the Limit of Sampling Rate: Adaptive Chasing Sampling
Measurement samples are often taken in various monitoring applications. To
reduce the sensing cost, it is desirable to achieve better sensing quality
while using fewer samples. Compressive Sensing (CS) technique finds its role
when the signal to be sampled meets certain sparsity requirements. In this
paper we investigate the possibility and basic techniques that could further
reduce the number of samples involved in conventional CS theory by exploiting
learning-based non-uniform adaptive sampling.
Based on a typical signal sensing application, we illustrate and evaluate the
performance of two of our algorithms, Individual Chasing and Centroid Chasing,
for signals of different distribution features. Our proposed learning-based
adaptive sampling schemes complement existing efforts in CS fields and do not
depend on any specific signal reconstruction technique. Compared to
conventional sparse sampling methods, the simulation results demonstrate that
our algorithms allow less number of samples for accurate signal
reconstruction and achieve up to smaller signal reconstruction error
under the same noise condition.Comment: 9 pages, IEEE MASS 201
Clustering Algorithms for Scale-free Networks and Applications to Cloud Resource Management
In this paper we introduce algorithms for the construction of scale-free
networks and for clustering around the nerve centers, nodes with a high
connectivity in a scale-free networks. We argue that such overlay networks
could support self-organization in a complex system like a cloud computing
infrastructure and allow the implementation of optimal resource management
policies.Comment: 14 pages, 8 Figurs, Journa
Smoothed Efficient Algorithms and Reductions for Network Coordination Games
Worst-case hardness results for most equilibrium computation problems have
raised the need for beyond-worst-case analysis. To this end, we study the
smoothed complexity of finding pure Nash equilibria in Network Coordination
Games, a PLS-complete problem in the worst case. This is a potential game where
the sequential-better-response algorithm is known to converge to a pure NE,
albeit in exponential time. First, we prove polynomial (resp. quasi-polynomial)
smoothed complexity when the underlying game graph is a complete (resp.
arbitrary) graph, and every player has constantly many strategies. We note that
the complete graph case is reminiscent of perturbing all parameters, a common
assumption in most known smoothed analysis results.
Second, we define a notion of smoothness-preserving reduction among search
problems, and obtain reductions from -strategy network coordination games to
local-max-cut, and from -strategy games (with arbitrary ) to
local-max-cut up to two flips. The former together with the recent result of
[BCC18] gives an alternate -time smoothed algorithm for the
-strategy case. This notion of reduction allows for the extension of
smoothed efficient algorithms from one problem to another.
For the first set of results, we develop techniques to bound the probability
that an (adversarial) better-response sequence makes slow improvements on the
potential. Our approach combines and generalizes the local-max-cut approaches
of [ER14,ABPW17] to handle the multi-strategy case: it requires a careful
definition of the matrix which captures the increase in potential, a tighter
union bound on adversarial sequences, and balancing it with good enough rank
bounds. We believe that the approach and notions developed herein could be of
interest in addressing the smoothed complexity of other potential and/or
congestion games
Communication-optimal Parallel and Sequential Cholesky Decomposition
Numerical algorithms have two kinds of costs: arithmetic and communication,
by which we mean either moving data between levels of a memory hierarchy (in
the sequential case) or over a network connecting processors (in the parallel
case). Communication costs often dominate arithmetic costs, so it is of
interest to design algorithms minimizing communication. In this paper we first
extend known lower bounds on the communication cost (both for bandwidth and for
latency) of conventional (O(n^3)) matrix multiplication to Cholesky
factorization, which is used for solving dense symmetric positive definite
linear systems. Second, we compare the costs of various Cholesky decomposition
implementations to these lower bounds and identify the algorithms and data
structures that attain them. In the sequential case, we consider both the
two-level and hierarchical memory models. Combined with prior results in [13,
14, 15], this gives a set of communication-optimal algorithms for O(n^3)
implementations of the three basic factorizations of dense linear algebra: LU
with pivoting, QR and Cholesky. But it goes beyond this prior work on
sequential LU by optimizing communication for any number of levels of memory
hierarchy.Comment: 29 pages, 2 tables, 6 figure
- …