
    L2C: Combining Lossy and Lossless Compression on Memory and I/O

    In this paper we introduce L2C, a hybrid lossy/lossless compression scheme applicable both to the memory subsystem and to the I/O traffic of a processor chip. L2C combines general-purpose lossless compression with state-of-the-art lossy compression to achieve compression ratios of up to 16:1 and improve the utilization of the chip's bandwidth resources. Compressing memory traffic yields lower memory access time, improving system performance and energy efficiency. Compressing I/O traffic offers several benefits for resource-constrained systems, including more efficient storage and networking. We evaluate L2C as a memory compressor in simulation with a set of approximation-tolerant applications. L2C improves baseline execution time by an average of 50%, and total system energy consumption by 16%. Compared to the current state-of-the-art lossy and lossless memory compression approaches, L2C improves execution time by 9% and 26%, respectively, and reduces system energy costs by 3% and 5%, respectively. I/O compression efficacy is evaluated using a set of real-life datasets. L2C achieves compression ratios of up to 10.4:1 for a single dataset and of about 4:1 on average, while introducing no more than 0.4% error.
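
    The hybrid selection between the lossy and lossless paths can be pictured with a small software sketch. The block values, quantization step, error budget, and use of zlib below are illustrative assumptions, not details of the L2C hardware design:

    # Minimal sketch of the hybrid idea: try a lossy (quantizing) encoder first
    # and fall back to lossless compression when the reconstruction error would
    # exceed a threshold. All parameters are illustrative assumptions.
    import struct
    import zlib

    ERROR_THRESHOLD = 0.004   # assumed relative-error budget (0.4%)
    QUANT_STEP = 0.05         # assumed uniform quantization step

    def compress_block(values):
        """Return ('lossy', payload) or ('lossless', payload) for a list of floats."""
        quantized = [round(v / QUANT_STEP) for v in values]
        reconstructed = [q * QUANT_STEP for q in quantized]
        max_rel_err = max(
            abs(r - v) / (abs(v) if v else 1.0)
            for r, v in zip(reconstructed, values)
        )
        if max_rel_err <= ERROR_THRESHOLD:
            # Small integers compress far better than raw doubles.
            payload = zlib.compress(struct.pack(f"{len(quantized)}i", *quantized))
            return "lossy", payload
        # Error budget exceeded: keep the data bit-exact.
        payload = zlib.compress(struct.pack(f"{len(values)}d", *values))
        return "lossless", payload

    def decompress_block(mode, payload, n):
        raw = zlib.decompress(payload)
        if mode == "lossy":
            return [q * QUANT_STEP for q in struct.unpack(f"{n}i", raw)]
        return list(struct.unpack(f"{n}d", raw))

    mode, blob = compress_block([1.00, 1.05, 0.95, 1.10])
    print(mode, decompress_block(mode, blob, 4))   # takes the lossy path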

    Lossy and Lossless Compression Techniques to Improve the Utilization of Memory Bandwidth and Capacity

    Main memory is a critical resource in modern computer systems and is in increasing demand. An increasing number of on-chip cores and specialized accelerators improves the potential processing throughput but also calls for higher data rates and greater memory capacity. In addition, new emerging data-intensive applications further increase memory traffic and footprint. On the other hand, memory bandwidth is pin-limited and power-constrained and is therefore more difficult to scale. Memory capacity is limited by cost and energy considerations.

    This thesis proposes a variety of memory compression techniques as a means to reduce the memory bottleneck. These techniques target two separate problems in the memory hierarchy: memory bandwidth and memory capacity. To reduce transferred data volumes, lossy compression is applied, which can reach more aggressive compression ratios. A reduction of off-chip memory traffic leads to reduced memory latency, which in turn improves the performance and energy efficiency of the system. To improve memory capacity, a novel approach to memory compaction is presented.

    The first part of this thesis introduces Approximate Value Reconstruction (AVR), which combines a low-complexity downsampling compressor with an LLC design able to co-locate compressed and uncompressed data. Two separate thresholds limit the error introduced by approximation. For applications that tolerate aggressive approximation in large fractions of their data, in a system with 1 GB of 1600 MHz DDR4 per core and 1 MB of LLC space per core, AVR reduces memory traffic by up to 70%, execution time by up to 55%, and energy costs by up to 20%, while introducing at most 1.2% error in the application output.

    The second part of this thesis proposes Memory Squeeze (MemSZ), introducing a parallelized implementation of the more advanced Squeeze (SZ) compression method. Furthermore, MemSZ improves on the error-limiting capability of AVR by keeping track of lifetime accumulated error. An alternative memory compression architecture is also proposed, which utilizes 3D-stacked DRAM as a last-level cache. In a system with 1 GB of 800 MHz DDR4 per core and 1 MB of LLC space per core, MemSZ improves execution time, energy, and memory traffic over AVR by up to 15%, 9%, and 64%, respectively.

    The third part of the thesis describes L2C, a hybrid lossy and lossless memory compression scheme. L2C applies lossy compression to approximable data and falls back to lossless compression if an error threshold is exceeded. In a system with 4 GB of 800 MHz DDR4 per core and 1 MB of LLC space per core, L2C improves on the performance of MemSZ by 9% and on its energy consumption by 3%.

    The fourth and final contribution is FlatPack, a novel memory compaction scheme. FlatPack reduces the traffic overhead compared to other memory compaction systems, thus retaining the bandwidth benefits of compression. Furthermore, FlatPack is flexible to changes in block compressibility both over time and between adjacent blocks. When the available memory corresponds to 50% of the application footprint, in a system with 4 GB of 800 MHz DDR4 per core and 1 MB of LLC space per core, FlatPack increases system performance compared to current state-of-the-art designs by 36%, while reducing system energy consumption by 12%.
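
    The lifetime error tracking that MemSZ adds on top of AVR's per-access thresholds can be sketched as simple bookkeeping. The per-access and lifetime budgets and the error metric below are illustrative assumptions, not the thesis's actual parameters:

    # Sketch of error-budget bookkeeping: a block may only be approximated while
    # its accumulated error stays under a lifetime budget; afterwards it is
    # stored exactly. Budgets and error metric are illustrative assumptions.
    class ErrorBudgetTracker:
        def __init__(self, per_access_budget=0.01, lifetime_budget=0.05):
            self.per_access_budget = per_access_budget
            self.lifetime_budget = lifetime_budget
            self.accumulated = {}          # block id -> accumulated relative error

        def allow_lossy(self, block_id, new_error):
            """Decide whether a lossy write introducing `new_error` may be applied."""
            total = self.accumulated.get(block_id, 0.0) + new_error
            if new_error <= self.per_access_budget and total <= self.lifetime_budget:
                self.accumulated[block_id] = total
                return True                # approximate this block
            return False                   # fall back to exact storage

    tracker = ErrorBudgetTracker()
    print(tracker.allow_lossy(block_id=42, new_error=0.004))   # True
    print(tracker.allow_lossy(block_id=42, new_error=0.08))    # False (per-access budget exceeded)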

    From the Quantum Approximate Optimization Algorithm to a Quantum Alternating Operator Ansatz

    The next few years will be exciting as prototype universal quantum processors emerge, enabling implementation of a wider variety of algorithms. Of particular interest are quantum heuristics, which require experimentation on quantum hardware for their evaluation, and which have the potential to significantly expand the breadth of quantum computing applications. A leading candidate is Farhi et al.'s Quantum Approximate Optimization Algorithm, which alternates between applying a cost-function-based Hamiltonian and a mixing Hamiltonian. Here, we extend this framework to allow alternation between more general families of operators. The essence of this extension, the Quantum Alternating Operator Ansatz, is the consideration of general parametrized families of unitaries rather than only those corresponding to the time evolution under a fixed local Hamiltonian for a time specified by the parameter. This ansatz supports the representation of a larger, and potentially more useful, set of states than the original formulation, with potential long-term impact on a broad array of application areas. For cases that call for mixing only within a desired subspace, refocusing on unitaries rather than Hamiltonians enables mixers that are more efficiently implementable than was possible in the original framework. Such mixers are particularly useful for optimization problems with hard constraints that must always be satisfied, defining a feasible subspace, and soft constraints whose violation we wish to minimize. More efficient implementation enables earlier experimental exploration of an alternating operator approach to a wide variety of approximate optimization, exact optimization, and sampling problems. Here, we introduce the Quantum Alternating Operator Ansatz, lay out design criteria for mixing operators, detail mappings for eight problems, and provide brief descriptions of mappings for diverse problems. Comment: 51 pages, 2 figures. Revised to match journal paper.
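
    For orientation, the alternating structure described above is commonly written as follows; the notation is the standard one for this family of ansaetze, not a quotation from the paper:

    \[
      |\psi(\boldsymbol{\gamma}, \boldsymbol{\beta})\rangle
        = U_M(\beta_p)\, U_P(\gamma_p) \cdots U_M(\beta_1)\, U_P(\gamma_1)\, |s\rangle ,
    \]

    where $U_P(\gamma) = e^{-i \gamma H_P}$ applies a phase depending on the cost function, $U_M(\beta)$ is a mixing unitary drawn from a parametrized family (the generalization emphasized above: it need not be the time evolution $e^{-i \beta H_M}$ of a fixed mixing Hamiltonian), and $|s\rangle$ is an initial state, typically chosen inside the feasible subspace when hard constraints are present.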

    Well-Distributed Sequences: Number Theory, Optimal Transport, and Potential Theory

    The purpose of this dissertation is to examine various ways of measuring how uniformly distributed a sequence of points on compact manifolds and finite combinatorial graphs can be, providing bounds and novel explicit algorithms for picking extremely uniform points, as well as connecting disparate branches of mathematics such as Number Theory and Optimal Transport. Chapter 1 sets the stage by introducing some of the fundamental ideas and results that will be used consistently throughout the thesis: we develop and establish Weyl's Theorem, the definition of discrepancy, LeVeque's Inequality, the Erdős-Turán Inequality, the Koksma-Hlawka Inequality, and Schmidt's Theorem about Irregularities of Distribution. Chapter 2 introduces the Monge-Kantorovich transport problem, with special emphasis on the Benamou-Brenier Formula (from 2000) and Peyre's inequality (from 2018). Chapter 3 explores Peyre's Inequality in further depth, considering how specific bounds on the Wasserstein distance between a point measure and the uniform measure may be obtained from it, in particular in terms of the Green's function of the Laplacian on a manifold. We also show how a smoothing procedure can be applied by propagating the heat equation on probability mass in order to obtain stronger bounds on transport distance, using well-known properties of the heat equation. In Chapter 4, we turn to the primary question of the thesis: how to select points on a space which are as uniformly distributed as possible. We consider various diverse approaches one might attempt: an ergodic approach iterating functions with good mixing properties; a dyadic approach introduced in a 1975 theorem of Kakutani on proportional splittings of intervals; and a completely novel potential-theoretic approach, assigning energy to point configurations and greedily minimizing the total potential arising from pairwise point interactions. Such energy minimization questions are certainly not new in the static setting--the physicist Thomson posed the question of how to minimize the potential of electrons on a sphere as far back as 1904. However, a greedy approach to uniform distribution via energy minimization is novel, particularly through the lens of Wasserstein distance, and yields provably Wasserstein-optimal point sequences using the Green's function of the Laplacian as our energy function on manifolds of dimension at least 3 (with dimension 2 losing at most a square-root log factor from the optimal bound). We connect this to known results from Graham, Pausinger, and Proinov regarding best possible uniform bounds on the Wasserstein 2-distance of point sequences in the unit interval. We also present many open questions and conjectures on the optimal asymptotic bounds for the total energy of point configurations and on the growth of the total energy function as points are added, motivated by numerical investigations that display remarkably well-behaved qualities in the dynamical system induced by greedy minimization. In Chapter 5, we consider specific point sequences and bounds on the transport distance from the point measure they generate to the uniform measure. We provide provably optimal rates for the van der Corput sequence, the Kronecker sequence, regular grids, and the measures induced by quadratic residues in a field of prime order. We also prove an upper bound for higher-degree monomial residues in fields of prime order, and conjecture this to be optimal.
In Chapter 6, we consider numerical integration error bounds over Lipschitz functions, asking how closely we can estimate the integral of a function by averaging its values at finitely many points. This is a rather classical question that was answered completely by Bakhvalov in 1959 and has since become a standard example ('the easiest case which is perfectly understood'). Somewhat surprisingly, perhaps, we show that the result is not sharp and improve it in two ways: by refining the function space and by proving that these results can hold uniformly along a subsequence. These bounds refine existing results that were widely considered to be optimal, and we show the intimate connection between transport distance and integration error. Our results are new even for the classical discrete grid. In Chapter 7, we study the case of finite graphs--we show that the fundamental question underlying this thesis can also be meaningfully posed on finite graphs, where it leads to a fascinating combinatorial problem. We show that the philosophy introduced in Chapter 4 can be meaningfully adapted, and we obtain a potential-theoretic algorithm that produces such a sequence on graphs. We show that, using spectral techniques, we are able to obtain empirically strong bounds on the 1-Wasserstein distance between measures on subsets of vertices and the uniform measure, which for graphs of large diameter are much stronger than the trivial diameter bound.
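
    The greedy potential-theoretic construction of Chapter 4 can be summarized schematically; here $G$ denotes the Green's function of the Laplacian on the manifold $M$, and ties are assumed to be broken arbitrarily (a sketch of the idea, not the thesis's precise formulation):

    \[
      x_{n+1} \in \operatorname*{arg\,min}_{x \in M} \; \sum_{k=1}^{n} G(x, x_k),
    \]

    that is, each new point is placed where the potential generated by the points already chosen is smallest.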

    Algorithmic Issues in some Disjoint Clustering Problems in Combinatorial Circuits

    As the modern integrated circuit continues to grow in complexity, the design of very large-scale integrated (VLSI) circuits involves massive teams employing state-of-the-art computer-aided design (CAD) tools. An old, yet significant, CAD problem for VLSI circuits is physical design automation. In this problem, one needs to compute the best physical layout of millions to billions of circuit components on a tiny silicon surface. The process of mapping an electronic design to a chip involves several physical design stages, one of which is clustering. Even for combinatorial circuits, there exist several models for the clustering problem. In particular, we consider the problem of disjoint clustering in combinatorial circuits for delay minimization (CN). The problem of clustering with replication for delay minimization has been well studied and is known to be solvable in polynomial time. However, replication can become expensive when it is unbounded. Consequently, CN is a problem worth investigating. In this dissertation, we establish the computational complexities of several variants of CN. We also present approximation and exact exponential algorithms for some variants of CN. In some cases, we even obtain an approximation factor of strictly less than two. Furthermore, our exact exponential algorithms beat brute force.

    Non-parametric PSF estimation from celestial transit solar images using blind deconvolution

    Context: Characterization of instrumental effects in astronomical imaging is important in order to extract accurate physical information from the observations. The measured image in a real optical instrument is usually represented by the convolution of an ideal image with a Point Spread Function (PSF). The image acquisition process is also contaminated by other sources of noise (read-out, photon counting). The problem of estimating both the PSF and a denoised image is called blind deconvolution and is ill-posed. Aims: We propose a blind deconvolution scheme that relies on image regularization. Contrary to most methods presented in the literature, our method does not assume a parametric model of the PSF and can thus be applied to any telescope. Methods: Our scheme uses a wavelet analysis prior model on the image and weak assumptions on the PSF. We use observations from a celestial transit, where the occulting body can be assumed to be a black disk. These constraints allow us to retain meaningful solutions for the filter and the image, eliminating trivial, translated, and interchanged solutions. Under an additive Gaussian noise assumption, they also enforce noise cancellation and avoid reconstruction artifacts by promoting the whiteness of the residual between the blurred observations and the cleaned data. Results: Our method is applied to synthetic and experimental data. The PSF is estimated for the SECCHI/EUVI instrument using the 2007 lunar transit, and for SDO/AIA using the 2012 Venus transit. Results show that the proposed non-parametric blind deconvolution method is able to estimate the core of the PSF with quality similar to that of parametric methods proposed in the literature. We also show that, if these parametric estimates are incorporated into the acquisition model, the resulting PSF outperforms both the parametric and non-parametric methods. Comment: 31 pages, 47 figures.
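
    As a schematic of the estimation problem, a generic regularized blind-deconvolution formulation consistent with the description above (the symbols and the exact penalty are illustrative; the paper's objective and constraints may differ):

    \[
      y = h * x + n, \qquad
      (\hat{h}, \hat{x}) \in \operatorname*{arg\,min}_{h,\,x}\;
        \tfrac{1}{2}\, \| y - h * x \|_2^2 \;+\; \lambda \, \| \Psi x \|_1
      \quad \text{subject to } h \ge 0,\ \textstyle\sum h = 1,
    \]

    where $y$ is the observed image, $x$ the ideal image, $h$ the PSF, $n$ additive Gaussian noise, and $\Psi$ a wavelet analysis operator; the transit observations add the further constraint that $x$ vanishes on the black disk of the occulting body, which removes the trivial, translated, and interchanged solutions mentioned above.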

    Domain Ordering and Box Cover Problems for Beyond Worst-Case Join Processing

    Join queries are a fundamental computational task in relational database management systems. For decades, complex joins have most often been computed by decomposing the query into a query plan made up of a sequence of binary joins. However, for cyclic queries, this type of query plan is sub-optimal: the worst-case run time of any such query plan exceeds the worst-case number of output tuples. Recent theoretical developments in join query processing have led to join algorithms which are worst-case optimal, meaning that they run in time proportional to the worst-case output size for any query with the same shape and the same number of input tuples. Building on these results is a class of algorithms providing bounds which go beyond this worst-case output size by exploiting the structure of the input instance rather than just the query shape. One such algorithm, Tetris, is worst-case optimal and also provides an upper bound on its run time which depends on the minimum size of a geometric box certificate for the input query. A box cover is a set of n-dimensional boxes which cover all of the tuples not contained in the input relations; a box certificate is a subset of a box cover whose union covers every tuple which is not present in the query output. Many query instances admit different box certificates and box covers when the values in the attributes' domains are ordered differently. If we permute the input query according to a domain ordering which admits a smaller box certificate, use the permuted query as input to Tetris, and then transform the result back with the inverse domain ordering, we can compute the query faster than would be possible if the domain ordering were fixed. If we can efficiently compute an optimal domain ordering for a query, then we can state a beyond-worst-case bound that is stronger than the one provided by Tetris. This paper defines several optimization problems over the space of domain orderings in which the objective is to minimize the size of either the minimum box certificate or the minimum box cover for the given input query. We show that most of these problems are NP-hard, and we provide approximation algorithms for several of them. The most general version of the box cover minimization problem we study, BoxMinPDomF, is shown to be NP-hard, but we can compute an approximation that is only a poly-logarithmic factor larger than K^(a*r), where K is the minimum box cover size under any domain ordering and r is the maximum number of attributes in a relation. This result allows us to compute join queries in time N+K^(a*r*(w+1))+Z, up to a poly-logarithmic factor in N, where N is the number of input tuples, w is the treewidth of the query, and Z is the number of output tuples. This is a new beyond-worst-case bound, and there are queries for which it is exponentially smaller than any bound provided by Tetris. The most general version of the box certificate minimization problem we study, CertMinPDomF, is also shown to be NP-hard. It can be computed exactly if the minimum box certificate size is at most 3, but no approximation algorithm for an arbitrary minimum size is known. Finding such an approximation algorithm is an important direction for future research.
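
    To make the box-cover definition above concrete, here is a toy check in two dimensions; the domain, relation, and boxes are hypothetical and purely illustrative, and the sketch has nothing to do with the paper's actual algorithms:

    from itertools import product

    # Toy 2-D illustration of a box cover: the boxes must cover every point of
    # the domain grid that is NOT a tuple of the relation.
    DOMAIN = range(4)                       # shared attribute domain {0, 1, 2, 3}
    RELATION = {(0, 0), (1, 2), (3, 3)}     # tuples present in the input relation

    def in_box(point, box):
        (x_lo, x_hi), (y_lo, y_hi) = box    # inclusive range per attribute
        x, y = point
        return x_lo <= x <= x_hi and y_lo <= y <= y_hi

    def is_box_cover(boxes):
        """True iff every non-tuple of the domain grid lies in some box."""
        gaps = [p for p in product(DOMAIN, DOMAIN) if p not in RELATION]
        return all(any(in_box(p, b) for b in boxes) for p in gaps)

    # One possible cover of the 13 gap points; choosing a different domain
    # ordering can merge gap regions and allow fewer boxes, which is the effect
    # the domain-ordering optimization exploits.
    boxes = [((1, 3), (0, 1)), ((0, 0), (1, 3)), ((2, 3), (2, 2)), ((1, 2), (3, 3))]
    print(is_box_cover(boxes))              # True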