
    Hardware-Oriented Cache Management for Large-Scale Chip Multiprocessors

    One of the key requirements for obtaining high performance from chip multiprocessors (CMPs) is to effectively manage the limited on-chip cache resources shared among co-scheduled threads/processes. This thesis proposes new hardware-oriented solutions for distributed CMP caches. Computer architects face growing challenges when designing cache systems for CMPs. These challenges result from non-uniform access latencies, interference misses, the bandwidth wall problem, and diverse workload characteristics. Our exploration of the CMP cache management problem suggests a CMP caching framework (CC-FR) that defines three main approaches to the problem: (1) data placement, (2) data retention, and (3) data relocation. We implement CC-FR's components by proposing and evaluating multiple cache management mechanisms.

    Pressure and Distance Aware Placement (PDA) decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Flexible Set Balancing (FSB), on the other hand, reduces interference misses by extending the lifetime of cache lines: it retains some fraction of the working set at underutilized local sets to satisfy far-flung reuses. PDA implements CC-FR's data placement and relocation components, while FSB applies CC-FR's retention approach.

    To alleviate non-uniform access latencies and adapt to phase changes in programs, Adaptive Controlled Migration (ACM) dynamically and periodically promotes cache blocks towards the L2 banks closest to the requesting cores. ACM falls under CC-FR's data relocation category. Dynamic Cache Clustering (DCC), on the other hand, addresses diverse workload characteristics and growing non-uniform access latencies by constructing a cache cluster for each core and expanding/contracting all clusters synergistically to match each core's cache demand. DCC implements CC-FR's data placement and relocation approaches.

    Lastly, Dynamic Pressure and Distance Aware Placement (DPDA) combines PDA and ACM to cooperatively mitigate interference misses and non-uniform access latencies, while Dynamic Cache Clustering and Balancing (DCCB) combines DCC and FSB to employ all of CC-FR's categories and achieve higher system performance. Simulation results demonstrate the effectiveness of the proposed mechanisms and show that they compare favorably with related cache designs.
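
    The mechanisms above are described at the architectural level; as a rough Python sketch of the retention idea behind FSB, the toy cache below retains a line evicted from a pressured set in the least-pressured set instead of dropping it, so a later far-flung reuse can still hit. The class name, the trivial index function, the probe-all-sets lookup, and the pressure heuristic are illustrative assumptions, not the thesis's actual hardware mechanism.

        from collections import OrderedDict

        class ToyFSBCache:
            """Toy set-associative cache illustrating FSB-style retention."""

            def __init__(self, num_sets=8, ways=4):
                self.sets = [OrderedDict() for _ in range(num_sets)]  # per-set LRU order
                self.ways = ways
                self.pressure = [0] * num_sets                        # per-set miss counters

            def _home(self, addr):
                return addr % len(self.sets)                          # trivial index function

            def access(self, addr):
                home = self._home(addr)
                if addr in self.sets[home]:
                    self.sets[home].move_to_end(addr)                 # hit: refresh LRU position
                    return True
                # Lookup assist: a retained copy may live in another set.
                for lines in self.sets:
                    if addr in lines:
                        lines.move_to_end(addr)
                        return True
                # Miss: record pressure, evict, and retain the victim elsewhere.
                self.pressure[home] += 1
                if len(self.sets[home]) >= self.ways:
                    victim, _ = self.sets[home].popitem(last=False)   # evict home-set LRU
                    dst = min(range(len(self.sets)), key=self.pressure.__getitem__)
                    if dst != home and len(self.sets[dst]) < self.ways:
                        self.sets[dst][victim] = True                 # retain in a cold set
                self.sets[home][addr] = True
                return False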

    Multi-disk subsystem organizations for very large databases

    This thesis investigates efficient mappings of very large databases with non-uniform access to their data onto a multi-disk subsystem. Two algorithms are developed to distribute the database across multiple disks, possibly with replication, in order to minimize latency and maximize throughput. These algorithms are compared with respect to the amount of replication overhead incurred to achieve a desired throughput. A simulator is developed to model the two mapping algorithms and investigate their efficiency.
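
    The abstract does not detail the two algorithms, so the following Python fragment is only a hedged sketch of the general idea: it greedily maps database fragments to the least-loaded disks by expected access load and replicates "hot" fragments on two disks. The fragment model, the hot_threshold parameter, and the greedy policy are assumptions for illustration, not the thesis's algorithms.

        import heapq

        def map_fragments(fragments, num_disks, hot_threshold=0.1):
            """Greedily place (fragment_id, access_probability) pairs on disks,
            replicating hot fragments to trade space for throughput."""
            disks = [(0.0, d) for d in range(num_disks)]   # (accumulated load, disk id)
            heapq.heapify(disks)
            placement = {}
            for frag, prob in sorted(fragments, key=lambda fp: -fp[1]):  # hottest first
                copies = 2 if prob > hot_threshold and num_disks > 1 else 1
                chosen = [heapq.heappop(disks) for _ in range(copies)]   # least-loaded disks
                placement[frag] = [d for _, d in chosen]
                for load, d in chosen:
                    # Each replica serves an equal share of the fragment's accesses.
                    heapq.heappush(disks, (load + prob / copies, d))
            return placement

        print(map_fragments([("A", 0.5), ("B", 0.3), ("C", 0.1), ("D", 0.1)], num_disks=3))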

    OLTP on Hardware Islands

    Modern hardware is abundantly parallel and increasingly heterogeneous. The numerous processing cores have non-uniform access latencies to main memory and to the processor caches, which causes variability in communication costs. Unfortunately, database systems mostly assume that all processing cores are the same and that microarchitecture differences are not significant enough to appear in critical database execution paths. As we demonstrate in this paper, however, hardware heterogeneity does appear in the critical path, and conventional database architectures achieve suboptimal and, even worse, unpredictable performance. We perform a detailed performance analysis of OLTP deployments in servers with multiple cores per CPU (multicore) and multiple CPUs per server (multisocket). We compare different database deployment strategies in which we vary the number and size of independent database instances running on a single server, from a single shared-everything instance to fine-grained shared-nothing configurations. We quantify the impact of non-uniform hardware on various deployments by (a) examining how efficiently each deployment uses the available hardware resources and (b) measuring the impact of distributed transactions and skewed requests on different workloads. Finally, we argue in favor of shared-nothing deployments that are topology- and workload-aware and take advantage of fast on-chip communication between islands of cores on the same socket.
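
    As a toy illustration of topology-aware deployment, the sketch below groups cores into per-socket islands so that one shared-nothing database instance can be pinned to each island, keeping most communication on-socket. The islands function and the example topology are made up for illustration (on Linux the core-to-socket mapping could be read from /proc/cpuinfo or sysfs); they are not from the paper.

        def islands(core_to_socket):
            """Group core ids into per-socket 'islands' for instance pinning."""
            groups = {}
            for core, socket in core_to_socket.items():
                groups.setdefault(socket, []).append(core)
            return [sorted(cores) for _, cores in sorted(groups.items())]

        # Hypothetical 2-socket, 8-core server: core id -> socket id.
        topology = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
        print(islands(topology))  # [[0, 1, 2, 3], [4, 5, 6, 7]]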

    Design and Simulation to Create a Uniform Concentration Distribution in Fixed Bed Catalytic Reactor Using a Static Mixer

    In a fixed bed reactor, the catalyst particles are of different sizes and are randomly scattered in the bed, which leads to a non-uniform flow pattern. The resulting non-uniform access of reactants to the catalytic surfaces causes a sharp fall in the overall performance of the reactor. Pressure drop and high energy consumption are among the other problems that practitioners face with fixed bed reactors. This study simulates the concentration distribution in two porous catalytic fixed bed reactors in order to investigate the heterogeneous catalysis of nitrogen; one of the reactors is equipped with a static mixer. The results show that, in the reactor without a static mixer, the reaction components are not evenly distributed within the catalyst bed, whereas the static mixer produces a uniform concentration distribution before the particles enter the bed. It can therefore be concluded that using a small static mixer, or a few baffles, in a fixed bed catalytic reactor yields a stable and uniform concentration distribution across the catalyst bed, reduces dead space, improves the purity of the material produced, and increases system performance by as much as 20 percent.

    Equilibrium and non-equilibrium Ising models by means of PCA

    We propose a unified approach to reversible and irreversible PCA dynamics, and we show that, for 1D and 2D nearest-neighbour Ising systems with periodic boundary conditions, we are able to compute the stationary measure of the dynamics even when the dynamics is irreversible. We also show, following [DPSS12], that the stationary measure is very close to the Gibbs measure for a suitable choice of the parameters of the PCA dynamics, in both the reversible and the irreversible cases. We discuss some numerical aspects of this topic, including a possible parallel implementation.
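
    To make the parallel nature of PCA dynamics concrete, here is a minimal NumPy sketch of a synchronous heat-bath update for a 2D nearest-neighbour Ising model with periodic boundaries. This is a generic PCA rule chosen for illustration, not necessarily the specific parametrized family studied in the paper; because every spin is resampled independently given the previous configuration, the sweep is embarrassingly parallel.

        import numpy as np

        def pca_step(spins, beta, rng):
            """One synchronous PCA sweep: every spin is resampled in parallel
            from the heat-bath distribution given the *current* configuration."""
            # Local field: sum of the four nearest neighbours with periodic wrap.
            h = (np.roll(spins, 1, axis=0) + np.roll(spins, -1, axis=0) +
                 np.roll(spins, 1, axis=1) + np.roll(spins, -1, axis=1))
            p_up = np.exp(beta * h) / (2.0 * np.cosh(beta * h))  # P(spin = +1 | field)
            return np.where(rng.random(spins.shape) < p_up, 1, -1)

        rng = np.random.default_rng(0)
        spins = rng.choice([-1, 1], size=(64, 64))
        for _ in range(1000):
            spins = pca_step(spins, beta=0.45, rng=rng)
        print(spins.mean())  # rough magnetization estimate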

    Linear scaling computation of the Fock matrix. IX. Parallel computation of the Coulomb matrix

    We present the parallelization of a quantum-chemical tree code [J. Chem. Phys. 106, 5526 (1997)] for linear scaling computation of the Coulomb matrix. Equal time partition [J. Chem. Phys. 118, 9128 (2003)] is used to load balance the computation. Equal time partition is a measurement-based algorithm for domain decomposition that exploits the small variation of the density between self-consistent-field cycles to achieve load balance. The efficiency of equal time partition is illustrated by several tests involving both finite and periodic systems. Equal time partition is found to deliver 91--98% efficiency on 128 processors in the most time-consuming part of the Coulomb matrix calculation. The parallel quantum-chemical tree code delivers 63--81% overall efficiency on 128 processors with fine-grained parallelism (fewer than two heavy atoms per processor).
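
    As a hedged sketch of the measurement-based idea, the following Python fragment splits a list of work items into contiguous chunks whose measured costs (e.g. timings recorded in the previous self-consistent-field cycle) are roughly equal. The greedy prefix-sum cut rule illustrates the concept only; it is not the published algorithm.

        def equal_time_partition(costs, num_procs):
            """Cut a sequence of per-item costs into num_procs contiguous
            chunks of roughly equal total cost (greedy prefix-sum sketch)."""
            total = sum(costs)
            bounds, acc = [0], 0.0
            for i, c in enumerate(costs):
                acc += c
                # Place the next cut once the prefix reaches the next equal share.
                if len(bounds) < num_procs and acc >= total * len(bounds) / num_procs:
                    bounds.append(i + 1)
            while len(bounds) <= num_procs:        # pad degenerate cases
                bounds.append(len(costs))
            return [(bounds[k], bounds[k + 1]) for k in range(num_procs)]

        costs = [3, 1, 4, 1, 5, 9, 2, 6]           # measured times from the last cycle
        print(equal_time_partition(costs, 3))      # -> [(0, 5), (5, 6), (6, 8)]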

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at a group granularity (a group comprises several cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at the cache group that exhibits the minimum pressure. CE provides quality of service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% on the benchmark programs we examined. Furthermore, our evaluations show that CE outperforms related CMP cache designs.
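
    As a small illustration of the placement decision described above, the Python sketch below keeps a pressure counter per group of sets, decays the counters at epoch boundaries, and routes an incoming block to the minimum-pressure group. The group count and the halving decay are illustrative assumptions, not the paper's tuned parameters.

        class CacheEqualizerSketch:
            """Toy model of CE's pressure-guided block placement."""

            def __init__(self, num_groups=16):
                self.pressure = [0] * num_groups      # temporal pressure per set group

            def record_access(self, group):
                self.pressure[group] += 1             # sample pressure on each access

            def end_epoch(self):
                # Periodic halving so stale pressure information fades away.
                self.pressure = [p // 2 for p in self.pressure]

            def place_block(self):
                # Destination for an incoming block: the least-pressured group.
                return min(range(len(self.pressure)), key=self.pressure.__getitem__)

        ce = CacheEqualizerSketch(num_groups=4)
        for g in [0, 0, 1, 0, 2]:
            ce.record_access(g)
        print(ce.place_block())  # -> 3, the group with no recorded pressure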