1,751 research outputs found

    On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

    Get PDF
    Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to the machine balance parameter) in parallel computer systems is decreasing. It is there- fore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support it on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to develop lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (that are connected through some interconnection network) and in a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for lower bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution

    Query Rewriting and Optimization for Ontological Databases

    Full text link
    Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints which derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this paper, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints which is well-suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog+/- family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so to produce possibly small and cost-effective UCQ rewritings for an input query.Comment: arXiv admin note: text overlap with arXiv:1312.5914 by other author

    Warping Cache Simulation of Polyhedral Programs

    Get PDF
    Techniques to evaluate a program’s cache performance fall into two camps: 1. Traditional trace-based cache simulators precisely account for sophisticated real-world cache models and support arbitrary workloads, but their runtime is proportional to the number of memory accesses performed by the program under analysis. 2. Relying on implicit workload characterizations such as the polyhedral model, analytical approaches often achieve problem-size-independent runtimes, but so far have been limited to idealized cache models. We introduce a hybrid approach, warping cache simulation, that aims to achieve applicability to real-world cache models and problem-size-independent runtimes. As prior analytical approaches, we focus on programs in the polyhedral model, which allows to reason about the sequence of memory accesses analytically. Combining this analytical reasoning with information about the cache behavior obtained from explicit cache simulation allows us to soundly fast-forward the simulation. By this process of warping, we accelerate the simulation so that its cost is often independent of the number of memory accesses

    Near-optimal replacement policies for shared caches in multicore processors

    Get PDF
    An optimal replacement policy that minimizes the miss rate in a private cache was proposed several decades ago. It requires knowing the future access sequence the cache will receive. There is no equivalent for shared caches because replacement decisions alter this future sequence. We present a novel near-optimal policy for minimizing the miss rate in a shared cache that approaches the optimal execution iteratively. During each iteration, the future access sequence is reconstructed on every miss interleaving the future per-core sequences, taken from the previous iteration. This single sequence feeds a classical private-cache optimum replacement policy. Our evaluation on a shared last-level cache shows that our proposal iteratively converges to a near-optimal miss rate that is independent of the initial conditions, within a margin of 0.1%. The best state-of-the-art online policies achieve around 65% of the miss rate reduction obtained by our near-optimal proposal. In a shared cache, miss rate optimization does not imply the optimization of other metrics. Therefore, we also propose a new near-optimal policy to maximize fairness between cores. The best state-of-the-art online policy achieves 60% of the improvement in fairness seen with our near-optimal policy. Our proposals are useful both for setting upper performance bounds and inspiring implementable mechanisms for shared caches.The authors acknowledge support from grants (1) PID2019-105660RB-C21 and PID2019-107255GB-C22 from Agencia Estatal de Investigación (AEI) from Spain and European Regional Development Fund (ERDF); (2) gaZ: T58_20R research group from Aragón Government and European Social Fund (ESF); and (3) 2014-2020 "Construyendo Europa desde Aragón" from European Regional Development Fund (ERDF).Peer ReviewedPostprint (author's final draft

    Dispersal, facilitation, and burrow architecture in banner-tailed kangaroo rats

    Get PDF
    The largest and most dominant kangaroo rat species in the Chihuahuan Desert is the banner-tailed kangaroo rat (Dipodomys spectabilis). This keystone species constructs mounds containing a complex burrow system around which their ecosystem engineering activities are centered. I studied a population of banner-tailed kangaroo rats at the Sevilleta National Wildlife Refuge, New Mexico from 2005-2009. Specifically, I examined how banner-tailed kangaroo rats: 1) modify their mounds in response to seasonal conditions; 2) spatially affect harvester ants (Pogonomyrmex rugosus) through ecosystem engineering activities; and 3) differ in timing of natal dispersal between sexes. I used mark-recapture, genetic, experimental, and spatially-explicit methods to address these areas of interest. I observed that kangaroo rats remodeled their mounds seasonally in relation to changes in predation risk, seed spoilage risk, and metabolic costs. My results documented an additional keystone effect of banner-tailed kangaroo rats in the Chihuahuan Desert, a facilitatory impact on the spatial structure and dynamics of harvester ant colonies. I also experimentally determined that physiological cues influence timing of natal dispersal in males and females differently

    Reducing Waste in Memory Hierarchies

    Get PDF
    Memory hierarchies play an important role in microarchitectural design to bridge the performance gap between modern microprocessors and main memory. However, memory hierarchies are inefficient due to storing waste. This dissertation quantifies two types of waste, dead blocks and data redundancy. This dissertation studies waste in diverse memory hierarchies and proposes techniques to reduce waste to improve performance with limited overhead. This dissertation observes that waste of dead blocks in an inclusive last level cache consists of two kinds of blocks: blocks that are highly accessed in core caches and blocks that have low temporal locality in both core caches and the last-level cache. Blindly replacing all dead blocks in an inclusive last level cache may degrade performance. This dissertation proposes temporal-based multilevel correlating cache replacement to improve performance of inclusive cache hierarchies. This dissertation observes that waste exists in private caches of graphics processing units (GPUs) as zero-reuse blocks. This dissertation defines zero-reuse blocks as blocks that are dead after being inserted into caches. This dissertation proposes adaptive GPU cache bypassing technique to improve performance as well as reducing power consumption by dynamically bypassing zero-reuse blocks. This dissertation exploits waste of data redundancy at the block-level granularity and finds that conventional cache design wastes capacity because it stores duplicate data. This dissertation quantifies the percentage of data duplication and analyze causes. This dissertation proposes a practical cache deduplication technique to increase the effectiveness of the cache with limited area and power consumption

    Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors

    Full text link
    Most of the data referenced by sequential and parallel applications running in current chip multiprocessors are referenced by a single thread, i.e., private. Recent proposals leverage this observation to improve many aspects of chip multiprocessors, such as reducing coherence overhead or the access latency to distributed caches. The effectiveness of those proposals depends to a large extent on the amount of detected private data. However, the mechanisms proposed so far either do not consider either thread migration or the private use of data within different application phases, or do entail high overhead. As a result, a considerable amount of private data is not detected. In order to increase the detection of private data, this thesis proposes a TLB-based mechanism that is able to account for both thread migration and private application phases with low overhead. Classification status in the proposed TLB-based classification mechanisms is determined by the presence of the page translation stored in other core's TLBs. The classification schemes are analyzed in multilevel TLB hierarchies, for systems with both private and distributed shared last-level TLBs. This thesis introduces a page classification approach based on inspecting other core's TLBs upon every TLB miss. In particular, the proposed classification approach is based on exchange and count of tokens. Token counting on TLBs is a natural and efficient way for classifying memory pages. It does not require the use of complex and undesirable persistent requests or arbitration, since when two ormore TLBs race for accessing a page, tokens are appropriately distributed classifying the page as shared. However, TLB-based ability to classify private pages is strongly dependent on TLB size, as it relies on the presence of a page translation in the system TLBs. To overcome that, different TLB usage predictors (UP) have been proposed, which allow a page classification unaffected by TLB size. Specifically, this thesis introduces a predictor that obtains system-wide page usage information by either employing a shared last-level TLB structure (SUP) or cooperative TLBs working together (CUP).La mayor parte de los datos referenciados por aplicaciones paralelas y secuenciales que se ejecutan enCMPs actuales son referenciadas por un único hilo, es decir, son privados. Recientemente, algunas propuestas aprovechan esta observación para mejorar muchos aspectos de los CMPs, como por ejemplo reducir el sobrecoste de la coherencia o la latencia de los accesos a cachés distribuidas. La efectividad de estas propuestas depende en gran medida de la cantidad de datos que son considerados privados. Sin embargo, los mecanismos propuestos hasta la fecha no consideran la migración de hilos de ejecución ni las fases de una aplicación. Por tanto, una cantidad considerable de datos privados no se detecta apropiadamente. Con el fin de aumentar la detección de datos privados, proponemos un mecanismo basado en las TLBs, capaz de reclasificar los datos a privado, y que detecta la migración de los hilos de ejecución sin añadir complejidad al sistema. Los mecanismos de clasificación en las TLBs se han analizado en estructuras de varios niveles, incluyendo TLBs privadas y con un último nivel de TLB compartido y distribuido. Esta tesis también presenta un mecanismo de clasificación de páginas basado en la inspección de las TLBs de otros núcleos tras cada fallo de TLB. De forma particular, el mecanismo propuesto se basa en el intercambio y el cuenteo de tokens (testigos). Contar tokens en las TLBs supone una forma natural y eficiente para la clasificación de páginas de memoria. Además, evita el uso de solicitudes persistentes o arbitraje alguno, ya que si dos o más TLBs compiten para acceder a una página, los tokens se distribuyen apropiadamente y la clasifican como compartida. Sin embargo, la habilidad de los mecanismos basados en TLB para clasificar páginas privadas depende del tamaño de las TLBs. La clasificación basada en las TLBs se basa en la presencia de una traducción en las TLBs del sistema. Para evitarlo, se han propuesto diversos predictores de uso en las TLBs (UP), los cuales permiten una clasificación independiente del tamaño de las TLBs. En concreto, esta tesis presenta un sistema mediante el que se obtiene información de uso de página a nivel de sistema con la ayuda de un nivel de TLB compartida (SUP) o mediante TLBs cooperando juntas (CUP).La major part de les dades referenciades per aplicacions paral·leles i seqüencials que s'executen en CMPs actuals són referenciades per un sol fil, és a dir, són privades. Recentment, algunes propostes aprofiten aquesta observació per a millorar molts aspectes dels CMPs, com és reduir el sobrecost de la coherència o la latència d'accés a memòries cau distribuïdes. L'efectivitat d'aquestes propostes depen en gran mesura de la quantitat de dades detectades com a privades. No obstant això, els mecanismes proposats fins a la data no consideren la migració de fils d'execució ni les fases d'una aplicació. Per tant, una quantitat considerable de dades privades no es detecta apropiadament. A fi d'augmentar la detecció de dades privades, aquesta tesi proposa un mecanisme basat en les TLBs, capaç de reclassificar les dades com a privades, i que detecta la migració dels fils d'execució sense afegir complexitat al sistema. Els mecanismes de classificació en les TLBs s'han analitzat en estructures de diversos nivells, incloent-hi sistemes amb TLBs d'últimnivell compartides i distribuïdes. Aquesta tesi presenta un mecanisme de classificació de pàgines basat en inspeccionar les TLBs d'altres nuclis després de cada fallada de TLB. Concretament, el mecanisme proposat es basa en l'intercanvi i el compte de tokens. Comptar tokens en les TLBs suposa una forma natural i eficient per a la classificació de pàgines de memòria. A més, evita l'ús de sol·licituds persistents o arbitratge, ja que si dues o més TLBs competeixen per a accedir a una pàgina, els tokens es distribueixen apropiadament i la classifiquen com a compartida. No obstant això, l'habilitat dels mecanismes basats en TLB per a classificar pàgines privades depenen de la grandària de les TLBs. La classificació basada en les TLBs resta en la presència d'una traducció en les TLBs del sistema. Per a evitar-ho, s'han proposat diversos predictors d'ús en les TLBs (UP), els quals permeten una classificació independent de la grandària de les TLBs. Específicament, aquesta tesi introdueix un predictor que obté informació d'ús de la pàgina a escala de sistema mitjançant un nivell de TLB compartida (SUP) or mitjançant TLBs cooperant juntes (CUP).Esteve García, A. (2017). Design of Efficient TLB-based Data Classification Mechanisms in Chip Multiprocessors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86136TESI

    Data sharing in DHT based P2P systems

    Get PDF
    International audienceThe evolution of peer-to-peer (P2P) systems triggered the building of large scale distributed applications. The main application domain is data sharing across a very large number of highly autonomous participants. Building such data sharing systems is particularly challenging because of the "extreme" characteristics of P2P infrastructures: massive distribution, high churn rate, no global control, potentially untrusted participants... This article focuses on declarative querying support, query optimization and data privacy on a major class of P2P systems, that based on Distributed Hash Table (P2P DHT). The usual approaches and the algorithms used by classic distributed systems and databases forproviding data privacy and querying services are not well suited to P2P DHT systems. A considerable amount of work was required to adapt them for the new challenges such systems present. This paper describes the most important solutions found. It also identies important future research trends in data management in P2P DHT systems

    Bounding Worst-Case Response Times of Tasks under PIP

    Get PDF
    Schedulability theory in real-time systems requires prior knowledge of the worst-case execution time (WCET) of every task in the system. One method to determine the WCET is known as static timing analysis. Determination of the priorities among tasks in such a system requires a scheduling policy, which could be either preemptive or nonpreemptive. While static timing analysis and data cache analysis are simplified by using a fully non-preemptive scheduling policy, it results in decreased schedulability. In prior work, a methodology was proposed to bound the data-cache related delay for real-time tasks that, beside having a non-preemptive region (critical section), can otherwise be scheduled preemptively. While the prior approach improves schedulability in comparison to fully non-preemptive methods, it is still conservative in its approach due to its fundamental assumption that a task executing in a critical section may not be preempted by any other task. In this paper, we propose a methodology that incorporates resource sharing policies such as the Priority Inheritance Protocol (PIP) into the calculation of data-cache related delay. In this approach, access to shared resources, which is the primary reason for critical sections within tasks, is controlled by the resource sharing policy. In addition to maintaining correctness of access, such policies strive to limit resource access conflicts, thereby improving the responsiveness of tasks. To the best of our knowledge, this is the first framework that integrates data-cache related delay calculations with resource sharing policies in the context of real-time systems
    • …
    corecore