13 research outputs found

    Improving the SLLC Efficiency by exploiting reuse locality and adjusting prefetch

    Get PDF
    Desde los teléfonos móviles inteligentes hasta nuestro ordenador portátil los sistemas electrónicos que incluyen chips multiprocesador (CMP) están presentes en nuestra vida cotidiana de una manera abrumadora. Los CMPs contienen varios núcleos o CPUs que tienen que ser alimentados con datos provenientes de la memoria. Pero la velocidad a la que los núcleos que forman el CMP necesitan los datos es mucho mayor que la velocidad a la que la memoria es capaz de proporcionar dichos datos. De hecho, esta diferencia ha ido aumentando desde prácticamente el día en el que ambos dispositivos fueron concebidos. Esta diferencia en el rendimiento de ambos dispositivos se ha venido a llamar "the memory gap". Al mismo tiempo que dicha diferencia aumentaba, los lenguajes de programación proporcionaban a los programadores modelos de memoria que podían acceder a un espacio prácticamente infinito y al que, además, se accedía de manera instantánea. Pero el tamaño de cualquier estructura hardware está íntimamente relacionado con su tiempo de acceso y éste será mayor cuanto mayor sea el tamaño la estructura hardware a acceder. Con el ánimo de deshacer esta aparente contradicción, los arquitectos de computadores incluyeron memorias intermedias entre las CPUs y la grande, aunque al mismo tiempo lenta, memoria principal. Estas memorias intermedias se denominan memorias cache o simplemente caches. Debido a la gran diferencia que existe entre la velocidad del procesador y la de la memoria principal. Los CMPs en la actualidad están provistos de una jerarquía de memorias cache que tiene dos o tres niveles. Las caches que están cerca del procesador sólo contienen unos pocos kilobytes (entre 4 y 64) accesibles en uno o pocos ciclos de reloj, mientras que las que se encuentran más alejadas del procesador pueden llegar a contener varios megabytes y tener un tiempo de acceso de varias decenas de ciclos. Los programas al ser ejecutados muestran una propiedad llamada localidad que se expresa en los ejes espacial y temporal. La localidad temporal es la propiedad que dice que el programa volverá a usar datos que usó recientemente, cuanto más recientemente los usó, más probable es que vuelva a hacerlo. Mientras que la localidad espacial es la propiedad que dice que el programa tenderá a usar datos que están próximos en el espacio de memoria a datos que usó recientemente. Las memorias cache han sido diseñadas tradicionalmente para explotar la localidad. En concreto, la localidad temporal se explotaba mediante una adecuada política de reemplazo, mientras que la localidad espacial se explota al contener cada bloque de cache varios datos o palabras. Un modo adicional de conseguir explotar una mayor cantidad de localidad espacial es mediante el uso de la técnica llamada prebúsqueda. La política de reemplazo influye de manera crítica en la tasa de aciertos de la memoria cache. En un CMP provisto de una jerarquía de memorias cache, la localidad temporal se explota en aquellos niveles más cercanos a los núcleos. Así que muchos de los bloques insertados en la SLLC son de un solo uso, es decir, estos bloques no experimentarán ningún acierto más durante todo el tiempo que permanezcan en la SLLC. Sin embargo, aquellos bloques que lleguen a experimentar un acierto en la SLLC, normalmente experimentarán muchos más aciertos. Por lo tanto, que la política de reemplazo base sus decisiones en la posible explotación de la localidad temporal, es una asunción inválida cuando hablamos de la SLLC. Por el contrario, Este comportamiento indica que dicha política de reemplazo de la SLLC debería estar basada en el reúso1 en lugar de en la localidad temporal. La prebúsqueda hardware tiene por objetivo cargar en la cache datos antes de que sea el procesador quien los pida. La validez de esta técnica a la hora de reducir la latencia media de acceso a memoria ha sido ampliamente demostrada. La prebúsqueda funciona especialmente bien en las jerarquías de memoria de sistemas monoprocesador, donde solamente hay un flujo de datos entre el procesador y la memoria. Sin embargo, cuando la prebúsqueda se usa en un sistema multiprocesador donde diferentes aplicaciones se están ejecutando al mismo tiempo, las prebúsquedas asociadas a un núcleo podrían interferir con los datos cargados en la cache por otro núcleo, provocando la eliminación de los contenidos de otra aplicación y dañando su rendimiento. Es necesario por tanto un mecanismo para regular la prebúsqueda asociada a cada uno de los núcleos. Este mecanismo debería tener por objetivo el mejorar el rendimiento general del sistema. 1 Aunque el DRAE no contenga su definición, usaremos aquí el verbo reusar (así como sus formas derivadas) como sinónimo de volver a utilizar. Cada fallo en la SLLC provoca un acceso a la memoria principal que se encuentra fuera del chip. Además la memoria principal está hecha de chips de DRAM. Ambos factores incrementan su latencia de acceso, latencia que se suma a cada uno de los accesos que falla en la SLLC, penalizando a la vez la latencia media de acceso a memoria. Por lo tanto, la tasa de aciertos de la SLLC es un factor crítico para lograr una latencia media de acceso a memoria óptima. Esta tesis fija su atención en la eficiencia de los dos aspectos comentados con anterioridad: la eficiencia de la prebúsqueda y la eficiencia de la política de reemplazo. Las contribuciones principales de esta tesis son las siguientes: 1) Enunciamos una propiedad llamada localidad de reúso que dice que i) los bloques de cache que hayan sido usados más de una vez tienen una alta probabilidad de ser usados muchas veces en el futuro. ii) Los bloques de cache recientemente reusados son más útiles que otros reúsados previamente. Defendemos en esta tesis que el patrón de acceso a la SLLC muestra localidad de reúso. 2) En esta tesis se proponen dos algoritmos de reemplazo capaces de explotar la localidad de reúso, Least-recently reused (LRR) y Not-recently reused (NRR). Estos dos nuevos algoritmos son modificaciones de otros dos muy bien conocidos: Least-recently used (LRU) y Not-recently used (NRU). Dichos algoritmos fueron diseñados para explotar la localidad temporal, mientras que los nuestros explotan la local- idad de reúso. Las modificaciones propuestas no suponen ninguna sobrecarga hardware respecto a los algoritmos base. Durante esta tesis se muestra que nuestros algoritmos mejoran consistentemente el rendimiento de los originales. 3) Proponemos un novedoso diseño para la SLLC llamado Reuse Cache. En este diseño los arrays de etiquetas y datos de la cache están desacoplados. Solamente se almacenan en el array de datos aquellos bloques que hayan mostrado reúso. El array de etiquetas se usa para detectar reúso y mantener la coherencia. Esta estructura permite reducir el tamaño del array de datos de manera drástica. Como ejemplo, una Reuse Cache con un array de etiquetas equivalente al de una cache convencional de 4MB y un array de datos de 1MB, tiene el mismo rendimiento medio que una cache convencional de 8MB, pero con un ahorro de almacenamiento de en torno al 84%. 4) Un controlador de bajo coste llamado ABS capaz de ajustar la agresividad de la prebúsqueda asociada a cada uno de los núcleos de un CMP pero con el ánimo de mejorar el rendimiento general del sistema. El controlador funciona de manera aislada en cada uno de los bancos de la SLLC y recoge métricas locales. Para optimizar el rendimiento global del sistema busca la combinación óptima de valores de la agresividad de prebúsqueda. Para inferir cuál es esa combinación óptima usa una estrategia de búsqueda hill-climbing

    Efficient caching algorithms for memory management in computer systems

    Get PDF
    As disk performance continues to lag behind that of memory systems and processors, fully utilizing memory to reduce disk accesses is a highly effective effort to improve the entire system performance. Furthermore, to serve the applications running on a computer in distributed systems, not only the local memory but also the memory on remote servers must be effectively managed to minimize I/O operations. The critical challenges in an effective memory cache management include: (1) Insightfully understanding and quantifying the locality inherent in the memory access requests; (2) Effectively utilizing the locality information in replacement algorithms; (3) Intelligently placing and replacing data in the multi-level caches of a distributed system; (4) Ensuring that the overheads of the proposed schemes are acceptable.;This dissertation provides solutions and makes unique and novel contributions in application locality quantification, general replacement algorithms, low-cost replacement policy, thrashing protection, as well as multi-level cache management in a distributed system. First, the dissertation proposes a new method to quantify locality strength, and accurately to identify the data with strong locality. It also provides a new replacement algorithm, which significantly outperforms existing algorithms. Second, considering the extremely low-cost requirements on replacement policies in virtual memory management, the dissertation proposes a policy meeting the requirements, and considerably exceeding the performance existing policies. Third, the dissertation provides an effective scheme to protect the system from thrashing for running memory-intensive applications. Finally, the dissertation provides a multi-level block placement and replacement protocol in a distributed client-server environment, exploiting non-uniform locality strengths in the I/O access requests.;The methodology used in this study include careful application behavior characterization, system requirement analysis, algorithm designs, trace-driven simulation, and system implementations. A main conclusion of the work is that there is still much room for innovation and significant performance improvement for the seemingly mature and stable policies that have been broadly used in the current operating system design

    Algorithms incorporating concurrency and caching

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009.Cataloged from PDF version of thesis.Includes bibliographical references (p. 189-203).This thesis describes provably good algorithms for modern large-scale computer systems, including today's multicores. Designing efficient algorithms for these systems involves overcoming many challenges, including concurrency (dealing with parallel accesses to the same data) and caching (achieving good memory performance.) This thesis includes two parallel algorithms that focus on testing for atomicity violations in a parallel fork-join program. These algorithms augment a parallel program with a data structure that answers queries about the program's structure, on the fly. Specifically, one data structure, called SP-ordered-bags, maintains the series-parallel relationships among threads, which is vital for uncovering race conditions (bugs) in the program. Another data structure, called XConflict, aids in detecting conflicts in a transactional-memory system with nested parallel transactions. For a program with work T and span To, maintaining either data structure adds an overhead of PT, to the running time of the parallel program when executed on P processors using an efficient scheduler, yielding a total runtime of O(T1/P + PTo). For each of these data structures, queries can be answered in 0(1) time. This thesis also introduces the compressed sparse rows (CSB) storage format for sparse matrices, which allows both Ax and ATx to be computed efficiently in parallel, where A is an n x n sparse matrix with nnz > n nonzeros and x is a dense n-vector. The parallel multiplication algorithm uses e(nnz) work and ... span, yielding a parallelism of ... , which is amply high for virtually any large matrix.(cont.) Also addressing concurrency, this thesis considers two scheduling problems. The first scheduling problem, motivated by transactional memory, considers randomized backoff when jobs have different lengths. I give an analysis showing that binary exponential backoff achieves makespan V2e(6v 1- i ) with high probability, where V is the total length of all n contending jobs. This bound is significantly larger than when jobs are all the same size. A variant of exponential backoff, however, achieves makespan of ... with high probability. I also present the size-hashed backoff protocol, specifically designed for jobs having different lengths, that achieves makespan ... with high probability. The second scheduling problem considers scheduling n unit-length jobs on m unrelated machines, where each job may fail probabilistically. Specifically, an input consists of a set of n jobs, a directed acyclic graph G describing the precedence constraints among jobs, and a failure probability qij for each job j and machine i. The goal is to find a schedule that minimizes the expected makespan. I give an O(log log(min {m, n}))-approximation for the case of independent jobs (when there are no precedence constraints) and an O(log(n + m) log log(min {m, n}))-approximation algorithm when precedence constraints form disjoint chains. This chain algorithm can be extended into one that supports precedence constraints that are trees, which worsens the approximation by another log(n) factor. To address caching, this thesis includes several new variants of cache-oblivious dynamic dictionaries.(cont.) A cache-oblivious dictionary fills the same niche as a classic B-tree, but it does so without tuning for particular memory parameters. Thus, cache-oblivious dictionaries optimize for all levels of a multilevel hierarchy and are more portable than traditional B-trees. I describe how to add concurrency to several previously existing cache-oblivious dictionaries. I also describe two new data structures that achieve significantly cheaper insertions with a small overhead on searches. The cache-oblivious lookahead array (COLA) supports insertions/deletions and searches in O((1/B) log N) and O(log N) memory transfers, respectively, where B is the block size, M is the memory size, and N is the number of elements in the data structure. The xDict supports these operations in O((1/1B E1-) logB(N/M)) and O((1/)0logB(N/M)) memory transfers, respectively, where 0 < E < 1 is a tunable parameter. Also on caching, this thesis answers the question: what is the worst possible page-replacement strategy? The goal of this whimsical chapter is to devise an online strategy that achieves the highest possible fraction of page faults / cache misses as compared to the worst offline strategy. I show that there is no deterministic strategy that is competitive with the worst offline. I also give a randomized strategy based on the most recently used heuristic and show that it is the worst possible pagereplacement policy. On a more serious note, I also show that direct mapping is, in some sense, a worst possible page-replacement policy. Finally, this thesis includes a new algorithm, following a new approach, for the problem of maintaining a topological ordering of a dag as edges are dynamically inserted.(cont.) The main result included here is an O(n2 log n) algorithm for maintaining a topological ordering in the presence of up to m < n(n - 1)/2 edge insertions. In contrast, the previously best algorithm has a total running time of O(min { m3/ 2, n5/2 }). Although these algorithms are not parallel and do not exhibit particularly good locality, some of the data structural techniques employed in my solution are similar to others in this thesis.by Jeremy T. Fineman.Ph.D

    Study of Fine-Grained, Irregular Parallel Applications on a Many-Core Processor

    Get PDF
    This dissertation demonstrates the possibility of obtaining strong speedups for a variety of parallel applications versus the best serial and parallel implementations on commodity platforms. These results were obtained using the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two. Biconnectivity: For finding the biconnected components of a graph, we demonstrate speedups of 9x to 33x on XMT relative to the best serial algorithm using a relatively modest silicon budget. Further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two contemporary NVIDIA GPUs of similar or greater silicon area. Prior studies of parallel biconnectivity algorithms achieved at most a 4x speedup, but we could not find biconnectivity code for GPUs to compare biconnectivity against them. Triconnectivity: We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups on XMT over the only published optimal (linear-time) serial implementation of a triconnected components algorithm running on a modern CPU. To our knowledge, no other parallel implementation of a triconnected components algorithm has been published for any platform. Burrows-Wheeler compression: We present novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet and their empirical evaluation. To validate these theoretical algorithms, we implement them on XMT and show speedups of up to 25x for compression, and 13x for decompression, versus bzip2, the de facto standard implementation of Burrows-Wheeler compression. Fast Fourier transform (FFT): Using FFT as an example, we examine the impact that adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster. Boosted decision trees: This chapter focuses on the hybrid memory architecture of the XMT computer platform, a key part of which is a flexible all-to-all interconnection network that connects processors to shared memory modules. First, to understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks including list ranking. Then, we contrast the scalability of applications with that of routines. In particular, regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism, perhaps even serial. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate the potential for a speedup of 3.3X over NVIDIA's most powerful GPU to date for XGBoost, an implementation of boosted decision trees, a timely machine learning approach. Boolean satisfiability (SAT): SAT is an important performance-hungry problem with applications in many problem domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers. We show the potential for speedups of up to 382X across a variety of problem instances. We hope that these results will stimulate future research

    Vers la Compression à Tous les Niveaux de la Hiérarchie de la Mémoire

    Get PDF
    Hardware compression techniques are typically simplifications of software compression methods. They must, however, comply with area, power and latency constraints. This study unveils the challenges of adopting compression in memory design. The goal of this analysis is not to summarize proposals, but to put in evidence the solutions they employ to handle those challenges. An in-depth description of the main characteristics of multiple methods is provided, as well as criteria that can be used as a basis for the assessment of such schemes.Typically, these schemes are not very efficient, and those that do compress well decompress slowly. This work explores their granularity to redefine their perspectives and improve their efficiency, through a concept called Region-Chunk compression. Its goal is to achieve low (good) compression ratio and fast decompression latency. The key observation is that by further sub-dividing the chunks of data being compressed one can reduce data duplication. This concept can be applied to several previously proposed compressors, resulting in a reduction of their average compressed size. In particular, a single-cycle-decompression compressor is boosted to reach a compressibility level competitive to state-of-the-art proposals.Finally, to increase the probability of successfully co-allocating compressed lines, Pairwise Space Sharing (PSS) is proposed. PSS can be applied orthogonally to compaction methods at no extra latency penalty, and with a cost-effective metadata overhead. The proposed system (Region-Chunk+PSS) further enhances the normalized average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.Les techniques de compression matérielle sont généralement des simplifications des méthodes de compression logicielle. Elles doivent, toutefois, se conformer aux contraintes de surface, de puissance et de latence. Cette étude dévoile les défis de l’adoption de la compression dans la conception de la mémoire. Le but de l’analyse n’est pas de résumer les propositions, mais de mettre en évidence les solutions qu’ils emploient pour relever ces défis. Une description détaillée des principales caractéristiques de plusieurs méthodes est fournie, ainsi que des critères qui peuvent être utilisés comme base pour l’évaluation de ces systèmes.Généralement, ces schémas ne sont pas très efficaces, et les schémas qui compressent bien décompressent lentement. Ce travail explore leur granularité pour redéfinir leurs perspectives et améliorer leur efficacité, à travers un concept appelé compression Region-Chunk. Son objectif est d’obtenir un haut (bon) taux de compression et une latence de décompression rapide. L’observation clé est qu’en subdivisant davantage les blocs de données compressés, on peut réduire la duplication des données. Ce concept peut être appliqué à plusieurs compresseurs précédemment proposés, entraînant une réduction de leur taille moyenne compressée. En particulier, un compresseur à décompression à cycle unique est boosté pour atteindre un niveau de compressibilité compétitif par rapport aux propositions de pointe.Enfin, pour augmenter la probabilité de co-allouer avec succès des lignes compressées, Pairwise Space Sharing (PSS) est proposé. PSS peutêtre appliqué orthogonalement aux méthodes de compactage sans pénalité de latence supplémentaire, et avec une surcharge de métadonnées rentable. Le système proposé (Region-Chunk + PSS) améliore encore la capacité normalisé moyenne du cache de 2,7% (moyenne géométrique), tout en offrant une courte latence de décompression

    Tiled microprocessors

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (p. 251-258).Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to 100's or 1000's of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing them.(cont.) To develop these ideas, the thesis details the design, implementation and analysis of a tiled microprocessor prototype, the Raw Microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.by Michael Bedford Taylor.Ph.D

    The Proceeding Of The 1st International Conference Technology on Biosciences and Social Science 2016: “Industry Based On Knowledges

    Get PDF
    The Proceeding Of&nbsp;The 1st&nbsp;International Conference Technology on&nbsp;Biosciences and Social Science 2016 &nbsp;Theme: “Industry Based On Knowledges” 17th– 19th&nbsp;November 2016, Convention Hall, Andalas&nbsp;University, Padang, West Sumatera, Indonesia Organized by&nbsp; Animal Science Faculty of Andalas University&nbsp;and Alumbi Center of Universiti Putra Malaysia&nbsp; &nbsp

    Department of Defense Dictionary of Military and Associated Terms

    Get PDF
    The Joint Publication 1-02, Department of Defense Dictionary of Military and Associated Terms sets forth standard US military and associated terminology to encompass the joint activity of the Armed Forces of the United States. These military and associated terms, together with their definitions, constitute approved Department of Defense (DOD) terminology for general use by all DOD components

    Proceedings of DRS Learn X Design 2019: Insider Knowledge

    Get PDF
    corecore