692 research outputs found

    Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi

    Full text link
    Intel Xeon Phi is a recently released high-performance coprocessor which features 61 cores each supporting 4 hardware threads with 512-bit wide SIMD registers achieving a peak theoretical performance of 1Tflop/s in double precision. Many scientific applications involve operations on large sparse matrices such as linear solvers, eigensolver, and graph mining algorithms. The core of most of these applications involves the multiplication of a large, sparse matrix with a dense vector (SpMV). In this paper, we investigate the performance of the Xeon Phi coprocessor for SpMV. We first provide a comprehensive introduction to this new architecture and analyze its peak performance with a number of micro benchmarks. Although the design of a Xeon Phi core is not much different than those of the cores in modern processors, its large number of cores and hyperthreading capability allow many application to saturate the available memory bandwidth, which is not the case for many cutting-edge processors. Yet, our performance studies show that it is the memory latency not the bandwidth which creates a bottleneck for SpMV on this architecture. Finally, our experiments show that Xeon Phi's sparse kernel performance is very promising and even better than that of cutting-edge general purpose processors and GPUs

    Low-Power and Low-Noise Clock Generator for High-Speed ADCs

    Get PDF
    The rapid development of high-performance communication technologies reflects a clear trend in demanding requirements imposed on analog-to-digital converters (ADCs). Thus, it appears that these requirements imply higher frequencies not only for the input signal but also higher sampling frequencies, which translates into a higher sensitivity of the circuit to thermal noise and consequent increase in phase-noise. This arises as to the main purpose of this document, which will seek, as its main objective, the development of an architecture that allows the generation of multiple clock signals at high input frequencies with low jitter and low power dissipation to make ADCs more efficient and faster. This dissertation proposes an architecture implemented by a Clock Buffer that converts a differential input signal into a single-ended output signal, a Digital Buffer that transforms a sine wave into a square wave, and finally a Multi Clock Phase Generator (MPCG), consisting of Shift Registers. Both architectures are implemented in 130 nm CMOS technology. The architecture is powered by a LVDS signal with an amplitude of 200 mV and a frequency of 1 GHz, in order to output 8 square wave clock signals with an amplitude of 1.2 V and with a frequency of 125 MHz. The signals obtained at the output later will feed an architecture of 8 Time-Interleaved ADCs. The total area of the implemented circuit is about 8054.3 μm2, for a dissipated power of 5.3 mW and a jitter value of 1.13 ps. This new architecture will be aimed at all types of entities that work with devices that are made up of high-speed performance ADCs, to improve the operation of these same devices, making the processing from a continuous signal to a discrete signal as efficiently as possible.O rápido desenvolvimento das tecnologias de comunicação de alto desempenho, reflete uma tendência clara na exigência dos requisitos impostos aos conversores analógico-digital (ADCs). Deste modo, verifica-se que estes requisitos implicam elevadas frequências não só sinal de entrada, como também frequências elevadas de amostragem o que se traduz numa maior sensibilidade do circuito ao ruído térmico e consequente aumento ruído de fase. Esta problemática, surge como propósito principal deste documento, no qual se procurará, como objetivo principal, o desenvolvimento de uma arquitetura que permita gerar múltiplos sinais de relógio a altas frequências de entrada e períodos de amostragem, com um baixo jitter e baixa energia consumida de forma a tornar mais eficiente e rápido o funcionamento de ADCs. Ruido térmico. Esta dissertação propõe uma arquitetura composta por um amplificador de sinal de relógio que converte o duplo sinal de entrada num único sinal de saída, um amplificador digital que transforma uma onda sinusoidal numa onda quadrada e por fim um gerador de fase múltipla de sinais de relógio (MPCG), constituído por registos de deslocamento. Ambas as arquiteturas são implementadas em tecnologia CMOS de 130 nm. A arquitetura é alimentada com um sinal LVDS de 200 mV de amplitude e com uma frequência de 1 GHz, de forma a obter à saída 8 sinais de relógio de onda quadrada com uma amplitude de 1,2 V e com 125 MHz de frequência. Os sinais obtidos à saída posteriormente alimentarão uma arquitetura de 8 canais com multiplexagem temporal. A área total do circuito implementado é cerca de 8054,3 μm2, para uma potência dissipada de 5,3 mW e para um valor de jitter de 1,13 ps. Esta nova arquitetura será direcionada para todo o tipo de entidades que trabalham com dispositivos que são constituídos por ADCs de alta velocidade de desempenho, de forma a poder melhorar o funcionamento desses mesmos dispositivos, tornando o processamento de sinal continuo para sinal discreto o mais eficiente possível

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    Get PDF
    abstract: General-purpose processors propel the advances and innovations that are the subject of humanity’s many endeavors. Catering to this demand, chip-multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical, microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLCs while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse, or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by an average of 6.5%. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache that performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC that considers the GPU’s latency tolerance feature and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedup, respectively, over a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system architecture innovations to sustain performance scalability of GPG- PUs in the face of slowing Moore’s Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs. I demonstrate that the inherent non-uniform memory access side-effects form the key energy-efficiency bottleneck in the future. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks in CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    An Scalable matrix computing unit architecture for FPGA and SCUMO user design interface

    Get PDF
    High dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix-computing unit designed on the basis of circulant matrices. It optimizes data flow for the computation of any sequence of matrix operations removing the need for data movement for intermediate results, together with the individual matrix operations' performance in direct or transposed form (the transpose matrix operation only requires a data addressing modification). The allowed matrix operations are: matrix-by-matrix addition, subtraction, dot product and multiplication, matrix-by-vector multiplication, and matrix by scalar multiplication. The proposed architecture is fully scalable with the maximum matrix dimension limited by the available resources. In addition, a design environment is also developed, permitting assistance, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N x N matrices, the architecture requires N ALU-RAM blocks and performs O(N*N), requiring N*N +7 and N +7 clock cycles for matrix-matrix and matrix-vector operations, respectively. For the tested Virtex7 FPGA device, the computation for 500 x 500 matrices allows a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units

    Incremental closeness centrality in distributed memory

    Get PDF
    Networks are commonly used to model traffic patterns, social interactions, or web pages. The vertices in a network do not possess the same characteristics: some vertices are naturally more connected and some vertices can be more important. Closeness centrality (CC) is a global metric that quantifies how important is a given vertex in the network. When the network is dynamic and keeps changing, the relative importance of the vertices also changes. The best known algorithm to compute the CC scores makes it impractical to recompute them from scratch after each modification. In this paper, we propose Streamer, a distributed memory framework for incrementally maintaining the closeness centrality scores of a network upon changes. It leverages pipelined, replicated parallelism, and SpMM-based BFSs, and it takes NUMA effects into account. It makes maintaining the Closeness Centrality values of real-life networks with millions of interactions significantly faster and obtains almost linear speedups on a 64 nodes 8 threads/node cluster

    Adaptive memory hierarchies for next generation tiled microarchitectures

    Get PDF
    Les últimes dècades el rendiment dels processadors i de les memòries ha millorat a diferent ritme, limitant el rendiment dels processadors i creant el conegut memory gap. Sol·lucionar aquesta diferència de rendiment és un camp d'investigació d'actualitat i que requereix de noves sol·lucions. Una sol·lució a aquest problema són les memòries “cache”, que permeten reduïr l'impacte d'unes latències de memòria creixents i que conformen la jerarquia de memòria. La majoria de d'organitzacions de les “caches” estan dissenyades per a uniprocessadors o multiprcessadors tradicionals. Avui en dia, però, el creixent nombre de transistors disponible per xip ha permès l'aparició de xips multiprocessador (CMPs). Aquests xips tenen diferents propietats i limitacions i per tant requereixen de jerarquies de memòria específiques per tal de gestionar eficientment els recursos disponibles. En aquesta tesi ens hem centrat en millorar el rendiment i la eficiència energètica de la jerarquia de memòria per CMPs, des de les “caches” fins als controladors de memòria. A la primera part d'aquesta tesi, s'han estudiat organitzacions tradicionals per les “caches” com les privades o compartides i s'ha pogut constatar que, tot i que funcionen bé per a algunes aplicacions, un sistema que s'ajustés dinàmicament seria més eficient. Tècniques com el Cooperative Caching (CC) combinen els avantatges de les dues tècniques però requereixen un mecanisme centralitzat de coherència que té un consum energètic molt elevat. És per això que en aquesta tesi es proposa el Distributed Cooperative Caching (DCC), un mecanisme que proporciona coherència en CMPs i aplica el concepte del cooperative caching de forma distribuïda. Mitjançant l'ús de directoris distribuïts s'obté una sol·lució més escalable i que, a més, disposa d'un mecanisme de marcatge més flexible i eficient energèticament. A la segona part, es demostra que les aplicacions fan diferents usos de la “cache” i que si es realitza una distribució de recursos eficient es poden aprofitar els que estan infrautilitzats. Es proposa l'Elastic Cooperative Caching (ElasticCC), una organització capaç de redistribuïr la memòria “cache” dinàmicament segons els requeriments de cada aplicació. Una de les contribucions més importants d'aquesta tècnica és que la reconfiguració es decideix completament a través del maquinari i que tots els mecanismes utilitzats es basen en estructures distribuïdes, permetent una millor escalabilitat. ElasticCC no només és capaç de reparticionar les “caches” segons els requeriments de cada aplicació, sinó que, a més a més, és capaç d'adaptar-se a les diferents fases d'execució de cada una d'elles. La nostra avaluació també demostra que la reconfiguració dinàmica de l'ElasticCC és tant eficient que gairebé proporciona la mateixa taxa de fallades que una configuració amb el doble de memòria.Finalment, la tesi es centra en l'estudi del comportament de les memòries DRAM i els seus controladors en els CMPs. Es demostra que, tot i que els controladors tradicionals funcionen eficientment per uniprocessadors, en CMPs els diferents patrons d'accés obliguen a repensar com estan dissenyats aquests sistemes. S'han presentat múltiples sol·lucions per CMPs però totes elles es veuen limitades per un compromís entre el rendiment global i l'equitat en l'assignació de recursos. En aquesta tesi es proposen els Thread Row Buffers (TRBs), una zona d'emmagatenament extra a les memòries DRAM que permetria guardar files de dades específiques per a cada aplicació. Aquest mecanisme permet proporcionar un accés equitatiu a la memòria sense perjudicar el seu rendiment global. En resum, en aquesta tesi es presenten noves organitzacions per la jerarquia de memòria dels CMPs centrades en la escalabilitat i adaptativitat als requeriments de les aplicacions. Els resultats presentats demostren que les tècniques proposades proporcionen un millor rendiment i eficiència energètica que les millors tècniques existents fins a l'actualitat.Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well known "memory gap". Solving this performance difference is an important research field and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, that reduce the impact of longer memory accesses and conform the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the apparition of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories. In the first part of this thesis we have studied traditional cache organizations such as shared or private caches and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires the usage of a centralized coherence structure and has a high energy consumption. Therefore we propose the Distributed Cooperative Caching (DCC), a mechanism to provide coherence to chip multiprocessors and apply the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution and, in addition, has a more flexible and energy-efficient tag allocation method. We also show that applications make different uses of cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing a better scalability. ElasticCC not only is able to repartition cache sizes to application requirements, but also is able to dynamically adapt to the different execution phases of each thread. Our experimental evaluation also has shown that the cache partitioning provided by ElasticCC is efficient and is almost able to match the off-chip miss rate of a configuration that doubles the cache space. Finally, we focus in the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers, however, all of them must trade-off between memory throughput and fairness. We propose Thread Row Buffers, an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables a fair memory access scheduling without hurting memory throughput. Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability and of the proposed structures and adaptivity to application behavior. Results show that the presented techniques provide a better performance and energy-efficiency than existing state-of-the-art solutions

    Design of sigma-delta modulators for analog-to-digital conversion intensively using passive circuits

    Get PDF
    This thesis presents the analysis, design implementation and experimental evaluation of passiveactive discrete-time and continuous-time Sigma-Delta (ΣΔ) modulators (ΣΔMs) analog-todigital converters (ADCs). Two prototype circuits were manufactured. The first one, a discrete-time 2nd-order ΣΔM, was designed in a 130 nm CMOS technology. This prototype confirmed the validity of the ultra incomplete settling (UIS) concept used for implementing the passive integrators. This circuit, clocked at 100 MHz and consuming 298 μW, achieves DR/SNR/SNDR of 78.2/73.9/72.8 dB, respectively, for a signal bandwidth of 300 kHz. This results in a Walden FoMW of 139.3 fJ/conv.-step and Schreier FoMS of 168 dB. The final prototype circuit is a highly area and power efficient ΣΔM using a combination of a cascaded topology, a continuous-time RC loop filter and switched-capacitor feedback paths. The modulator requires only two low gain stages that are based on differential pairs. A systematic design methodology based on genetic algorithm, was used, which allowed decreasing the circuit’s sensitivity to the circuit components’ variations. This continuous-time, 2-1 MASH ΣΔM has been designed in a 65 nm CMOS technology and it occupies an area of just 0.027 mm2. Measurement results show that this modulator achieves a peak SNR/SNDR of 76/72.2 dB and DR of 77dB for an input signal bandwidth of 10 MHz, while dissipating 1.57 mW from a 1 V power supply voltage. The ΣΔM achieves a Walden FoMW of 23.6 fJ/level and a Schreier FoMS of 175 dB. The innovations proposed in this circuit result, both, in the reduction of the power consumption and of the chip size. To the best of the author’s knowledge the circuit achieves the lowest Walden FOMW for ΣΔMs operating at signal bandwidth from 5 MHz to 50 MHz reported to date

    Kajian Konsep Aksesibilitas Pada SLB Negeri Bekasi Jaya

    Get PDF
    Sekolah merupakan sebuah jenjang pendidikan yang mampu menghantarkan manusia kepada kualitas kehidupan bermasyarakat yang lebih baik. Jenjang pendidikan yang telah diwajikan dalam penempuhannya menurut pemerintah hingga sekarang ini ialah 12 tahun wajib belajar. Dimulai dari Sekolah Dasar, Sekolah Menengah Pertama, dan Sekolah Menengah Atas, Sekolah Menengah Kejuruan atau sederajat. Kebijakan wajib belajar 12 tahun ini. Arsitektur berperan sebagai ruang aktivitas manusia yang menciptakan hubungan antara ruang dalam dan ruang luar. Ruang dalam arsitektur dapat terbentuk karena persepsi dan imajinasi manusia sebagai pengguna. Konsep aksesibilitas merupakan tingkat kemudahan yang dicapai oleh seseorang terhadap suatu objek di lingkungannya untuk menghasilkan gambaran konsep aksesibilitas pada SLB Negeri Bekasi Jaya.
    corecore