347 research outputs found

    Fast Differentially Private Matrix Factorization

    Full text link
    Differentially private collaborative filtering is a challenging task, both in terms of accuracy and speed. We present a simple algorithm that is provably differentially private, while offering good performance, using a novel connection of differential privacy to Bayesian posterior sampling via Stochastic Gradient Langevin Dynamics. Due to its simplicity the algorithm lends itself to efficient implementation. By careful systems design and by exploiting the power law behavior of the data to maximize CPU cache bandwidth we are able to generate 1024 dimensional models at a rate of 8.5 million recommendations per second on a single PC

    Prefetching in functional languages

    Get PDF
    Functional programming languages contain a number of runtime and language features, such as garbage collection, indirect memory accesses, linked data structures and immutability, that interact with a processor’s memory system. These conspire to cause a variety of unintuitive memory performance effects. For example, it is slower to traverse through linked lists and arrays of data that have been sorted than to traverse the same data accessed in the order it was allocated. We seek to understand these issues and mitigate them in a manner consistent with functional languages, taking advantage of the features themselves where possible. For example, immutability and garbage collection force linked lists to be allocated roughly sequentially in memory, even when the data pointed to within each node is not. We add language primitives for software-prefetching to the OCaml language to exploit this, and observe significant performance improvements a variety of micro- and macro-benchmarks, resulting in speedups of up to 2× on the out-of-order superscalar Intel Haswell and Xeon Phi Knights Landing systems, and up to 3× on the in-order Arm Cortex-A53.Arm Limite

    A Hybrid Instruction Prefetching Mechanism for Ultra Low-Power Multicore Clusters

    Get PDF
    The instruction memory hierarchy plays a critical role in performance and energy efficiency of ultralow-power (ULP) processors for the Internet-of-Things (IoT) end-nodes. This is mainly due to the extremely tight power envelope and area budgets, which imply small instruction-caches (I-Cache) operating at very low supply voltages (near-threshold). The challenge is aggravated by the fact that multiple processors, fetching in parallel, require plenty of bandwidth from the I-Caches. In this letter, we propose a low-cost and energy efficient hybrid instruction-prefetching mechanism to be integrated with a ULP multicore cluster. We study its performance for a wide range of IoT applications, from cryptography to computer vision, and show that it can effectively improve the hit-rate of almost all of them to above 95% (average performance improvement of over 2 \times ). In addition, we designed our prefetcher and integrated it in a 4-cores cluster in 28 nm fully-depleted silicon-on-insulator (FDSOI) technology. We show that system's power consumption increases only by about 11% and silicon area by less than 1%. Altogether, a total energy reduction of 1.9x is achieved, thanks to more than 2x performance improvement, enabling a significantly longer battery life

    Prebúsqueda adaptativa en un chip multiprocesador

    Get PDF
    La prebúsqueda agresiva ha demostrado ser una técnica eficiente para mejorar el rendimiento de los sistemas monoprocesador. Sin embargo, en sistemas multiprocesador con un último nivel de memoria cache compartido (LLC), la actividad de prebúsqueda inducida por un núcleo consume recursos comunes como espacio en la LLC y ancho de banda. Esto puede degradar el rendimiento del resto de núcleos e incluso el rendimiento general del sistema. Por tanto, la prebúsqueda hardware en un multiprocesador que tiene un último nivel de cache compartido (LLC) es un reto. En este trabajo presentamos ABS, un mecanismo de bajo coste que adecúa la agresividad de la prebúsqueda de cada uno de los núcleos en cada uno de los bancos de la LLC de un chip multiprocesador. El mecanismo se ejecuta de forma independiente en cada banco de la LLC usando sólo información local. A intervalos temporales regulares un núcleo es seleccionado y la tasa de fallos del banco y la utilidad de la prebúsqueda de dicho núcleo son muestreadas. Estas métricas son utilizadas para ajustar la agresividad de la prebúsqueda asociada al núcleo elegido. Nuestros análisis con cargas multiprogramadas de SPEC2K6 muestran que el mecanismo mejora tanto las métricas de usuario (el tiempo medio de retorno un 27% y la equidad un 11%) como las de sistema (la productividad agregada mejora un 22% y el ancho de banda consumido se reduce un 14%) con respecto a un sistema base con ocho núcleos que usa prebúsqueda secuencial marcada de grado fijo. Los resultados son consistentes cuando se utiliza un sistema con dieciséis núcleos o cuando comparamos nuestro mecanismo con propuestas previas

    An Event-Triggered Programmable Prefetcher for Irregular Workloads

    Get PDF
    Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture the memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from the original source code with annotations. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.</jats:p

    Cache Conscious Data Layouting for In-Memory Databases

    Get PDF
    Many applications with manually implemented data management exhibit a data storage pattern in which semantically related data items are stored closer in memory than unrelated data items. The strong sematic relationship between these data items commonly induces contemporary accesses to them. This is called the principle of data locality and has been recognized by hardware vendors. It is commonly exploited to improve the performance of hardware. General Purpose Database Management Systems (DBMSs), whose main goal is to simplify optimal data storage and processing, generally fall short of this claim because the usage pattern of the stored data cannot be anticipated when designing the system. The current interest in column oriented databases indicates that one strategy does not fit all applications. A DBMS that automatically adapts it’s storage strategy to the workload of the database promises a significant performance increase by maximizing the benefit of hardware optimizations that are based on the principle of data locality. This thesis gives an overview of optimizations that are based on the principle of data locality and the effect they have on the data access performance of applications. Based on the findings, a model is introduced that allows an estimation of the costs of data accesses based on the arrangement of the data in the main memory. This model is evaluated through a series of experiments and incorporated into an automatic layouting component for a DBMS. This layouting component allows the calculation of an analytically optimal storage layout. The performance benefits brought by this component are evaluated in an application benchmark
    • …
    corecore