307 research outputs found

    Enif-lang: A specialized language for programming network functions on commodity hardware

    Get PDF
    The maturity level reached by today’s commodity platforms makes even low-cost PCs viable alternatives to dedicated hardware to implement real network functions without sacrificing performance. Indeed, the availability of multi-core processing packages and multi-queue network interfaces that can be managed by accelerated I/O frameworks, provides off-the-shelf servers with the necessary power capability for running a broad variety of network applications with near hardware-class performance. At the same time, the introduction of the Software Defined Networks (SDN) and the Network Functions Virtualization (NFV) paradigms call for new programming abstractions and tools to allow this new class of network devices to be flexibly configured and functionally repurposed from the network control plane. The paper presents the ongoing work towards Enif-Lang (Enhanced Network processIng Functional Language), a functional language for programming network functions over generic middleboxes running the Linux operating system. The language addresses concurrent programming by design and is targeted at developing simple stand-alone applications as well as pre-processing stages of packet elaborations. Enif-Lang is implemented as a Domain Specific Language embedded in the Haskell language and inherits the main principles of its ancestor, including the strong typedness and the concept of function compositions. Complex network functions are implemented by composing a set of elementary operations (primitives) by means of a compact yet expressive language grammar. Throughout the paper, the description of the design principles and features of Enif-Lang are accompanied by examples and use cases. In addition, a preliminary performance assessment is carried out to prove the effectiveness of the language for developing practical applications with the performance level required by 5G systems and the Tactile Internet

    Direct Inter-Process Communication (dIPC): Repurposing the CODOMs architecture to accelerate IPC

    Get PDF
    In current architectures, page tables are the fundamental mechanism that allows contemporary OSs to isolate user processes, binding each thread to a specific page table. A thread cannot therefore directly call another process's function or access its data; instead, the OS kernel provides data communication primitives and mediates process synchronization through inter-process communication (IPC) channels, which impede system performance. Alternatively, the recently proposed CODOMs architecture provides memory protection across software modules. Threads can cross module protection boundaries inside the same process using simple procedure calls, while preserving memory isolation. We present dIPC (for "direct IPC"), an OS extension that repurposes and extends the CODOMs architecture to allow threads to cross process boundaries. It maps processes into a shared address space, and eliminates the OS kernel from the critical path of inter-process communication. dIPC is 64.12× faster than local remote procedure calls (RPCs), and 8.87× faster than IPC in the L4 microkernel. We show that applying dIPC to a multi-tier OLTP web server improves performance by up to 5.12× (2.13× on average), and reaches over 94% of the ideal system efficiency.We thank Diego Marr´on for helping with MariaDB, the anonymous reviewers for their feedback and, especially, Andrew Baumann for helping us improve the paper. This research was partially funded by HiPEAC through a collaboration grant for Lluís Vilanova (agreement number 687698 for the EU’s Horizon2020 research and innovation programme), the Israel Science Fundation (ISF grant 769/12) and the Israeli Ministry of Science, Technology and Space.Peer ReviewedPostprint (author's final draft

    Have Your Cake and Eat It? Productive Parallel Programming via Chapel’s High-level Constructs

    Get PDF
    Explicit parallel programming is required to utilize the growing parallelism in computer hardware. However, current mainstream parallel notations, such as OpenMP and MPI, lack in programmability. Chapel tries to tackle this problem by providing high-level constructs. However, the performance implication of such constructs is not clear, and needs to be evaluated. The key contributions of this work are: 1. An evaluation of data parallelism and global-view programming in Chapel through the reduce and transpose benchmarks. 2. Identification of bugs in Chapel runtime code with proposed fixes. 3. A benchmarking framework that aids in conducting systematic and rigorous performance evaluation. Through examples, I show that data parallelism and global-view programming lead to clean and succinct code in Chapel. In the reduce benchmark, I found that data parallelism makes Chapel outperform the baseline. However, in the transpose benchmark, I found that global-view programming causes performance degradation in Chapel due to frequent implicit communication. I argue that this is not an inherent problem with Chapel, and can be solved by compiler optimizations. The results suggest that it is possible to use high-level abstraction in parallel languages to improve the productivity of programmers, while still delivering competitive performance. Furthermore, the benchmarking framework I developed can aid the wider research community in performance evaluations

    Programming self developing blob machines for spatial computing.

    Get PDF

    Memory hierarchies for future HPC architectures

    Get PDF
    Efficiently managing the memory subsystem of modern multi/manycore architectures is increasingly becoming a challenge as systems grow in complexity and heterogeneity. In the field of high performance computing (HPC) in particular, where massively parallel architectures are used and input sets of several terabytes are common, careful management of the memory hierarchy is crucial to exploit the full computing power of these systems. The goal of this thesis is to provide computer architects with valuable information to guide the design of future systems, and in particular of those more widely used in the field of HPC, i.e., symmetric multicore processors (SMPs) and GPUs. With that aim, we present an analysis of some of the inefficiencies and shortcomings of current memory management techniques and propose two novel schemes leveraging the opportunities that arise from the use of new and emerging programming models and computing paradigms. The first contribution of this thesis is a block prefetching mechanism for task-based programming models. Using a task-based programming model simplifies parallel programming and allows for better resource utilization in the supercomputers used in the field of HPC, while enabling sophisticated memory management techniques. The scheme proposed relies on a memory-aware runtime system to guide prefetching while avoiding the main drawbacks of traditional prefetching mechanisms, i.e., cache pollution and lack of timeliness. It leverages the information provided by the user about tasks¿ input data to prefetch contiguous blocks of memory that are certain to be useful. The proposed scheme targets SMPs with large cache hierarchies and uses heuristics to dynamically decide the best cache level to prefetch into without evicting useful data. The focus of this thesis then turns to heterogeneous architectures combining GPUs and traditional multicore processors. The current trend towards tighter coupling of GPU and CPU enables new collaborative computations that tax the memory subsystem in a different manner than previous heterogeneous computations did, and requires careful analysis to understand the trade-offs that are to be expected when designing future memory organizations. The second contribution is an in-depth analysis on the impact of sharing the last-level cache between GPU and CPU cores on a system where the GPU is integrated on the same die as the CPU. The analysis focuses on the effect that a shared cache can have on collaborative computations where GPU and CPU threads concurrently work on a problem and share data at fine granularities. The results presented here show that sharing the last-level cache is largely beneficial as it allows for better resource utilization. In addition, the evaluation shows that collaborative computations benefit significantly from the faster CPU-GPU communication and higher cache hit rates that a shared cache level provides. The final contribution of this thesis analyzes the inefficiencies and drawbacks of demand paging as currently implemented in discrete GPUs by NVIDIA. Then, it proposes a novel memory organization and dynamic migration scheme that allows for efficient data sharing between GPU and CPU, specially when executing collaborative computations where data is migrated back and forth between the two separate memories. This scheme migrates data at cache line granularities transparently to the user and operating system, avoiding false sharing and the unnecessary data transfers that occur with demand paging. The results show that the proposed scheme is able to outperform the baseline system by reducing the migration latency of data that is copied multiple times between the two memories. In addition, analysis of different interconnect latencies shows that fine-grained data sharing between GPU and CPU is feasible as long as future interconnect technologies achieve four to five times lower round-trip times than PCI-Express 3.0.La gestión eficiente del subsistema de memoria se ha convertido en un problema complejo a la vez que los sistemas crecen en complejidad y heterogeneidad. En el campo de la computación de altas prestaciones (HPC) en particular, donde arquitecturas masivamente paralelas son usadas y entradas de varios terabytes son comunes, una gestión cuidadosa de la jerarquía de memoria es crucial para conseguir explotar todo el potencial de estas arquitecturas. El objetivo de esta tesis es proporcionar a los arquitectos de computadores información valiosa para el diseño de los sistemas del futuro, y en concreto de los más comúnmente usados en el campo de HPC, los procesadores multinúcleo simétricos (SMP) y las tarjetas gráficas (GPU). Para ello, presentamos un análisis de las ineficiencias y los inconvenientes de los sistemas de gestión de memoria actuales, y proponemos dos técnicas nuevas que aprovechan las oportunidades surgidas del uso de nuevos y emergentes modelos de programación y paradigmas de computación. La primera contribución de esta tesis es un mecanismo de prefetch de bloques para modelos de programación basados en tareas. Usando modelos de programación orientados a tareas simplifica la programación paralela y permite hacer un mejor uso de los recursos en los supercomputadores usados en HPC, mientras permiten el uso de sofisticados mecanismos de gestión de memoria. La técnica propuesta se basa en un sistema de runtime para guiar el prefetch de datos mientras evita los principales inconvenientes tradicionalmente asociados con prefetching, la polución de cache y la medida incorrecta de los tiempos. El mecanismo utiliza la información sobre las entradas de las tareas proporcionada por el usuario para prefetchear bloques contiguos de memoria sobre los que hay certeza que serán utilizados. El mecanismo está dirigido a arquitecturas SMP con amplias jerarquías de cache, y usa heurísticas para decidir dinámicamente en qué nivel de caché colocar los datos sin desplazar datos útiles. El focus de la tesis gira luego a arquitecturas heterogéneas que combinan GPUs con procesadores multinúcleo tradicionales. La actual tendencia a unir GPU y CPU permite el uso de una nueva serie de computaciones colaborativas que afectan al subsistema de memoria de forma diferente que las computaciones heterogéneas anteriores, y requiere de un cuidadoso análisis para entender las consecuencias que esto tiene en el diseño de las organizaciones de memoria futuras. La segunda contribución de la tesis es un análisis detallado del impacto que supone compartir el último nivel de cache entre núcleos de GPU y CPU en sistemas donde la GPU está integrada en el mismo chip que la CPU. El análisis se centra en el efecto que la cache compartida tiene en colaboraciones colaborativas donde hilos de GPU y CPU trabajan concurrentemente en un problema y comparten datos a grano fino. Los resultados presentados en esta tesis muestran que compartir el último nivel de cache es mayormente beneficioso ya que permite un mejor uso de los recursos. Además, la evaluación muestra que las computaciones colaborativas se benefician en gran medida de la comunicación más rápida entre GPU y CPU y las mayores tasas de acierto de cache que un nivel de cache compartido proporcionan
    • …
    corecore