622 research outputs found

    Doctor of Philosophy

    As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security, and resource management functionality. However, many system-level functions, such as encryption, packet inspection, and error correction, are computationally expensive and require substantial computing power. Moreover, today's application workloads have reached gigabyte and terabyte scales, which demand even more computing power. To meet the rapidly growing demand for computing power at the system level, this dissertation proposes using parallel graphics processing units (GPUs) in system software. GPUs excel at parallel computing, and their parallel performance is improving much faster than that of central processing units (CPUs). However, system-level software was originally designed to be latency-oriented, whereas GPUs are designed for long-running computation and large-scale data processing and are therefore throughput-oriented. This mismatch makes it difficult to fit system-level software to GPUs. This dissertation presents generic principles of system-level GPU computing developed while building two general frameworks for integrating GPU computing into storage and network packet processing. The principles are generic design techniques and abstractions that address common system-level GPU computing challenges. They have been evaluated in concrete cases, including storage and network packet processing applications augmented with GPU computing. The significant performance improvements found in the evaluation demonstrate the effectiveness and efficiency of the proposed techniques and abstractions. The dissertation also presents a literature survey of the relatively young field of system-level GPU computing, introducing the state of the art in both applications and techniques as well as their future potential
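
    A central difficulty named in this abstract is reconciling latency-oriented system software with throughput-oriented GPUs; one common way to bridge that gap is to batch many small requests before offloading them. The C++ sketch below (not taken from the dissertation; names such as GpuBatcher and process_batch_on_gpu are hypothetical, and the actual GPU kernel launch is left as a placeholder) illustrates such a batching layer bounded by a batch size and a wait budget.

        #include <chrono>
        #include <condition_variable>
        #include <mutex>
        #include <thread>
        #include <vector>

        // Hypothetical request type: e.g., a block to encrypt or a packet to inspect.
        struct Request { std::vector<char> payload; };

        // Placeholder for a GPU kernel launch (e.g., a batched-encryption kernel).
        void process_batch_on_gpu(const std::vector<Request>& batch) { /* ... */ }

        // Batching layer: accumulates latency-oriented requests and dispatches them
        // as throughput-oriented batches, bounded by batch size and a time budget.
        class GpuBatcher {
        public:
            GpuBatcher(size_t max_batch, std::chrono::microseconds max_wait)
                : max_batch_(max_batch), max_wait_(max_wait),
                  worker_(&GpuBatcher::run, this) {}

            ~GpuBatcher() {
                { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
                cv_.notify_all();
                worker_.join();
            }

            void submit(Request r) {
                std::lock_guard<std::mutex> lk(m_);
                pending_.push_back(std::move(r));
                if (pending_.size() >= max_batch_) cv_.notify_all();
            }

        private:
            void run() {
                std::unique_lock<std::mutex> lk(m_);
                while (!stop_) {
                    // Wake when the batch is full, the wait budget expires, or we stop.
                    cv_.wait_for(lk, max_wait_, [this] {
                        return stop_ || pending_.size() >= max_batch_;
                    });
                    if (pending_.empty()) continue;
                    std::vector<Request> batch;
                    batch.swap(pending_);
                    lk.unlock();
                    process_batch_on_gpu(batch);   // throughput-oriented bulk work
                    lk.lock();
                }
            }

            size_t max_batch_;
            std::chrono::microseconds max_wait_;
            std::mutex m_;
            std::condition_variable cv_;
            std::vector<Request> pending_;
            bool stop_ = false;
            std::thread worker_;
        };

    The wait budget caps the extra latency any single request can incur while the batcher waits to fill a batch, which is the usual trade-off when adapting latency-oriented code paths to a throughput-oriented device.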

    A cross-stack, network-centric architectural design for next-generation datacenters

    This thesis proposes a full-stack, cross-layer datacenter architecture based on in-network computing and near-memory processing paradigms. The proposed datacenter architecture is built atop two principles: (1) utilizing commodity, off-the-shelf hardware (i.e., processors, DRAM, and network devices) with minimal changes to their architecture, and (2) providing a standard interface to programmers for using the novel hardware. More specifically, the proposed datacenter architecture enables a smart network adapter to collectively compress/decompress data exchanged between distributed DNN training nodes and to assist the operating system in performing aggressive processor power management. It also deploys specialized memory modules in the servers, capable of performing general-purpose computation and providing network connectivity. This thesis unlocks the potential of hardware and operating system co-design in architecting application-transparent, near-data processing hardware for improving datacenter performance, energy efficiency, and scalability. We evaluate the proposed datacenter architecture using a combination of full-system simulation, FPGA prototyping, and real-system experiments
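
    As one concrete illustration of the kind of in-network assistance described above, the C++ sketch below applies a simple uniform int8 quantization to gradient tensors, the sort of compression a smart network adapter could apply to data exchanged between training nodes. This is an assumption-laden example, not the thesis's actual scheme; the struct and function names are made up.

        #include <algorithm>
        #include <cmath>
        #include <cstdint>
        #include <vector>

        // Hypothetical gradient compression of the kind a smart NIC could apply
        // transparently to tensors exchanged between distributed training nodes:
        // uniform int8 quantization with a per-buffer scale factor.
        struct CompressedGrad {
            float scale;                 // max |g| / 127
            std::vector<int8_t> q;       // quantized values
        };

        CompressedGrad compress(const std::vector<float>& grad) {
            float max_abs = 0.0f;
            for (float g : grad) max_abs = std::max(max_abs, std::fabs(g));
            CompressedGrad out;
            out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
            out.q.reserve(grad.size());
            for (float g : grad)
                out.q.push_back(static_cast<int8_t>(std::lround(g / out.scale)));
            return out;   // roughly 4x smaller payload on the wire
        }

        std::vector<float> decompress(const CompressedGrad& c) {
            std::vector<float> grad;
            grad.reserve(c.q.size());
            for (int8_t v : c.q) grad.push_back(v * c.scale);
            return grad;
        }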

    NFV Platforms: Taxonomy, Design Choices and Future Challenges

    Due to the intrinsically inefficient service provisioning in traditional networks, Network Function Virtualization (NFV) keeps gaining attention from both industry and academia. By replacing purpose-built, expensive, proprietary network equipment with software network functions consolidated on commodity hardware, NFV envisions a shift towards a more agile and open service provisioning paradigm. During the last few years, a large number of NFV platforms have been implemented in production environments; they typically face critical challenges in the development, deployment, and management of Virtual Network Functions (VNFs). Nonetheless, just like any complex system, such platforms commonly consist of a large number of software and hardware components and usually incorporate disparate design choices based on distinct motivations or use cases. This broad collection of convoluted alternatives makes it extremely arduous for network operators to make proper choices. Although numerous efforts have been devoted to investigating different aspects of NFV, none of them specifically focused on NFV platforms or attempted to explore their design space. In this paper, we present a comprehensive survey of NFV platform design. Our study solely targets existing NFV platform implementations. We begin with a top-down architectural view of the standard reference NFV platform and present our taxonomy of existing NFV platforms based on the features they provide across a typical network function life cycle. We then thoroughly explore the design space and elaborate on the implementation choices each platform opts for. We also envision future challenges for NFV platform design in the coming 5G era. We believe that our study gives a detailed guideline for network operators or service providers to choose the most appropriate NFV platform based on their respective requirements, and it also provides guidelines for implementing new NFV platforms
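
    For readers unfamiliar with the consolidation model this survey covers, the toy C++ sketch below chains two software network functions (a firewall check and a NAT rewrite) over an in-memory packet. It is purely illustrative: real NFV platforms add batching, NUMA-aware I/O, and hardware offload, and none of the names here come from a specific platform.

        #include <cstdint>
        #include <functional>
        #include <vector>

        // Toy packet and VNF abstraction: each network function is a software stage
        // operating on packets in commodity-server memory; a platform chains them.
        struct Packet { uint32_t src_ip, dst_ip; uint16_t dst_port; bool drop; };

        using VNF = std::function<void(Packet&)>;

        void run_chain(const std::vector<VNF>& chain, Packet& p) {
            for (const auto& nf : chain) {
                nf(p);
                if (p.drop) return;   // e.g., dropped by the firewall stage
            }
        }

        int main() {
            std::vector<VNF> chain = {
                [](Packet& p) { if (p.dst_port == 23) p.drop = true; },   // firewall
                [](Packet& p) { p.src_ip = 0x0A000001; },                 // NAT rewrite
            };
            Packet p{0xC0A80002, 0x08080808, 443, false};
            run_chain(chain, p);
        }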

    LEONARDO: A Pan-European Pre-Exascale Supercomputer for HPC and AI Applications

    A new pre-exascale computer cluster has been designed to foster scientific progress and competitive innovation across European research systems; it is called LEONARDO. This paper describes the general architecture of the system and focuses on the technologies adopted for its GPU-accelerated partition. High-density processing elements, fast data movement capabilities and a mature software stack allow the machine to run intensive workloads in a flexible and scalable way. Scientific applications from traditional High Performance Computing (HPC) as well as emerging Artificial Intelligence (AI) domains can benefit from this large apparatus in terms of time and energy to solution. (16 pages, 5 figures, 7 tables; to be published in the Journal of Large Scale Research Facilities)

    Modelling of Information Flow and Resource Utilization in the EDGE Distributed Web System

    The adoption of Distributed Web Systems (DWS) into the modern engineering design process has increased dramatically in recent years. The Engineering Design Guide and Environment (EDGE) is one such DWS, intended to provide an integrated set of tools for use in the development of new products and services. Previous attempts to improve the efficiency and scalability of DWS focused largely on hardware utilization (i.e. multithreading and virtualization) and software scalability (i.e. load balancing and cloud services). However, these techniques are often limited to analysis of the computational complexity of the algorithms implemented. This work seeks to improve the understanding of DWS efficiency and scalability by modelling the dynamics of information flow and resource utilization, characterizing DWS workloads through historical usage data (i.e. request type, frequency, access time). The design and implementation of EDGE are described. A DWS model of an EDGE system is developed, validated against theoretical limiting cases, and used to predict the throughput of an EDGE system given a resource allocation and workflow. Results of the simulation suggest that proposed DWS designs can be evaluated according to the usage requirements of an engineering firm, ultimately guiding an informed decision for the selection and deployment of a DWS in an enterprise environment. Recommendations for future work related to the continued development of EDGE, DWS modelling of EDGE installation environments, and the extension of DWS modelling to new product development processes are presented
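
    The back-of-the-envelope C++ sketch below shows the general flavour of predicting throughput from historical usage data (request type, frequency, and mean service time) under a given resource allocation. It is only an illustrative utilization check with made-up numbers, not the DWS model developed in this work.

        #include <cstdio>
        #include <vector>

        // Capacity check built from historical usage data: offered load per request
        // class versus a hypothetical worker allocation.
        struct RequestClass {
            const char* name;
            double arrivals_per_s;   // observed frequency
            double service_s;        // mean service time per request
        };

        int main() {
            std::vector<RequestClass> classes = {
                {"page_load", 40.0, 0.015},
                {"file_upload", 2.0, 0.300},
                {"search", 10.0, 0.050},
            };
            int workers = 4;   // hypothetical resource allocation

            double offered_work = 0.0, offered_rate = 0.0;
            for (const auto& c : classes) {
                offered_work += c.arrivals_per_s * c.service_s;   // busy-seconds per second
                offered_rate += c.arrivals_per_s;
            }
            double utilization = offered_work / workers;
            // If the allocation saturates, throughput is capped by capacity.
            double predicted_throughput =
                (utilization < 1.0) ? offered_rate : offered_rate / utilization;

            std::printf("utilization=%.2f throughput=%.1f req/s\n",
                        utilization, predicted_throughput);
        }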

    TCP/IP acceleration in a cloud-based mobile network

    Mobile traffic rates are growing constantly. The currently used technology, Long-Term Evolution (LTE), is already mature and receives only small incremental improvements, so a major paradigm shift is needed to support future growth. Together with the transition to the fifth generation of mobile telecommunications (5G), companies are moving towards Network Function Virtualization (NFV). By decoupling network functions from the hardware, it is possible to achieve lower development and management costs as well as better scalability. The major change from dedicated hardware to the cloud does not happen without issues. One key challenge is building telecommunications-grade, ultra-low-latency and low-jitter data storage for call session data. Once this is overcome, it enables new ways to build much simpler stateless radio applications. Many technologies can be used to achieve lower latencies in cloud infrastructure. In the future, technologies such as memory-centric computing could revolutionize the whole infrastructure and provide nanosecond latencies; in the short term, however, viable solutions are purely software-based, for example databases and transport layer protocols optimized for latency. Traffic processing can also be accelerated using libraries and drivers such as the Data Plane Development Kit (DPDK). However, DPDK does not provide transport layer support, so additional frameworks are needed to unleash the potential of Transmission Control Protocol/Internet Protocol (TCP/IP) acceleration. In this thesis, TCP/IP acceleration is studied as a method for providing ultra-low-latency and low-jitter communication for call session data storage. Two major frameworks, VPP and F-Stack, were selected for evaluation. The major finding is that these frameworks are not as mature as expected and thus failed to deliver production-ready performance. Building a robust interface for applications to use was recognized as a common problem in the market
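
    As a point of reference for the kind of measurement such an evaluation relies on, the C++ sketch below times request/response round trips over an ordinary kernel TCP socket and reports mean latency and jitter. The address and port are placeholders, an echo server is assumed to be running, and the kernel-bypass stacks under study (VPP, F-Stack) are not invoked here.

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <unistd.h>

        #include <chrono>
        #include <cmath>
        #include <cstdio>
        #include <vector>

        // Baseline kernel TCP echo-latency probe of the kind user-space stacks
        // would be compared against. Assumes an echo server at host:port.
        int main() {
            const char* host = "127.0.0.1";
            const int port = 7;               // assumed echo service
            const int rounds = 1000;

            int fd = socket(AF_INET, SOCK_STREAM, 0);
            sockaddr_in addr{};
            addr.sin_family = AF_INET;
            addr.sin_port = htons(port);
            inet_pton(AF_INET, host, &addr.sin_addr);
            if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
                std::perror("connect");
                return 1;
            }

            std::vector<double> rtt_us;
            char buf[64] = "session-record";
            for (int i = 0; i < rounds; ++i) {
                auto t0 = std::chrono::steady_clock::now();
                send(fd, buf, sizeof(buf), 0);
                recv(fd, buf, sizeof(buf), MSG_WAITALL);
                auto t1 = std::chrono::steady_clock::now();
                rtt_us.push_back(std::chrono::duration<double, std::micro>(t1 - t0).count());
            }
            close(fd);

            double mean = 0.0;
            for (double v : rtt_us) mean += v;
            mean /= rtt_us.size();
            double var = 0.0;
            for (double v : rtt_us) var += (v - mean) * (v - mean);
            double jitter = std::sqrt(var / rtt_us.size());
            std::printf("mean RTT %.1f us, jitter (stddev) %.1f us\n", mean, jitter);
        }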

    Parallel architectures and runtime systems co-design for task-based programming models

    The increasing parallelism of modern computing systems has highlighted the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of programming models and applications. Nowadays, system design consists of several layers stacked on top of each other, from the architecture up to the application software. Although this layering enables a separation of concerns, where each layer can be changed independently thanks to well-defined interfaces between them, it hampers the design of future systems as Moore's Law comes to an end. Current performance improvements in computer architecture are driven by shrinking the transistor channel width, which allows faster and more power-efficient chips to be made. However, technology is reaching physical limits beyond which the transistor size cannot be reduced further, and this requires a change of paradigm in system design. This thesis proposes to break this layered design and advocates for a system in which the architecture and the programming model's runtime system exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information, such as the Task Dependency Graph (TDG) in dataflow task-based programming models, power consumption can be reduced by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support for creating such a graph, reducing runtime overheads and making the execution of fine-grained tasks possible, which increases the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system to enable more efficient communication scheduling, creating new opportunities for computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided, and a methodology to simulate and characterize application behavior is also presented
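
    To make the critical-path idea concrete, the C++ sketch below computes earliest and latest finish times over a tiny, made-up task dependency graph; tasks with zero slack lie on the critical path and would be run at full speed, while tasks with slack are candidates for lower-power execution. This is an illustrative calculation, not the runtime or hardware mechanism proposed in the thesis.

        #include <algorithm>
        #include <cstdio>
        #include <vector>

        // Minimal TDG sketch: each task has a cost and a list of predecessor indices,
        // listed in a valid topological order.
        struct Task { double cost; std::vector<int> deps; };

        int main() {
            // A small dataflow graph: tasks 0 and 1 feed 2; tasks 2 and 3 feed 4.
            std::vector<Task> tdg = {
                {4.0, {}}, {1.0, {}}, {3.0, {0, 1}}, {2.0, {}}, {5.0, {2, 3}},
            };

            // Forward pass: earliest finish time of each task.
            std::vector<double> finish(tdg.size(), 0.0);
            for (size_t i = 0; i < tdg.size(); ++i) {
                double start = 0.0;
                for (int d : tdg[i].deps) start = std::max(start, finish[d]);
                finish[i] = start + tdg[i].cost;
            }
            double critical_path = *std::max_element(finish.begin(), finish.end());
            std::printf("critical path length = %.1f\n", critical_path);

            // Backward pass: latest finish times that do not stretch the critical path.
            std::vector<double> latest(tdg.size(), critical_path);
            for (size_t i = tdg.size(); i-- > 0;) {
                double latest_start = latest[i] - tdg[i].cost;
                for (int d : tdg[i].deps)
                    latest[d] = std::min(latest[d], latest_start);
            }
            for (size_t i = 0; i < tdg.size(); ++i) {
                double slack = latest[i] - finish[i];
                std::printf("task %zu: slack %.1f%s\n", i, slack,
                            slack == 0.0 ? " (critical path)" : "");
            }
        }

    In this example the critical path is 0 -> 2 -> 4 with length 12; tasks 1 and 3 have slack and could be slowed without affecting the overall execution time.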

    Doctor of Philosophy

    High-performance supercomputers on the Top500 list are commonly designed around commodity CPUs. Most of the codes executed on these machines are message-passing codes using the Message Passing Interface (MPI). It therefore makes sense to look at these machines from a holistic systems-architecture perspective and consider optimizations to commodity processors that make them more efficient in message-passing architectures. Described herein is a new User-Level Notification (ULN) architecture that significantly improves message-passing performance. The architecture integrates a simultaneous multithreaded (SMT) processor with a user-level network interface (NI) that can directly control the execution scheduling of threads on the processor. By allowing the network interface to control the execution of message-handling code at the user level, the operating system (OS) overhead for handling interrupts and dispatching user code in response to notifications is eliminated. By using an SMT processor, message handling can be performed in one thread concurrently with user computation in other threads, so most of the overhead of executing message handlers can be hidden. This dissertation presents measurements showing that the OS overheads related to message passing are significant in modern architectures and describes a new architecture that significantly reduces them. On a communication-intensive real-world application, the ULN architecture provides a 50.9% performance improvement over a more traditional OS-based NIC and a 5.29-31.9% improvement over a best-of-class user-level NIC, thanks to user-level notifications
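
    A rough software analogue of the notification path described here is sketched below in C++: a dedicated handler thread polls a flag that a user-level network interface would set on message arrival, so handling proceeds concurrently with computation and without an OS interrupt or system call. The hardware side of ULN (the NI directly controlling SMT thread scheduling) is not modeled, and all names are hypothetical.

        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <thread>

        // Software analogue of user-level notifications: a handler thread polls a
        // shared flag that a user-level NIC would set on message arrival, so message
        // handling overlaps with computation and avoids interrupt/syscall overhead.
        std::atomic<bool> message_ready{false};
        std::atomic<bool> done{false};

        void notification_handler() {
            while (!done.load(std::memory_order_acquire)) {
                if (message_ready.exchange(false, std::memory_order_acq_rel)) {
                    // User-level handler: e.g., match a pending receive, copy payload.
                    std::printf("handled incoming message at user level\n");
                }
            }
        }

        int main() {
            std::thread handler(notification_handler);   // would share an SMT core

            // Compute thread does application work while "messages" arrive.
            for (int i = 0; i < 3; ++i) {
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                message_ready.store(true, std::memory_order_release);   // "NIC" signals
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            done.store(true, std::memory_order_release);
            handler.join();
        }

    The polling thread stands in for the hidden-overhead handler thread of the ULN design; in hardware, the NI would wake or schedule that thread directly instead of relying on a busy-poll loop.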