19 research outputs found

    Design of large polyphase filters in the Quadratic Residue Number System

    Full text link

    Advanced techniques for multi-variant execution

    Get PDF

    Temperature aware power optimization for multicore floating-point units

    Full text link

    Improving Network Performance Through Endpoint Diagnosis And Multipath Communications

    Get PDF
    Components of networks, and by extension the internet can fail. It is, therefore, important to find the points of failure and resolve existing issues as quickly as possible. Resolution, however, takes time and its important to maintain high quality of service (QoS) for existing clients while it is in progress. In this work, our goal is to provide clients with means of avoiding failures if/when possible to maintain high QoS while enabling them to assist in the diagnosis process to speed up the time to recovery. Fixing failures relies on first detecting that there is one and then identifying where it occurred so as to be able to remedy it. We take a two-step approach in our solution. First, we identify the entity (Client, Server, Network) responsible for the failure. Next, if a failure is identified as network related additional algorithms are triggered to detect the device responsible. To achieve the first step, we revisit the question: how much can you infer about a failure using TCP statistics collected at one of the endpoints in a connection? Using an agent that captures TCP statistics at one of the end points we devise a classification algorithm that identifies the root cause of failures. Using insights derived from this classification algorithm we identify dominant TCP metrics that indicate where/why problems occur. If/when a failure is identified as a network related problem, the second step is triggered, where the algorithm uses additional information that is collected from ``failed\u27\u27 connections to identify the device which resulted in the failure. Failures are also disruptive to user\u27s performance. Resolution may take time. Therefore, it is important to be able to shield clients from their effects as much as possible. One option for avoiding problems resulting from failures is to rely on multiple paths (they are unlikely to go bad at the same time). The use of multiple paths involves both selecting paths (routing) and using them effectively. The second part of this thesis explores the efficacy of multipath communication in such situations. It is expected that multi-path communications have monetary implications for the ISP\u27s and content providers. Our solution, therefore, aims to minimize such costs to the content providers while significantly improving user performance

    Dynamic shared memory architecture, systems, and optimizations for high performance and secure virtualized cloud

    Get PDF
    Dynamic memory consolidation is an important enabler for high-performance virtual machine (VM) execution in the virtualized cloud. Efficient just-in-time memory balancing requires three core capabilities: (i) detecting memory pressure across VMs hosted on a physical machine; (ii) allocating memory to the respective VMs; and (iii) enabling fast recovery once newly allocated memory becomes available at the high-pressure VMs. Although balloon-driver technology facilitates the second task, it remains difficult to predict VM memory demands accurately at affordable overhead, especially under unpredictable and changing workloads. Furthermore, no prior study has analyzed how slowly VM execution responds to newly available memory when applications recover through paging. In this dissertation research, I have made four original contributions to dynamic shared memory management, spanning architecture, systems, and optimizations that improve VM execution performance and security. First, we designed and developed MemPipe, a shared-memory inter-VM communication channel for fast inter-VM network I/O. MemPipe increases shared-memory utilization by adaptively adjusting the shared-memory size according to workload demands, and it reduces inter-VM network communication overhead by copying packets directly from the sender VM's user space to the shared-memory area. Second, we developed iBalloon, a lightweight and transparent prediction-based facility that enables automated or semi-automated ballooning with more customizable, accurate, and efficient memory-balancing policies among VMs. Third, we developed MemFlex, a novel shared-memory swapping facility that effectively utilizes idle host memory through a hybrid memory swap-out model and a fast swap-in optimization. Fourth, we introduced SecureStack, a kernel-backed tool that prevents sensitive data on the function stack from being illegally accessed by untrusted functions. SecureStack introduces three procedures to protect, restore, and clear the stack in a reliable and low-cost manner; it is highly transparent to users and introduces no new vulnerability into the existing system. These research developments are packaged into MemLego, a new memory-management framework for memory-centric computing in the big data era.
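
    A minimal sketch of prediction-based memory balancing in the spirit of iBalloon, under stated assumptions: each VM's demand is estimated with an exponential moving average of recent usage, and the host's memory pool is divided proportionally. The smoothing factor, reserve, and function names are illustrative, not the actual iBalloon policy.

```python
# Hypothetical sketch of prediction-based memory balancing: estimate each
# VM's demand with an exponential moving average of recent usage and split
# the host pool proportionally. Constants and names are assumptions.
ALPHA = 0.5          # smoothing factor for the demand estimate
RESERVE_MB = 256     # minimum memory kept by every VM

def predict_demand(history_mb, prev_estimate_mb):
    """Exponentially weighted estimate of a VM's memory demand (MB)."""
    latest = history_mb[-1]
    return ALPHA * latest + (1 - ALPHA) * prev_estimate_mb

def balance(pool_mb, estimates_mb):
    """Divide the host pool across VMs in proportion to predicted demand."""
    total = sum(estimates_mb.values())
    alloc = {}
    for vm, est in estimates_mb.items():
        share = pool_mb * est / total if total else pool_mb / len(estimates_mb)
        alloc[vm] = max(RESERVE_MB, int(share))   # sketch: ignores overcommit
    return alloc

estimates = {"vm1": predict_demand([900, 1200], 800),
             "vm2": predict_demand([400, 300], 500)}
print(balance(4096, estimates))
```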

    High Performance Transaction Processing on Non-Uniform Hardware Topologies

    Get PDF
    Transaction processing is a mission-critical enterprise application that runs on high-end servers. Traditionally, transaction processing systems have been designed for uniform core-to-core communication latencies. In the past decade, with the emergence of multisocket multicores, for the first time we have Islands, i.e., groups of cores that communicate fast among themselves and more slowly with other groups. In current mainstream servers, each multicore processor corresponds to an Island. As the number of cores on a chip increases, however, we expect multiple Islands to form within a single processor in the near future. In addition, the access latencies to local memory and to the memory of another server over a fast interconnect are converging, creating a hierarchy of Islands within a group of servers. Non-uniform hardware topologies pose a significant challenge to the scalability and the predictability of performance of transaction processing systems. Distributed transaction processing systems can alleviate this problem; however, no single deployment configuration is optimal for all workloads and hardware topologies. In order to fully utilize the available processing power, a transaction processing system needs to adapt to the underlying hardware topology and tune its configuration to the current workload. More specifically, the system should be able to detect any changes to the workload and hardware topology, and adapt accordingly without disrupting processing. In this thesis, we first systematically quantify the impact of hardware Islands on deployment configurations of distributed transaction processing systems. We show that none of these configurations is optimal for all workloads, and that the choice of the optimal configuration depends on the combination of workload and hardware topology. In the cluster setting, the choice of optimal configuration additionally depends on the properties of the communication channel between the servers. We address this challenge by designing a dynamic shared-everything system that adapts its data structures automatically to hardware Islands. To ensure good performance in the presence of shifting workload patterns, we use a lightweight partitioning and placement mechanism to balance the load and minimize synchronization overheads across Islands. Overall, we show that masking the non-uniformity of inter-core communication is critical for achieving predictably high performance for latency-sensitive applications such as transaction processing. With clusters of a handful of multicore chips with large main memories replacing high-end many-socket servers, the deployment rules of thumb identified in our analysis have the potential to significantly reduce the synchronization and communication costs of transaction processing. As workloads become more dynamic and diverse while still running on partitioned infrastructure, the lightweight monitoring and adaptive repartitioning mechanisms proposed in this thesis will be applicable to a wide range of designs for which traditional offline schemes are impractical.
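
    As a rough illustration of Island-aware placement (not the thesis's actual mechanism), the sketch below groups cores into Islands by socket and keeps each data partition's workers within a single Island, so synchronization never crosses Island boundaries. All names are illustrative assumptions.

```python
# Hypothetical sketch: island-aware partition placement. One Island per
# socket, as in the mainstream-server case described above; each partition's
# worker threads are confined to a single Island.
from collections import defaultdict

def islands_from_topology(core_to_socket):
    """Group core ids into Islands, one Island per socket."""
    groups = defaultdict(list)
    for core, socket in core_to_socket.items():
        groups[socket].append(core)
    return [sorted(cores) for _, cores in sorted(groups.items())]

def place_partitions(num_partitions, islands):
    """Round-robin partitions over Islands; each partition stays on one Island."""
    return {p: islands[p % len(islands)] for p in range(num_partitions)}

topology = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
print(place_partitions(4, islands_from_topology(topology)))
# partition 0 -> cores of socket 0, partition 1 -> cores of socket 1, ...
```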

    Datacenter Architectures for the Microservices Era

    Full text link
    Modern internet services are shifting away from single-binary, monolithic services toward numerous loosely coupled microservices that interact via Remote Procedure Calls (RPCs), improving the programmability, reliability, manageability, and scalability of cloud services. Microservice-based architectures confront computer system designers with many new challenges, as individual RPCs/tasks in most microservices are only a few microseconds long. In this dissertation, I address the most notable challenges that arise from the differences between modern microservice-based and classic monolithic cloud services, and design novel server architectures and runtime systems that enable efficient execution of µs-scale microservices on modern hardware. In the first part of my dissertation, I address the problem of Killer Microseconds, which refers to µs-scale "holes" in CPU schedules caused by stalls to access fast I/O devices or brief idle times between requests in high-throughput µs-scale microservices. Whereas modern computing platforms can efficiently hide ns-scale and ms-scale stalls through micro-architectural techniques and OS context switching, they lack efficient support to hide the latency of µs-scale stalls. In chapter II, I propose Duplexity, a heterogeneous server architecture that employs aggressive multithreading to hide the latency of killer microseconds without sacrificing the Quality-of-Service (QoS) of latency-sensitive microservices. Duplexity achieves 1.9× higher core utilization and 2.7× lower iso-throughput 99th-percentile tail latency, on average, than an SMT-based server design. In chapters III-IV, I comprehensively investigate the problem of tail latency in the context of microservices and address multiple aspects of it. First, in chapter III, I characterize the tail latency behavior of microservices and provide general guidelines for optimizing computer systems from a queuing perspective to minimize tail latency. Queuing is a major contributor to end-to-end tail latency, wherein nominal tasks are enqueued behind rare, long ones due to Head-of-Line (HoL) blocking. Next, in chapter IV, I introduce Q-Zilla, a scheduling framework that tackles tail latency from a queuing perspective, and CoreZilla, a microarchitectural instantiation of the framework. Q-Zilla is composed of the ServerQueue Decoupled Size-Interval Task Assignment (SQD-SITA) scheduling algorithm and the Express-lane Simultaneous Multithreading (ESMT) microarchitecture, which together address HoL blocking by providing an "express lane" for short tasks, protecting them from queuing behind rare, long ones. By combining the ESMT microarchitecture and the SQD-SITA scheduling algorithm, CoreZilla improves tail latency over a conventional SMT core with 2, 4, and 8 contexts by 2.25×, 3.23×, and 4.38×, on average, respectively, and with 8 contexts outperforms a theoretical 32-core scale-up organization by 12%, on average. Finally, in chapters V-VI, I investigate the tail latency problem of microservices from a cluster, rather than server-level, perspective. Whereas Service Level Objectives (SLOs) define end-to-end latency targets for the entire service to ensure user satisfaction, with microservice-based applications it is unclear how to scale individual microservices when end-to-end SLOs are violated or the service is underutilized. I introduce Parslo, an analytical framework for partial SLO allocation in virtualized cloud microservices.
    Parslo takes a microservice graph as input and employs a gradient descent-based approach to allocate "partial SLOs" to different microservice nodes, enabling independent auto-scaling of individual microservices. Parslo achieves the optimal solution, minimizing the total cost for the entire service deployment, and is applicable to general microservice graphs.
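
    A minimal sketch of the express-lane idea behind SQD-SITA, under stated assumptions: requests whose predicted service time falls below a cutoff are dispatched to a reserved short-task queue so they cannot queue behind rare, long requests. The cutoff value and class names are illustrative, not the dissertation's implementation.

```python
# Hypothetical sketch of size-interval task assignment with an express lane:
# short tasks get a dedicated queue, avoiding Head-of-Line blocking behind
# long tasks. Cutoff and names are illustrative assumptions.
from collections import deque

class ExpressLaneDispatcher:
    def __init__(self, cutoff_us: float):
        self.cutoff_us = cutoff_us
        self.express = deque()   # short tasks only
        self.general = deque()   # everything else

    def submit(self, task_id: str, predicted_us: float) -> str:
        queue = self.express if predicted_us < self.cutoff_us else self.general
        queue.append(task_id)
        return "express" if queue is self.express else "general"

d = ExpressLaneDispatcher(cutoff_us=50.0)
print(d.submit("rpc-1", 12.0))    # -> "express"
print(d.submit("rpc-2", 900.0))   # -> "general"
```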

    Scaling Up Concurrent Analytical Workloads on Multi-Core Servers

    Get PDF
    Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket, multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and they neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduced reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both techniques for highly concurrent analytical workloads. Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that no single strategy for task scheduling and data placement is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets incurs overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. We therefore argue that sharing and NUMA-awareness are key factors for faster processing of big-data analytical applications, full exploitation of the hardware resources of modern multi-core servers, and a more responsive user experience.
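
    A minimal sketch of the adaptive stealing rule implied by the finding above, assuming a per-task memory-intensity estimate: a task may be stolen across sockets only if it is not memory-intensive, since migrating bandwidth-bound work adds remote NUMA traffic. The threshold and task fields are illustrative assumptions, not the prototype's actual policy.

```python
# Hypothetical sketch: allow inter-socket stealing only for compute-bound
# tasks; keep memory-intensive tasks on their home socket. Constants and
# fields are assumptions.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    bytes_per_cycle: float  # rough memory intensity of the operator

MEM_INTENSIVE_THRESHOLD = 0.5  # assumed cutoff, bytes accessed per cycle

def may_steal(task: Task, thief_socket: int, home_socket: int) -> bool:
    """Allow a steal only if it stays on-socket or the task is compute-bound."""
    if thief_socket == home_socket:
        return True
    return task.bytes_per_cycle < MEM_INTENSIVE_THRESHOLD

scan = Task("table-scan", bytes_per_cycle=2.0)
expr = Task("expression-eval", bytes_per_cycle=0.1)
print(may_steal(scan, thief_socket=1, home_socket=0))  # False: keep it local
print(may_steal(expr, thief_socket=1, home_socket=0))  # True: cheap to move
```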

    A generic software architecture for portable applications in heterogeneous wireless sensor networks

    Get PDF
    In recent years, wireless sensor networks (WSNs) have been acquiring more importance as a promising technology based on tiny devices, called sensor nodes or motes, that can monitor a wide range of physical phenomena through sensors. Numerous branches of science benefit from them. The intrinsic ubiquity of sensor nodes and the absence of network infrastructure make it possible to deploy them in hostile or, until now, unknown environments that have typically been inaccessible to humans, such as volcanoes or glaciers, providing precise and up-to-date data. As potential applications continue to arise, new technical and conceptual challenges appear. The severe hardware restrictions of sensor nodes with respect to computation, communication, and especially energy have posed new and exciting requirements. In particular, research is moving towards heterogeneous networks that contain different devices running custom WSN operating systems. Operating systems specifically designed for sensor nodes are intended to manage the hardware resources efficiently and facilitate programming. Nevertheless, they often lack the generality and the high-level abstractions expected at this abstraction layer. Consequently, they do not completely hide either the underlying platform or its execution model, which ties application programming closely to the operating system and thus reduces portability. This thesis focuses on the portability of applications in heterogeneous wireless sensor networks. To contribute to this important challenge, the thesis proposes a generic, sensor node-centric software architecture that supports the application development process by homogenizing and facilitating access to different WSN operating systems. Specifically, the following main objectives have been established.
    * Design and implement a generic sensor node-centric architecture that clearly distinguishes the different abstraction levels in a sensor node. The architecture should be flexible enough to incorporate high-level abstractions that facilitate programming.
    * As part of the architecture, construct an intermediate layer between applications and the sensor node operating system. This layer is intended to abstract away the operating system by offering a set of homogeneous services and mapping them onto operating system-specific requests. To achieve this, programming language extensions also have to be specified on top of the architecture so that portable applications can be written; platform-specific code can then be generated from these high-level applications for different sensor node platforms. In this way, the architecture deals with the problems of heterogeneity and portability.
    * Evaluate the feasibility of incorporating the abstractions mentioned above into the development process, in terms of portability, efficiency, and productivity. In this environment the footprint is an especially critical issue, due to the hardware limitations; an excessive overhead in application size could make the proposed solution prohibitive.
    The thesis presents a generic software architecture for portable applications in heterogeneous wireless sensor networks. The proposed solution and its evaluation are described in this document.
    Theoretical and practical contributions of this thesis and the main future research directions are also presented.
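
    A minimal sketch of the intermediate layer's structure, shown in Python purely to illustrate the idea (a real mote implementation would be generated as platform-specific C): portable application code depends only on a homogeneous service interface, and per-OS backends map each call onto that operating system's own primitives. The service names and the stub backend are illustrative assumptions.

```python
# Hypothetical sketch: applications call a homogeneous service interface;
# per-OS backends translate each call into OS-specific requests. The backend
# below is an illustrative stub, not a real Contiki binding.
from abc import ABC, abstractmethod

class SensorServices(ABC):
    """Homogeneous services offered to portable applications."""
    @abstractmethod
    def read_temperature(self) -> float: ...
    @abstractmethod
    def send(self, payload: bytes) -> None: ...

class ContikiBackend(SensorServices):
    def read_temperature(self) -> float:
        return 21.5                       # stub: would call the OS sensor API
    def send(self, payload: bytes) -> None:
        print("contiki tx", payload)      # stub: would use the OS networking stack

def sample_and_report(services: SensorServices) -> None:
    """Portable application code: depends only on the homogeneous layer."""
    t = services.read_temperature()
    services.send(f"temp={t}".encode())

sample_and_report(ContikiBackend())
```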