
    High Speed Networking In The Multi-Core Era

    High-speed networking is a demanding task that has traditionally been performed by dedicated, purpose-built hardware or specialized network processors. These platforms sacrifice flexibility or programmability in favor of performance. Recently, there has been much interest in using multi-core general-purpose processors for this task, since they are easily programmable and upgradeable. How best to exploit these architectures for networking is an open question and the subject of much recent research. In this dissertation, I explore how to exploit multi-core general-purpose processors for packet-processing applications, covering both new architectural organizations for the processors and changes to the systems software. I intend to demonstrate the efficacy of these techniques by using them to build an open and extensible network security and monitoring platform that can outperform existing solutions.
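    To make the parallelism concrete, here is a minimal sketch of one common pattern for multi-core packet processing: hash each packet's flow 5-tuple to pick a fixed worker, so per-flow state needs no cross-core locking. This is our illustration of the general technique, not the dissertation's actual platform; the packet fields and the inspect() hook are assumptions.

        # Minimal sketch: dispatch packets to per-core workers by flow hash.
        import queue
        import threading

        NUM_WORKERS = 4  # e.g., one worker per core

        def flow_key(pkt):
            # Assumes pkt is a dict carrying a parsed 5-tuple (illustrative).
            return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
                    pkt["dst_port"], pkt["proto"])

        def inspect(pkt):
            pass  # placeholder for per-packet analysis (e.g., signature matching)

        def worker(q):
            while True:
                pkt = q.get()
                if pkt is None:  # sentinel: shut down
                    break
                inspect(pkt)

        queues = [queue.Queue() for _ in range(NUM_WORKERS)]
        for q in queues:
            threading.Thread(target=worker, args=(q,), daemon=True).start()

        def dispatch(pkt):
            # The same flow always lands on the same worker, preserving
            # per-flow ordering without cross-core synchronization.
            queues[hash(flow_key(pkt)) % NUM_WORKERS].put(pkt)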

    Improving network intrusion detection system performance through quality of service configuration and parallel technology

    This paper outlines an innovative software development that uses Quality of Service (QoS) and parallel technologies in Cisco Catalyst switches to increase the analytical performance of a Network Intrusion Detection and Protection System (NIDPS) deployed in high-speed networks. We designed a real network to run experiments using the Snort NIDPS. Our experiments demonstrate the weaknesses of NIDPSes, such as the inability to process multiple packets concurrently and the propensity to drop packets without analysing them under heavy traffic in high-speed networks. We tested Snort’s analysis performance, gauging the number of packets sent, analysed, dropped, filtered, injected, and outstanding. We propose using QoS configuration in a Cisco Catalyst 3560 Series Switch together with parallel Snort instances to improve NIDPS performance and reduce the number of dropped packets. Our results show that this novel configuration improves performance.
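    For context, the analysed-versus-dropped comparison in such experiments reduces to simple ratios over the gauged counters. A hypothetical sketch (the counter names mirror the statistics listed above; the numbers are made up):

        # Hypothetical sketch: summarize NIDPS packet statistics.
        def summarize(sent, analysed, dropped):
            analysed_pct = 100.0 * analysed / sent if sent else 0.0
            dropped_pct = 100.0 * dropped / sent if sent else 0.0
            return {"analysed_%": round(analysed_pct, 1),
                    "dropped_%": round(dropped_pct, 1)}

        # Example: heavy traffic where a single NIDPS instance cannot keep up.
        print(summarize(sent=1_000_000, analysed=620_000, dropped=380_000))
        # -> {'analysed_%': 62.0, 'dropped_%': 38.0}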

    Elastic techniques to handle dynamism in real-time data processing systems

    Real-time data processing is a crucial component of cloud computing today. It is widely adopted to provide an up-to-date view of data for social networks, cloud management, web applications, edge, and IoT infrastructures. Real-time processing frameworks are designed for time-sensitive tasks such as event detection, real-time data analysis, and prediction. Compared to offline batch processing, real-time data processing applications tend to be long-running and are prone to performance issues caused by many unpredictable environmental variables, including (but not limited to) job specification, user expectation, and available resources. To cope with this challenge, it is crucial for system designers to improve frameworks’ ability to adjust their resource usage to changing environmental variables, a property defined as system elasticity. This thesis investigates how elastic resource provisioning helps cloud systems process real-time data while maintaining predictable performance under workload fluctuations, in an automated manner. We explore new algorithms, framework designs, and efficient system implementations to achieve this goal. Distributed systems today must also continuously handle varied application specifications, hardware configurations, and workload characteristics. Maintaining stable performance requires systems to explicitly plan resource allocation when an application starts and to tailor the allocation dynamically at run time. In this thesis, we show how system elasticity lets systems provide tunable performance under the dynamism of many environmental variables without compromising resource efficiency. Specifically, the thesis focuses on the two following aspects:
    i) Elasticity-aware scheduling: Real-time data processing systems today are often designed in a resource- and workload-agnostic fashion. As a result, most users cannot perform resource planning before launching an application, nor intelligently adjust resource allocation (both within and across application boundaries) during the run. The first part of this thesis (Stela [1], Henge [2], Getafix [3]) explores efficient mechanisms for performance analysis that also enable elasticity-aware scheduling in today’s cloud frameworks.
    ii) Resource-efficient cloud stack: The second line of work aims to improve the underlying cloud stack to support self-adaptive, highly efficient resource provisioning. Today’s cloud systems enforce full isolation, which prevents fine-grained resource sharing among applications over time. This work (Cameo [4], Dirigo) builds real-time data processing systems for emerging cloud infrastructures that achieve high resource utilization through fine-grained resource sharing.
    Given that the market for real-time data analysis is expected to grow at an annual rate of 28.2% and reach 35.5 billion by 2024 [5], improving system elasticity can significantly reduce deployment cost and increase resource utilization. Our work improves the performance of real-time data analytics applications within resource constraints. We highlight some of the improvements below (a sketch of the underlying elasticity idea follows the list):
    i) Stela explores elastic techniques for single-tenant, on-demand dataflow scale-out and scale-in operations. It improves post-scale throughput by 45-120% for on-demand scale-out and by 2-5× for on-demand scale-in.
    ii) Henge develops a mechanism that maps an application’s performance onto a unified scale of resource needs. It reduces resource consumption by 40-60% while maintaining the same level of SLO achievement across the cluster.
    iii) Getafix implements a strategy to analyze workloads dynamically and adaptively decide how many replicas to create and where to place them. It achieves comparable query latency (both average and tail) while delivering 1.45-2.15× memory savings.
    iv) Cameo proposes a scheduler that supports data-driven, fine-grained operator execution guided by user expectations. It improves cluster utilization by 6× and reduces performance violations by 72% while packing more jobs into a shared cluster.
    v) Dirigo performs fully decentralized, function-state-aware, global message scheduling for stateful functions. It reduces tail latency by 60% compared to local scheduling and remote state accesses by 19× compared to a scheduler that is unaware of function states.
    These works can lead to substantial cost savings for both cloud providers and end-users.
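    As noted above, here is a minimal sketch of the elasticity idea common to these systems: scale an operator out when it falls behind its SLO and in when it is comfortably over-provisioned. The thresholds, doubling policy, and function signature are our assumptions, not the actual algorithms of Stela, Henge, or the other systems.

        # Minimal sketch: threshold-based elastic scaling for a streaming operator.
        def autoscale(parallelism, observed_tput, slo_tput, max_parallelism=64):
            if observed_tput < 0.9 * slo_tput:
                # Falling behind the SLO: scale out.
                return min(parallelism * 2, max_parallelism)
            if observed_tput > 1.5 * slo_tput and parallelism > 1:
                # Comfortably above the SLO: scale in to save resources.
                return max(parallelism // 2, 1)
            return parallelism

        # Example: an operator at parallelism 8 missing its SLO is doubled to 16.
        print(autoscale(parallelism=8, observed_tput=800, slo_tput=1000))  # -> 16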

    Efficient Methods for Parallelizing Kernel Trace Analysis

    Highly parallel computer architectures are now increasingly commonplace, whether in commercial or consumer-grade systems. Detecting and solving runtime problems in software running in a parallel environment is a complicated task, where classic debugging tools are of little help. Previous research has shown that tracing offers an efficient and scalable way to resolve these problems. However, as the number of parallel units in the traced system increases, so does the amount of data generated in the trace. This problem also compounds when tracing distributed systems, where each individual node may have many-core processors. Trace data has to be analyzed by a trace analysis tool in order to extract significant metrics which can be used to resolve problems. However, current trace analysis tools are designed for serial analysis on a single thread. We therefore have an ever-widening gap between the amount of data produced in the trace and the speed at which we can analyze this data. This research explores the use of parallel processing to accelerate trace analysis. The aim is to develop an efficient and scalable parallel method for analyzing traces. We focus on traces in the CTF format, generated by the LTTng tracer on Linux. We present a solution which takes into account key factors of parallelization, such as good load balancing, low synchronization overhead, and efficient resolution of data dependencies. Our solution uses key aspects of the CTF trace format to create balanced, parallelizable workloads. We also propose an algorithm to detect and resolve data dependencies during trace analysis, with minimal locking and synchronization. Using this solution, we implement three trace analyses (counting events, measuring CPU time per process, and measuring the amount of data read and written per process), which we use to assess scalability in terms of parallel efficiency. Traces, being potentially very large, cannot be kept entirely in memory and must be read from disk. To assess the effect of storage-device speed on the parallel implementation of trace analysis, we created a program that simulates the CPU and I/O workloads typical of trace analysis. We then benchmarked this program on various storage devices (e.g., HDDs and SSDs) to show that parallel trace analysis is not seriously hindered by I/O-boundedness, especially with modern storage hardware. We also used this program to assess the effect of future improvements in trace decoding on the analysis.
    Our solution shows parallel efficiency above 56% at up to 32 cores when running the parallel trace analyses, which translates to a speedup of 18× over serial execution. Furthermore, benchmarks of the simulation program confirm that these efficiencies are not seriously affected by disk I/O on solid-state devices, even with faster trace decoding. Some factors limiting scalability stem from the serial design of the trace-reading library and can be fixed by a redesign, while others come from bottlenecks in the kernel’s memory management and could be improved or worked around.
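    For reference, the reported figures match the usual definition of parallel efficiency (a standard formula, not specific to this thesis): with speedup S(p) on p cores,

        E(p) = \frac{S(p)}{p} = \frac{T_1}{p \, T_p}, \qquad E(32) = \frac{18}{32} \approx 0.56 \ (56\%)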

    Diverse Intrusion-tolerant Systems

    Over the past 20 years, there have been indisputable advances in the development of Byzantine Fault-Tolerant (BFT) replicated systems. These systems preserve operational safety as long as at most f out of n replicas fail simultaneously. To maintain correctness, it is therefore assumed that replicas do not suffer common-mode failures, or in other words that replicas fail independently. In an adversarial setting, this requires that replicas do not share similar vulnerabilities; otherwise a single exploit could compromise a significant part of the system. This thesis investigates how this assumption can be substantiated in practice by exploiting diversity when managing the configurations of replicas. The thesis begins with an analysis of a large dataset of vulnerability information to gather evidence that diversity can contribute to failure independence. In particular, we used data from a vulnerability database to devise strategies for building groups of n replicas with different Operating Systems (OSes). Our results demonstrate that it is possible to create dependable configurations of OSes that do not share vulnerabilities over reasonable periods of time (i.e., a few years). The thesis then proposes a new design for a firewall-like service that protects and regulates access to critical systems, and that could benefit from our diversity-management approach. The solution provides fault and intrusion tolerance through an architecture based on two filtering layers, enabling efficient removal of invalid messages at early stages to decrease the cost of BFT replication in the later stages. The thesis also presents a novel solution for managing diverse replicas. It collects and processes data from several data sources to continuously compute a risk metric. When the risk increases, the solution replaces a potentially vulnerable replica with another one, trying to maximize the failure independence of the replicated service. The replaced replica is then quarantined and updated with the available patches, to be prepared for later re-use. We devised various experiments that show the dependability gains and performance impact of our prototype, including key benchmarks and three BFT applications (a key-value store, our firewall-like service, and a blockchain). Acknowledgments: LASIGE research unit (UID/CEC/00408/2019) and project PTDC/EEI-SCR/1741/2041 (Abyss).
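    To make the diversity-management idea concrete, here is an illustrative sketch of one way to select OS-diverse replicas: greedily add the OS that shares the fewest vulnerabilities with those already chosen. The pair counts are invented for illustration; the thesis derives such data from a real vulnerability database, and its actual strategies may differ.

        # Illustrative sketch: greedy selection of n OS-diverse replicas.
        SHARED_VULNS = {  # invented pairwise shared-vulnerability counts
            ("Debian", "Ubuntu"): 40, ("Debian", "FreeBSD"): 3,
            ("Debian", "OpenBSD"): 2, ("Debian", "Windows"): 1,
            ("Ubuntu", "FreeBSD"): 4, ("Ubuntu", "OpenBSD"): 2,
            ("Ubuntu", "Windows"): 1, ("FreeBSD", "OpenBSD"): 6,
            ("FreeBSD", "Windows"): 1, ("OpenBSD", "Windows"): 1,
        }

        def overlap(a, b):
            return SHARED_VULNS.get((a, b), SHARED_VULNS.get((b, a), 0))

        def pick_diverse(candidates, n):
            chosen = [candidates[0]]
            while len(chosen) < n:
                # Add the OS sharing the fewest vulnerabilities with the chosen set.
                best = min((c for c in candidates if c not in chosen),
                           key=lambda c: sum(overlap(c, x) for x in chosen))
                chosen.append(best)
            return chosen

        print(pick_diverse(["Debian", "Ubuntu", "FreeBSD", "OpenBSD", "Windows"], 4))
        # -> ['Debian', 'Windows', 'OpenBSD', 'FreeBSD']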

    SoK: Understanding BFT Consensus in the Age of Blockchains

    Blockchain, as an enabler of current Internet infrastructure, has provided many unique features and pushed distributed systems into a new era. Its decentralization, immutability, and transparency have led many applications to adopt the design philosophy of blockchain and to customize various replicated solutions. Under the hood of blockchain, consensus protocols play the most important role in achieving distributed replication. The distributed-systems community has extensively studied the technical components of consensus for reaching agreement among a group of nodes. Because of trust issues and the existence of various faults, it is hard to design resilient systems for practical settings. Byzantine fault-tolerant (BFT) state machine replication (SMR) is regarded as an ideal candidate because it can tolerate arbitrary faulty behaviors. However, the inherent complexity of BFT consensus protocols and their rapid evolution make them hard to adapt to application domains in practice. Many excellent Byzantine-based replicated solutions and ideas have been contributed to improve performance, availability, or resource efficiency. This paper conducts a systematic and comprehensive study of BFT consensus protocols with a specific focus on the blockchain era. We explore both general principles and practical schemes for achieving consensus under Byzantine settings. We then survey, compare, and categorize the state-of-the-art solutions to understand BFT consensus in detail. For each representative protocol, we discuss in depth its most important architectural building blocks as well as the key techniques it uses. We aim for this paper to give system researchers and developers a concrete view of the current design landscape and to help them find solutions to concrete problems. Finally, we present several critical challenges and potential research directions to advance research on BFT consensus protocols in the age of blockchains.
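    For background, the protocols surveyed operate under the standard Byzantine fault bound (a textbook result, not a contribution of this paper): to tolerate f Byzantine replicas under partial synchrony, BFT SMR needs

        n \ge 3f + 1 \quad \text{replicas, with quorums of size } 2f + 1 \text{ when } n = 3f + 1.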

    19th SC@RUG 2022 proceedings 2021-2022
