31 research outputs found

    Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

    Full text link
    We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random hashing to maintain the frequency distribution of the data steam using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.Comment: Submitted to IEEE Transactions on Knowledge and Data Engineering on 07/2020. Revised on 05/202

    Achieving Energy Efficiency on Networking Systems with Optimization Algorithms and Compressed Data Structures

    Get PDF
    To cope with the increasing quantity, capacity and energy consumption of transmission and routing equipment in the Internet, energy efficiency of communication networks has attracted more and more attention from researchers around the world. In this dissertation, we proposed three methodologies to achieve energy efficiency on networking devices: the NP-complete problems and heuristics, the compressed data structures, and the combination of the first two methods. We first consider the problem of achieving energy efficiency in Data Center Networks (DCN). We generalize the energy efficiency networking problem in data centers as optimal flow assignment problems, which is NP-complete, and then propose a heuristic called CARPO, a correlation-aware power optimization algorithm, that dynamically consolidate traffic flows onto a small set of links and switches in a DCN and then shut down unused network devices for power savings. We then achieve energy efficiency on Internet routers by using the compressive data structure. A novel data structure called the Probabilistic Bloom Filter (PBF), which extends the classical bloom filter into the probabilistic direction, so that it can effectively identify heavy hitters with a small memory foot print to reduce energy consumption of network measurement. To achieve energy efficiency on Wireless Sensor Networks (WSN), we developed one data collection protocol called EDAL, which stands for Energy-efficient Delay-aware Lifetime-balancing data collection. Based on the Open Vehicle Routing problem, EDAL exploits the topology requirements of Compressive Sensing (CS), then implement CS to save more energy on sensor nodes

    Change Management Systems for Seamless Evolution in Data Centers

    Get PDF
    Revenue for data centers today is highly dependent on the satisfaction of their enterprise customers. These customers often require various features to migrate their businesses and operations to the cloud. Thus, clouds today introduce new features at a swift pace to onboard new customers and to meet the needs of existing ones. This pace of innovation continues to grow on super linearly, e.g., Amazon deployed 1400 new features in 2017. However, such a rapid pace of evolution adds complexities both for users and the cloud. Clouds struggle to keep up with the deployment speed, and users struggle to learn which features they need and how to use them. The pace of these evolutions has brought us to a tipping point: we can no longer use rules of thumb to deploy new features, and customers need help to identify which features they need. We have built two systems: Janus and Cherrypick, to address these problems. Janus helps data center operators roll out new changes to the data center network. It automatically adapts to the data center topology, routing, traffic, and failure settings. The system reduces the risk of new deployments for network operators as they can now pick deployment strategies which are less likely to impact users’ performance. Cherrypick finds near-optimal cloud configurations for big data analytics. It adapts allows users to search through the new machine types the clouds are constantly introducing and find ones with a near-optimal performance that meets their budget. Cherrypick can adapt to new big-data frameworks and applications as well as the new machine types the clouds are constantly introducing. As the pace of cloud innovations increases, it is critical to have tools that allow operators to deploy new changes as well as those that would enable users to adapt to achieve good performance at low cost. The tools and algorithms discussed in this thesis help accomplish these goals

    Rethinking Routing and Peering in the era of Vertical Integration of Network Functions

    Get PDF
    Content providers typically control the digital content consumption services and are getting the most revenue by implementing an all-you-can-eat model via subscription or hyper-targeted advertisements. Revamping the existing Internet architecture and design, a vertical integration where a content provider and access ISP will act as unibody in a sugarcane form seems to be the recent trend. As this vertical integration trend is emerging in the ISP market, it is questionable if existing routing architecture will suffice in terms of sustainable economics, peering, and scalability. It is expected that the current routing will need careful modifications and smart innovations to ensure effective and reliable end-to-end packet delivery. This involves new feature developments for handling traffic with reduced latency to tackle routing scalability issues in a more secure way and to offer new services at cheaper costs. Considering the fact that prices of DRAM or TCAM in legacy routers are not necessarily decreasing at the desired pace, cloud computing can be a great solution to manage the increasing computation and memory complexity of routing functions in a centralized manner with optimized expenses. Focusing on the attributes associated with existing routing cost models and by exploring a hybrid approach to SDN, we also compare recent trends in cloud pricing (for both storage and service) to evaluate whether it would be economically beneficial to integrate cloud services with legacy routing for improved cost-efficiency. In terms of peering, using the US as a case study, we show the overlaps between access ISPs and content providers to explore the viability of a future in terms of peering between the new emerging content-dominated sugarcane ISPs and the healthiness of Internet economics. To this end, we introduce meta-peering, a term that encompasses automation efforts related to peering – from identifying a list of ISPs likely to peer, to injecting control-plane rules, to continuous monitoring and notifying any violation – one of the many outcroppings of vertical integration procedure which could be offered to the ISPs as a standalone service

    A Macroscopic Study of Network Security Threats at the Organizational Level.

    Full text link
    Defenders of today's network are confronted with a large number of malicious activities such as spam, malware, and denial-of-service attacks. Although many studies have been performed on how to mitigate security threats, the interaction between attackers and defenders is like a game of Whac-a-Mole, in which the security community is chasing after attackers rather than helping defenders to build systematic defensive solutions. As a complement to these studies that focus on attackers or end hosts, this thesis studies security threats from the perspective of the organization, the central authority that manages and defends a group of end hosts. This perspective provides a balanced position to understand security problems and to deploy and evaluate defensive solutions. This thesis explores how a macroscopic view of network security from an organization's perspective can be formed to help measure, understand, and mitigate security threats. To realize this goal, we bring together a broad collection of reputation blacklists. We first measure the properties of the malicious sources identified by these blacklists and their impact on an organization. We then aggregate the malicious sources to Internet organizations and characterize the maliciousness of organizations and their evolution over a period of two and half years. Next, we aim to understand the cause of different maliciousness levels in different organizations. By examining the relationship between eight security mismanagement symptoms and the maliciousness of organizations, we find a strong positive correlation between mismanagement and maliciousness. Lastly, motivated by the observation that there are organizations that have a significant fraction of their IP addresses involved in malicious activities, we evaluate the tradeoff of one type of mitigation solution at the organization level --- network takedowns.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/116714/1/jingzj_1.pd

    On the dynamics of interdomain routing in the Internet

    Full text link
    The routes used in the Internet's interdomain routing system are a rich information source that could be exploited to answer a wide range of questions.  However, analyzing routes is difficult, because the fundamental object of study is a set of paths. In this dissertation, we present new analysis tools -- metrics and methods -- for analyzing paths, and apply them to study interdomain routing in the Internet over long periods of time. Our contributions are threefold. First, we build on an existing metric (Routing State Distance) to define a new metric that allows us to measure the similarity between two prefixes with respect to the state of the global routing system. Applying this metric over time yields a measure of how the set of paths to each prefix varies at a given timescale. Second, we present PathMiner, a system to extract large scale routing events from background noise and identify the AS (Autonomous System) or AS-link most likely responsible for the event. PathMiner is distinguished from previous work in its ability to identify and analyze large-scale events that may re-occur many times over long timescales. We show that it is scalable, being able to extract significant events from multiple years of routing data at a daily granularity. Finally, we equip Routing State Distance with a new set of tools for identifying and characterizing unusually-routed ASes. At the micro level, we use our tools to identify clusters of ASes that have the most unusual routing at each time. We also show that analysis of individual ASes can expose business and engineering strategies of the organizations owning the ASes.  These strategies are often related to content delivery or service replication. At the macro level, we show that the set of ASes with the most unusual routing defines discernible and interpretable phases of the Internet's evolution. Furthermore, we show that our tools can be used to provide a quantitative measure of the "flattening" of the Internet

    Improving the accuracy of spoofed traffic inference in inter-domain traffic

    Get PDF
    Ascertaining that a network will forward spoofed traffic usually requires an active probing vantage point in that network, effectively preventing a comprehensive view of this global Internet vulnerability. We argue that broader visibility into the spoofing problem may lie in the capability to infer lack of Source Address Validation (SAV) compliance from large, heavily aggregated Internet traffic data, such as traffic observable at Internet Exchange Points (IXPs). The key idea is to use IXPs as observatories to detect spoofed packets, by leveraging Autonomous System (AS) topology knowledge extracted from Border Gateway Protocol (BGP) data to infer which source addresses should legitimately appear across parts of the IXP switch fabric. In this thesis, we demonstrate that the existing literature does not capture several fundamental challenges to this approach, including noise in BGP data sources, heuristic AS relationship inference, and idiosyncrasies in IXP interconnec- tivity fabrics. We propose Spoofer-IX, a novel methodology to navigate these challenges, leveraging Customer Cone semantics of AS relationships to guide precise classification of inter-domain traffic as In-cone, Out-of-cone ( spoofed ), Unverifiable, Bogon, and Unas- signed. We apply our methodology on extensive data analysis using real traffic data from two distinct IXPs in Brazil, a mid-size and a large-size infrastructure. In the mid-size IXP with more than 200 members, we find an upper bound volume of Out-of-cone traffic to be more than an order of magnitude less than the previous method inferred on the same data, revealing the practical importance of Customer Cone semantics in such analysis. We also found no significant improvement in deployment of SAV in networks using the mid-size IXP between 2017 and 2019. In hopes that our methods and tools generalize to use by other IXPs who want to avoid use of their infrastructure for launching spoofed-source DoS attacks, we explore the feasibility of scaling the system to larger and more diverse IXP infrastructures. To promote this goal, and broad replicability of our results, we make the source code of Spoofer-IX publicly available. This thesis illustrates the subtleties of scientific assessments of operational Internet infrastructure, and the need for a community focus on reproducing and repeating previous methods.A constatação de que uma rede encaminhará tráfego falsificado geralmente requer um ponto de vantagem ativo de medição nessa rede, impedindo efetivamente uma visão abrangente dessa vulnerabilidade global da Internet. Isto posto, argumentamos que uma visibilidade mais ampla do problema de spoofing pode estar na capacidade de inferir a falta de conformidade com as práticas de Source Address Validation (SAV) a partir de dados de tráfego da Internet altamente agregados, como o tráfego observável nos Internet Exchange Points (IXPs). A ideia chave é usar IXPs como observatórios para detectar pacotes falsificados, aproveitando o conhecimento da topologia de sistemas autônomos extraído dos dados do protocolo BGP para inferir quais endereços de origem devem aparecer legitimamente nas comunicações através da infra-estrutura de um IXP. Nesta tese, demonstramos que a literatura existente não captura diversos desafios fundamentais para essa abordagem, incluindo ruído em fontes de dados BGP, inferência heurística de relacionamento de sistemas autônomos e características específicas de interconectividade nas infraestruturas de IXPs. Propomos o Spoofer-IX, uma nova metodologia para superar esses desafios, utilizando a semântica do Customer Cone de relacionamento de sistemas autônomos para guiar com precisão a classificação de tráfego inter-domínio como In-cone, Out-of-cone ( spoofed ), Unverifiable, Bogon, e Unassigned. Aplicamos nossa metodologia em análises extensivas sobre dados reais de tráfego de dois IXPs distintos no Brasil, uma infraestrutura de médio porte e outra de grande porte. No IXP de tamanho médio, com mais de 200 membros, encontramos um limite superior do volume de tráfego Out-of-cone uma ordem de magnitude menor que o método anterior inferiu sob os mesmos dados, revelando a importância prática da semântica do Customer Cone em tal análise. Além disso, não encontramos melhorias significativas na implantação do Source Address Validation (SAV) em redes usando o IXP de tamanho médio entre 2017 e 2019. Na esperança de que nossos métodos e ferramentas sejam aplicáveis para uso por outros IXPs que desejam evitar o uso de sua infraestrutura para iniciar ataques de negação de serviço através de pacotes de origem falsificada, exploramos a viabilidade de escalar o sistema para infraestruturas IXP maiores e mais diversas. Para promover esse objetivo e a ampla replicabilidade de nossos resultados, disponibilizamos publicamente o código fonte do Spoofer-IX. Esta tese ilustra as sutilezas das avaliações científicas da infraestrutura operacional da Internet e a necessidade de um foco da comunidade na reprodução e repetição de métodos anteriores

    Sketching as a Tool for Efficient Networked Systems

    Get PDF
    Today, computer systems need to cope with the explosive growth of data in the world. For instance, in data-center networks, monitoring systems are used to measure traffic statistics at high speed; and in financial technology companies, distributed processing systems are deployed to support graph analytics. To fulfill the requirements of handling such large datasets, we build efficient networked systems in a distributed manner most of the time. Ideally, we expect the systems to meet service-level objectives (SLOs) using the least amount of resource. However, existing systems constructed with conventional in-memory algorithms face the following challenges: (1) excessive resource requirements (e.g., CPU, ASIC, and memory) with high cost; (2) infeasibility in a larger scale; (3) processing the data too slowly to meet the objectives. To address these challenges, we propose sketching techniques as a tool to build more efficient networked systems. Sketching algorithms aim to process the data with one or several passes in an online, streaming fashion (e.g., a stream of network packets), and compute highly accurate results. With sketching, we only maintain a compact summary of the entire data and provide theoretical guarantees on error bounds. This dissertation argues for a sketching based design for large-scale networked systems, and demonstrates the benefits in three application contexts: (i) Network monitoring: we build generic monitoring frameworks that support a range of applications on both software and hardware with universal sketches. (ii) Graph pattern mining: we develop a swift, approximate graph pattern miner that scales to very large graphs by leveraging graph sketching techniques. (iii) Halo finding in N-body simulations: we design scalable halo finders on CPU and GPU by leveraging sketch-based heavy hitter algorithms

    Provider and peer selection in the evolving internet ecosystem

    Get PDF
    The Internet consists of thousands of autonomous networks connected together to provide end-to-end reachability. Networks of different sizes, and with different functions and business objectives, interact and co-exist in the evolving "Internet Ecosystem". The Internet ecosystem is highly dynamic, experiencing growth (birth of new networks), rewiring (changes in the connectivity of existing networks), as well as deaths (of existing networks). The dynamics of the Internet ecosystem are determined both by external "environmental" factors (such as the state of the global economy or the popularity of new Internet applications) and the complex incentives and objectives of each network. These dynamics have major implications on how the future Internet will look like. How does the Internet evolve? What is the Internet heading towards, in terms of topological, performance, and economic organization? How do given optimization strategies affect the profitability of different networks? How do these strategies affect the Internet in terms of topology, economics, and performance? In this thesis, we take some steps towards answering the above questions using a combination of measurement and modeling approaches. We first study the evolution of the Autonomous System (AS) topology over the last decade. In particular, we classify ASes and inter-AS links according to their business function, and study separately their evolution over the last 10 years. Next, we focus on enterprise customers and content providers at the edge of the Internet, and propose algorithms for a stub network to choose its upstream providers to maximize its utility (either monetary cost, reliability or performance). Third, we develop a model for interdomain network formation, incorporating the effects of economics, geography, and the provider/peer selections strategies of different types of networks. We use this model to examine the "outcome" of these strategies, in terms of the topology, economics and performance of the resulting internetwork. We also investigate the effect of external factors, such as the nature of the interdomain traffic matrix, customer preferences in provider selection, and pricing/cost structures. Finally, we focus on a recent trend due to the increasing amount of traffic flowing from content providers (who generate content), to access providers (who serve end users). This has led to a tussle between content providers and access providers, who have threatened to prioritize certain types of traffic, or charge content providers directly -- strategies that are viewed as violations of "network neutrality". In our work, we evaluate various pricing and connection strategies that access providers can use to remain profitable without violating network neutrality.Ph.D.Committee Chair: Dovrolis, Constantine; Committee Member: Ammar, Mostafa; Committee Member: Feamster, Nick; Committee Member: Willinger, Walter; Committee Member: Zegura, Elle
    corecore