26 research outputs found

    Divide and recombine: Autoregressive models and STL+

    In this thesis, multiple methods are proposed and applied to the Akamai CIDR time series data. The Akamai network is one of the world's largest distributed-computing platforms, with more than 250,000 servers in more than 80 countries. It is responsible for 15-20 percent of all web traffic. We obtained 110 GB of raw CIDR data over an 18-month period, collected on the Akamai network from November 2011 to April 2013. The Seasonal-Trend Decomposition procedure based on loess (STL+) is used to model the CIDR series. Motivated by the CIDR series analysis, we propose a general prediction-based model selection procedure, where extensive visual diagnostics are part of the procedure for selecting the best-performing model. Factorial experimental designs are used to explore the parameter space. We evaluate the performance of different models for the CIDR series using our proposed prediction-based model selection procedure. Furthermore, the analysis and modeling of the CIDR series is performed under the Divide and Recombine framework for large data, and we conduct a theoretical Divide and Recombine time series estimation study. We also study the performance of Divide and Recombine estimates for Gaussian auto-regressive time series, Gaussian long-range dependent series, and auto-regressive series with tails heavier than Gaussian.

    Detection of Sparse Anomalies in High-Dimensional Network Telescope Signals

    Network operators and system administrators are increasingly overwhelmed with incessant cyber-security threats ranging from malicious network reconnaissance to attacks such as distributed denial of service and data breaches. A large number of these attacks could be prevented if the network operators were better equipped with threat intelligence information that would allow them to block or throttle nefarious scanning activities. Network telescopes or "darknets" offer a unique window into observing Internet-wide scanners and other malicious entities, and they could offer early warning signals to operators that would be critical for infrastructure protection and/or attack mitigation. A network telescope consists of unused or "dark" IP spaces that serve no users, and solely passively observes any Internet traffic destined to the "telescope sensor" in an attempt to record ubiquitous network scanners, malware that forage for vulnerable devices, and other dubious activities. Hence, monitoring network telescopes for timely detection of coordinated and heavy scanning activities is an important, albeit challenging, task. The challenges mainly arise due to the non-stationarity and the dynamic nature of Internet traffic and, more importantly, the fact that one needs to monitor high-dimensional signals (e.g., all TCP/UDP ports) to search for "sparse" anomalies. We propose statistical methods to address both challenges in an efficient and "online" manner; our work is validated both with synthetic data as well as real-world data from a large network telescope

    From the edge to the core : towards informed vantage point selection for internet measurement studies

    Since the early days of the Internet, measurement scientists have been trying to keep up with its fast-paced development. As the Internet grew organically over time and without built-in measurability, this process requires many workarounds and due diligence. As a result, every measurement study is only as good as the data it relies on. Moreover, data quality is relative to the research question: a data set suitable for analyzing one problem may be insufficient for another. This is entirely expected as the Internet is decentralized, i.e., there is no single observation point from which we can assess the complete state of the Internet. Because of that, every measurement study needs specifically selected vantage points that fit the research question. In this thesis, we present three different vantage points across the Internet topology, from the edge to the Internet core. We discuss their specific features, their suitability for different kinds of research questions, and how to work with the corresponding data. The data sets obtained at the presented vantage points allow us to conduct three different measurement studies and shed light on the following aspects: (a) the prevalence of IP source address spoofing at a large European Internet Exchange Point (IXP), (b) the propagation distance of BGP communities, an optional transitive BGP attribute used for traffic engineering, and (c) the impact of the global COVID-19 pandemic on Internet usage behavior at a large Internet Service Provider (ISP) and three IXPs.

    Improving the accuracy of spoofed traffic inference in inter-domain traffic

    Ascertaining that a network will forward spoofed traffic usually requires an active probing vantage point in that network, effectively preventing a comprehensive view of this global Internet vulnerability. We argue that broader visibility into the spoofing problem may lie in the capability to infer lack of Source Address Validation (SAV) compliance from large, heavily aggregated Internet traffic data, such as traffic observable at Internet Exchange Points (IXPs). The key idea is to use IXPs as observatories to detect spoofed packets, by leveraging Autonomous System (AS) topology knowledge extracted from Border Gateway Protocol (BGP) data to infer which source addresses should legitimately appear across parts of the IXP switch fabric. In this thesis, we demonstrate that the existing literature does not capture several fundamental challenges to this approach, including noise in BGP data sources, heuristic AS relationship inference, and idiosyncrasies in IXP interconnectivity fabrics. We propose Spoofer-IX, a novel methodology to navigate these challenges, leveraging Customer Cone semantics of AS relationships to guide precise classification of inter-domain traffic as In-cone, Out-of-cone (spoofed), Unverifiable, Bogon, and Unassigned. We apply our methodology on extensive data analysis using real traffic data from two distinct IXPs in Brazil, a mid-size and a large-size infrastructure. In the mid-size IXP with more than 200 members, we find an upper bound volume of Out-of-cone traffic to be more than an order of magnitude less than the previous method inferred on the same data, revealing the practical importance of Customer Cone semantics in such analysis. We also found no significant improvement in deployment of SAV in networks using the mid-size IXP between 2017 and 2019.
In hopes that our methods and tools generalize to use by other IXPs who want to avoid use of their infrastructure for launching spoofed-source DoS attacks, we explore the feasibility of scaling the system to larger and more diverse IXP infrastructures. To promote this goal, and broad replicability of our results, we make the source code of Spoofer-IX publicly available. This thesis illustrates the subtleties of scientific assessments of operational Internet infrastructure, and the need for a community focus on reproducing and repeating previous methods.

    System designs for bulk and user-generated content delivery in the internet

    This thesis proposes and evaluates new system designs to support two emerging Internet workloads: (a) bulk content, such as downloads of large media and scientific libraries, and (b) user-generated content (UGC), such as photos and videos that users share online, typically on online social networks (OSNs). Bulk content accounts for a large and growing fraction of today's Internet traffic. Due to the high cost of bandwidth, delivering bulk content in the Internet is expensive. To reduce the cost of bulk transfers, I proposed traffic shaping and scheduling designs that exploit the delay-tolerant nature of bulk transfers to allow ISPs to deliver bulk content opportunistically. I evaluated my proposals through software prototypes and simulations driven by real-world traces from commercial and academic ISPs and found that they result in considerable reductions in transit costs or increased link utilization. The amount of user-generated content (UGC) that people share online has been rapidly growing in the past few years. Most users share UGC using online social networking websites (OSNs), which can impose arbitrary terms of use, privacy policies, and limitations on the content shared on their websites. To solve this problem, I evaluated the feasibility of a system that allows users to share UGC directly from the home, thus enabling them to regain control of the content that they share online. Using data from popular OSN websites and a testbed deployed in 10 households, I showed that current trends bode well for the delivery of personal UGC from users' homes.
I also designed and deployed Stratus, a prototype system that uses home gateways to share UGC directly from the home.

    Anomaly detection in SCADA systems: a network based approach

    Supervisory Control and Data Acquisition (SCADA) networks are commonly deployed to aid the operation of large industrial facilities, such as water treatment facilities. Historically, these networks were composed of special-purpose embedded devices communicating through proprietary protocols. However, modern deployments commonly make use of commercial off-the-shelf devices and standard communication protocols, such as TCP/IP. Furthermore, these networks are becoming increasingly interconnected, allowing communication with corporate networks and even the Internet. As a result, SCADA networks become vulnerable to cyber attacks, being exposed to the same threats that plague traditional IT systems. In our view, measurements play an essential role in validating results in network research; therefore, our first objective is to understand how SCADA networks are utilized in practice. To this end, we provide the first comprehensive analysis of real-world SCADA traffic. We analyze five network packet traces collected at four different critical infrastructures: two water treatment facilities, one gas utility, and one electricity and gas utility. We show, for instance, that existing network traffic models developed for traditional IT networks cannot be directly applied to SCADA network traffic. We also confirm two SCADA traffic characteristics: the stable connection matrix and the traffic periodicity, and propose two intrusion detection approaches that exploit them. In order to exploit the stable connection matrix, we investigate the use of whitelists at the flow level. We show that flow whitelists have a manageable size, considering the number of hosts in the network, and that it is possible to overcome the main sources of instability in the whitelists. In order to exploit the traffic periodicity, we focus our attention on connections used to retrieve data from devices in the field network.
We propose PeriodAnalyzer, an approach that uses deep packet inspection to automatically identify the different messages and the frequency at which they are issued. Once such normal behavior is learned, PeriodAnalyzer can be used to detect data injection and Denial of Service attacks
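The flow-level whitelist idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Flow` key (source, destination, protocol, destination port) and the function names `learn_whitelist` and `find_violations` are assumptions made for illustration, and the thesis may define flows and handle whitelist instability differently.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Flow(NamedTuple):
    """Hypothetical flow key; the fields are an assumption, not taken from the thesis."""
    src: str
    dst: str
    proto: str
    dst_port: int


def learn_whitelist(training_flows: Iterable[Flow]) -> FrozenSet[Flow]:
    """Collect every distinct flow seen during a training period assumed to be attack-free."""
    return frozenset(training_flows)


def find_violations(whitelist: FrozenSet[Flow], observed: Iterable[Flow]) -> List[Flow]:
    """Return observed flows that are absent from the whitelist, in order of appearance."""
    return [flow for flow in observed if flow not in whitelist]
```

A whitelist of this kind only stays small when the set of communicating pairs is stable, which is the traffic property the abstract says it exploits.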

    Characterising the Social Media Temporal Response to External Events

    In recent years social media has become a crucial component of online information propagation. It is one of the fastest responding mediums to offline events, significantly faster than traditional news services. Popular social media posts can spread rapidly through the internet, potentially spreading misinformation and affecting human beliefs and behaviour. The nature of how social media responds allows inference about events themselves and provides insight into human behavioural characteristics. However, despite its importance, researchers don’t have a strong understanding of the temporal dynamics of this information flow. This thesis aims to improve understanding of the temporal relationship between events, news and associated social media activity. We do this by examining the temporal Twitter response to stimuli for various case studies, primarily based around politics and sporting events. The first part of the thesis focuses on the relationships between Twitter and news media. Using Granger causality, we provide evidence that the social media reaction to events is faster than the traditional news reaction. We also consider how accurately tweet and news volumes can be predicted, given other variables. The second part of the thesis examines information cascades. We show that the decay of retweet rates is well-modelled as a power law with exponential cutoff, providing a better model than the widely used power law. This finding, explained using human prioritisation of tasks, then allows the development of a method to estimate the size of a retweet cascade. The third major part of the thesis concerns tweet clustering methods in response to events. We examine how the likelihood that two tweets are related varies, given the time difference between them, and use this finding to create a clustering method using both textual and temporal information. 
We also develop a method to estimate the time of the event that caused the corresponding social media reaction. Thesis (Ph.D.) -- University of Adelaide, School of Mathematical Sciences, 201
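For orientation, the two decay models contrasted in this abstract have the following generic forms. The symbols are assumptions introduced here, not the thesis's notation: $r(t)$ is the retweet rate at time $t$ after the original tweet, $\alpha > 0$ is the power-law exponent, and $\tau > 0$ is the cutoff time scale.

```latex
% Pure power law (the widely used baseline)
r(t) \propto t^{-\alpha}

% Power law with exponential cutoff (the form the abstract reports as the better fit)
r(t) \propto t^{-\alpha} \, e^{-t/\tau}
```

For $t \ll \tau$ the two forms nearly coincide, while for $t \gg \tau$ the exponential factor dominates and the rate falls off much faster than a pure power law.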

    Internet traffic volumes characterization and forecasting

    Internet usage increases every year and the need to estimate the growth of the generated traffic has become a major topic. Forecasting actual figures in advance is essential for bandwidth allocation, network design and investment planning. In this thesis novel mathematical equations are presented to model and to predict long-term Internet traffic in terms of total aggregate volume, globally and more locally. Historical traffic data from consecutive years have revealed hidden numerical patterns as the values progress year over year and this trend can be well represented with appropriate mathematical relations. The proposed formulae have excellent fitting properties over long-history measurements and can indicate forthcoming traffic for the next years with an exceptionally low prediction error. In cases where pending traffic data have already become available, the suggested equations provide more successful results than the respective projections that come from worldwide leading research. The studies also imply that future traffic strongly depends on the past activity and on the growth of Internet users, provided that a big and representative sample of pertinent data exists from large geographical areas. To the best of my knowledge this work is the first to introduce effective prediction methods that exclusively rely on the static attributes and the progression properties of historical values