536 research outputs found

    Efficient Identification of TOP-K Heavy Hitters over Sliding Windows

    This is the author accepted manuscript; the final version is available from Springer Verlag via the DOI in this record. Due to the increasing volume of network traffic and the growing complexity of network environments, rapid identification of heavy hitters is quite challenging. Dealing with massive data streams in real time requires an accurate and scalable solution. The traditional method of keeping an individual counter for each host across the whole data stream is very resource-consuming. This paper presents a new data structure called FCM and its associated algorithms. FCM combines the count-min sketch with the stream-summary structure for efficient top-k heavy hitter identification in one pass. The key point of the algorithm is a novel filter-and-jump mechanism. Given that Internet traffic is heavy-tailed and that low-frequency hosts account for the majority of IP addresses, FCM periodically filters the mice from the input streams to improve the accuracy of top-k heavy hitter identification. Moreover, since abnormal events are time sensitive, the algorithm automatically adjusts its measurement window to the newly arrived elements in the data stream. Our experimental results demonstrate that the performance of FCM is superior to previous related algorithms, and the solution has good prospects for application in advanced network environments. This work was supported by the Chinese Academy of Sciences and the National Natural Science Foundation of China.
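
    The abstract does not give FCM's internals, but the two building blocks it names are standard. Below is a minimal Python sketch of a count-min sketch feeding a small top-k candidate summary; it shows only the generic one-pass pattern, not FCM's filter-and-jump mechanism or its adaptive window, and all names and parameters are illustrative.

```python
import heapq
import random

class CountMinSketch:
    """Approximate frequency counts in sublinear space (generic, not FCM itself)."""
    def __init__(self, width=2048, depth=4, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def _buckets(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.tables[row][col] += 1

    def estimate(self, item):
        # Count-min only overestimates; the minimum across rows is tightest.
        return min(self.tables[row][col] for row, col in self._buckets(item))

def top_k_heavy_hitters(stream, k=3):
    """One-pass top-k tracking: CMS for counts, a small dict as the summary.
    Candidate estimates may go slightly stale; this is a sketch, not FCM."""
    cms = CountMinSketch()
    candidates = {}                       # item -> latest estimated count
    for item in stream:
        cms.add(item)
        est = cms.estimate(item)
        if item in candidates or len(candidates) < k:
            candidates[item] = est
        else:
            # Replace the weakest candidate if the new item now beats it.
            weakest = min(candidates, key=candidates.get)
            if est > candidates[weakest]:
                del candidates[weakest]
                candidates[item] = est
    return sorted(candidates.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20 + [f"x{i}" for i in range(100)]
    random.shuffle(stream)
    print(top_k_heavy_hitters(stream, k=3))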

    Monitoring frequent items over distributed data streams.

    Many important applications require the discovery of items which have occurred frequently. Knowledge of these items is commonly used in anomaly detection and network monitoring tasks. Effective solutions for this problem focus mainly on reducing memory requirements in a centralized environment. These solutions, however, ignore the inherently distributed nature of many systems. Naively forwarding data to a centralized location is not practical when dealing with high-speed data streams and results in significant communication overhead. This thesis proposes a new approach designed for continuously tracking frequent items over distributed data streams, providing either exact or approximate answers. The method introduced is a direct modification of an existing communication-efficient algorithm called Top-K Monitoring. Experimental results demonstrate that the proposed modifications significantly reduce communication cost and improve scalability. Also examined in this thesis is the applicability of frequent-item monitoring to detecting distributed denial-of-service attacks. The proposed tracking method was simulated against four different attack patterns, with promising results compared to previous detection methods.
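
    The thesis's modified protocol is not reproduced in this abstract; the following is a hedged sketch of the general communication pattern such methods rely on, where each site stays silent until a local count has drifted past a slack threshold. `Site`, `Coordinator`, and `slack` are illustrative names of ours, not from the thesis, and the coordinator's counts are approximate within the slack.

```python
from collections import Counter

class Site:
    """One monitoring site: reports to the coordinator only when the local
    count of an item has grown by at least `slack` since the last report."""
    def __init__(self, slack=10):
        self.slack = slack
        self.counts = Counter()
        self.reported = Counter()

    def observe(self, item):
        self.counts[item] += 1
        delta = self.counts[item] - self.reported[item]
        if delta >= self.slack:
            self.reported[item] = self.counts[item]
            return (item, delta)          # message sent to the coordinator
        return None                       # stay silent: saves communication

class Coordinator:
    def __init__(self):
        self.global_counts = Counter()
        self.messages = 0

    def receive(self, msg):
        if msg is not None:
            item, delta = msg
            self.global_counts[item] += delta
            self.messages += 1

    def top_k(self, k):
        return self.global_counts.most_common(k)

if __name__ == "__main__":
    import random
    coord = Coordinator()
    sites = [Site(slack=5) for _ in range(4)]
    stream = ["a"] * 400 + ["b"] * 200 + [f"x{i % 50}" for i in range(400)]
    random.shuffle(stream)
    for i, item in enumerate(stream):
        coord.receive(sites[i % 4].observe(item))
    print(coord.top_k(2), "messages:", coord.messages, "of", len(stream))
```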

    A survey on online active learning

    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in recent decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work provides an overview of the most recently proposed approaches for selecting the most informative observations from data streams in the context of online active learning. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research, aiming at a comprehensive and up-to-date overview of the field that highlights directions for future work.
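
    As a concrete illustration of stream-based selection, here is a minimal sketch of margin-based uncertainty sampling over a stream with a fixed labeling budget. It assumes scikit-learn is available; the model choice, threshold, and budget are illustrative assumptions, not prescriptions from the survey.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def online_uncertainty_sampling(stream, labels, budget=50, threshold=0.15):
    """Query a label only if the model is uncertain (small probability margin)
    and the labeling budget is not exhausted; otherwise let the point pass."""
    clf = SGDClassifier(loss="log_loss")  # log loss enables predict_proba
    classes = np.unique(labels)
    seen, queried = 0, 0
    for x, y in zip(stream, labels):
        x = x.reshape(1, -1)
        seen += 1
        if queried == 0:
            # Must label the first point to initialize the model.
            clf.partial_fit(x, [y], classes=classes)
            queried += 1
            continue
        proba = clf.predict_proba(x)[0]
        top2 = np.sort(proba)[-2:]
        margin = top2[1] - top2[0]        # small margin = high uncertainty
        if margin < threshold and queried < budget:
            clf.partial_fit(x, [y])       # "query" the label and update
            queried += 1
    return clf, queried, seen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2)) + np.repeat([[0, 0], [3, 3]], 500, axis=0)
    y = np.repeat([0, 1], 500)
    idx = rng.permutation(1000)
    clf, queried, seen = online_uncertainty_sampling(X[idx], y[idx])
    print(f"queried {queried} labels out of {seen} points")
```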

    Distributed reservoir sampling algorithms for data pre-processing with use of Kafka Streams

    With the rapid growth of the Internet of Things (IoT), the number of connected devices expected to exceed 30 billion by 2020, and the consequent increase in transmitted data, large-scale data processing systems must use efficient algorithms in combination with programming libraries that are widely used in industry. This master's thesis analyzes and presents reservoir sampling algorithms and develops them using the Kafka Streams API in order to solve the problem of distributing them. By taking advantage of the specific characteristics of the API and the algorithms, we aim to implement a tool that helps analysts and experimenters in the IoT field preprocess data and quickly obtain results from a continuous data stream.
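
    The thesis's Kafka Streams implementation is not included in the abstract; the core primitive being distributed is classic reservoir sampling, which in a single-process sketch (Algorithm R) looks like this:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: maintain a uniform random sample of size k over a stream
    of unknown length, using O(k) memory and a single pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), k=5))
```

    In a distributed setting, one reservoir per stream partition can be maintained and the partial reservoirs merged by weighting each by the number of items its partition has seen; distributing this process is the problem the thesis addresses with Kafka Streams.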

    Fast Incremental Density-Based Clustering over Sliding Windows

    Ph.D. dissertation, Department of Computer Science and Engineering, College of Engineering, Seoul National University, August 2022 (advisor: Bongki Moon). Given the prevalence of mobile and IoT devices, continuous clustering of streaming data has become an essential tool of increasing importance for data analytics. Among the many clustering approaches, density-based clustering has garnered much attention due to its unique advantage that it can detect clusters of arbitrary shape in the presence of noise. However, when the clusters must be updated continuously along with an evolving input dataset, a relatively high computational cost is required; in particular, deleting data points from the clusters causes severe performance degradation. This dissertation addresses the performance limits of incremental density-based clustering over sliding windows and ultimately proposes two algorithms, DISC and DenForest. The first algorithm, DISC, is an incremental density-based clustering algorithm that efficiently produces the same clustering results as DBSCAN over sliding windows. It focuses on the redundancy issues that occur when updating clusters: when multiple data points are inserted or deleted individually, surrounding data points are explored and retrieved redundantly. DISC addresses these issues and improves performance by updating multiple points in a batch, and it also presents several optimization techniques. The second algorithm, DenForest, is an incremental density-based clustering algorithm that primarily focuses on the deletion process. Unlike previous methods that manage clusters as a graph, DenForest manages clusters as a group of spanning trees, which contributes to very efficient deletion performance, and it provides a batch-optimized technique to improve insertion performance as well. Extensive evaluations demonstrate that DISC and DenForest significantly outperform state-of-the-art density-based clustering algorithms.
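
    Neither DISC nor DenForest is specified in enough detail here to reproduce; what can be sketched is the naive bookkeeping they both improve on — a sliding window in which every insertion and expiry updates epsilon-neighbor sets, whose sizes determine core points in the DBSCAN sense. All names and parameters below are illustrative.

```python
from collections import deque
import math

class SlidingWindowCores:
    """Naive incremental tracking of DBSCAN core points over a sliding window
    (neither DISC nor DenForest; just the baseline bookkeeping they optimize)."""
    def __init__(self, window, eps, min_pts):
        self.window, self.eps, self.min_pts = window, eps, min_pts
        self.points = deque()             # (timestamp, point), oldest first
        self.neighbors = {}               # point id -> set of neighbor ids

    def _close(self, p, q):
        return math.dist(p, q) <= self.eps

    def insert(self, t, p):
        # Expire points that fall out of the window, updating neighbor sets.
        while self.points and self.points[0][0] <= t - self.window:
            _, old = self.points.popleft()
            for nb in self.neighbors.pop(id(old)):
                self.neighbors[nb].discard(id(old))
        # Link the new point to all current neighbors: the per-point work
        # that batch updates (DISC) and spanning trees (DenForest) reduce.
        self.neighbors[id(p)] = set()
        for _, q in self.points:
            if self._close(p, q):
                self.neighbors[id(p)].add(id(q))
                self.neighbors[id(q)].add(id(p))
        self.points.append((t, p))

    def core_count(self):
        # Core point: at least min_pts points (itself included) within eps.
        return sum(len(nb) >= self.min_pts - 1 for nb in self.neighbors.values())

if __name__ == "__main__":
    import random
    sw = SlidingWindowCores(window=100, eps=0.5, min_pts=4)
    for t in range(500):
        sw.insert(t, (random.uniform(0, 5), random.uniform(0, 5)))
    print("core points in window:", sw.core_count())
```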

    Click fraud: how to spot it, how to stop it?

    Online search advertising is currently the greatest source of revenue for many Internet giants such as Google™, Yahoo!™, and Bing™. The increased number of specialized websites and modern profiling techniques have all contributed to an explosion of ad brokers' income from online advertising. The single biggest threat to this growth, however, is click fraud. Trained botnets and even individuals are hired by click-fraud specialists to maximize the revenue certain users draw from the ads they publish on their websites, or to launch an attack between competing businesses. Most academics and consultants who study online advertising estimate that 15% to 35% of ads in pay-per-click (PPC) online advertising systems are not authentic. In the first two quarters of 2010, US marketers alone spent $5.7 billion on PPC ads, and PPC ads account for between 45 and 50 percent of all online ad spending; on average, about $1.5 billion is wasted due to click fraud. These fraudulent clicks are believed to be initiated by users in poor countries, or by botnets, trained to click on specific ads. For example, according to a 2010 study from Information Warfare Monitor, the operators of Koobface, a program that installed malicious software to participate in click fraud, made over $2 million in just over a year. The process of making such illegitimate clicks to generate revenue is called click fraud. Search engines claim they filter out most questionable clicks and either do not charge for them or reimburse advertisers who have been wrongly billed. However, this is a hard task, despite claims that brokers' efforts are satisfactory. In the simplest scenario, a publisher continuously clicks on the ads displayed on his own website in order to make revenue. In a more complicated scenario, a travel agent may hire a large, globally distributed botnet to click on its competitor's ads, thereby depleting their daily budget. We analyzed these different types of click fraud methods and proposed new methodologies to detect and prevent them in real time. While traditional commercial approaches detect only some specific types of click fraud, the Collaborative Click Fraud Detection and Prevention (CCFDP) system, an architecture that we have implemented based on the proposed methodologies, can detect and prevent all major types of click fraud. The proposed solution analyzes detailed user activity on both the server side and the client side collaboratively to better describe the intention of the click. Data fusion techniques are developed to combine evidence from several data mining models and to obtain a better estimate of the quality of the click traffic. Experimental results show that the CCFDP system is better than an existing commercial click fraud solution in three major aspects: 1) it detects more click fraud, especially clicks generated by software; 2) it provides prevention ability; and 3) it proposes the concept of a click quality score for click quality estimation. In the initial version of CCFDP, we analyzed the performance of the click fraud detection and prediction model using a rule-based algorithm, which is similar to most existing systems.
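
    The abstract mentions fusing evidence from several data mining models into a single click quality score, but CCFDP's actual fusion rules are not given. A minimal illustrative sketch of one such fusion, weighted log-odds averaging, is shown below; this combination rule is an assumption of ours, not the system's documented method.

```python
import math

def quality_score(model_scores, weights=None):
    """Fuse per-model fraud probabilities into one click quality score in [0, 1]
    via weighted log-odds averaging (an illustrative choice, not CCFDP's rule)."""
    if weights is None:
        weights = [1.0] * len(model_scores)
    eps = 1e-6                            # guard against log(0)
    logit = sum(w * math.log((p + eps) / (1 - p + eps))
                for w, p in zip(weights, model_scores)) / sum(weights)
    fraud_prob = 1 / (1 + math.exp(-logit))
    return 1.0 - fraud_prob              # high score = high-quality click

if __name__ == "__main__":
    # e.g. rule-based, duplicate-detector, and session-behavior model outputs
    print(round(quality_score([0.9, 0.7, 0.2], weights=[2, 1, 1]), 3))
```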
    We assigned a quality score to each click instead of classifying it as fraudulent or genuine, because it is hard to obtain solid evidence of click fraud from the collected data alone, and it is difficult to determine the real intention of the users who make the clicks. Results from the initial version revealed that the diversity of click fraud attack types makes it hard for a single countermeasure to prevent click fraud; it is therefore important to combine multiple measures capable of effective protection. Accordingly, in the improved version of CCFDP, we provide the traffic quality score as a combination of evidence from several data mining algorithms. We tested the system with data from an actual ad campaign in 2007 and 2008 and compared the results with Google AdWords reports for the same campaign. The results show that a high percentage of click fraud is present even with the most popular search engine; the multiple-model-based CCFDP always estimated less valid traffic than Google, sometimes by as much as 53%. Fast and efficient duplicate detection is one of the most important requirements in any click fraud solution. Duplicate detection algorithms usually run in real time, so solution providers should use data structures that can be updated in real time, and the space required to hold the data should be minimal. In this dissertation, we also addressed the problem of detecting duplicate clicks in pay-per-click streams. We proposed a simple data structure, the Temporal Stateful Bloom Filter (TSBF), an extension of the regular Bloom Filter and Counting Bloom Filter (CBF) in which the bit vector is replaced with a status vector. The duplicate detection results of TSBF are compared with the Buffering, FPBuffering, and CBF methods. The false positive rate of TSBF is less than 1%, and it has no false negatives; its space requirement is the smallest among these solutions. Even though Buffering has neither false positives nor false negatives, its space requirement increases exponentially with the stream size. When the false positive rate of FPBuffering is set to 1%, its false negative rate jumps to around 5%, which most streaming data applications will not tolerate. Compared with CBF, TSBF uses half the space of a standard CBF or less at the same false positive probability. One of the biggest successes of CCFDP is the discovery of a new mercantile click bot, the Smart ClickBot. We presented a Bayesian approach for detecting Smart ClickBot clicks. The system combines evidence extracted from web server sessions to determine the final class of each click; some of this evidence can be used alone, while some can be used in combination with other features for click bot detection. During training and testing we also addressed the class imbalance problem. Our best classifier shows a recall of 94% and a precision of 89%, with an F1 measure of 92%. The high accuracy of our system proves the effectiveness of the proposed methodology. Since the Smart ClickBot is a sophisticated click bot that manipulates every possible parameter to go undetected, the techniques discussed here can lead to the detection of other types of software bots too.
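
    The exact status-vector encoding of the TSBF described above is not detailed in this abstract. The following is a simplified, hypothetical sketch of the same general idea: a Bloom-filter-like structure whose cells hold last-seen timestamps instead of bits, so that "membership" expires with the time window.

```python
import random

class TimestampBloomFilter:
    """Simplified time-aware Bloom filter for duplicate-click detection,
    loosely inspired by TSBF: each cell holds a last-seen timestamp rather
    than a bit, so membership expires after `window` time units."""
    def __init__(self, size=1 << 16, hashes=4, window=60, seed=7):
        rng = random.Random(seed)
        self.size, self.window = size, window
        self.cells = [0] * size           # 0 means the cell was never touched
        self.salts = [rng.getrandbits(32) for _ in range(hashes)]

    def _positions(self, key):
        return [hash((s, key)) % self.size for s in self.salts]

    def seen_recently(self, key, now):
        """Return True if `key` was (probably) seen within the window, then
        record this occurrence. No false negatives within the window."""
        pos = self._positions(key)
        duplicate = all(self.cells[i] > 0 and now - self.cells[i] <= self.window
                        for i in pos)
        for i in pos:
            self.cells[i] = now
        return duplicate

if __name__ == "__main__":
    f = TimestampBloomFilter(window=60)
    print(f.seen_recently(("1.2.3.4", "ad42"), now=100))   # False: first click
    print(f.seen_recently(("1.2.3.4", "ad42"), now=130))   # True: duplicate
    print(f.seen_recently(("1.2.3.4", "ad42"), now=300))   # False: expired
```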
    Despite the enormous capabilities of modern machine learning and data mining techniques in modeling complicated problems, most available click fraud detection systems are rule-based. Click fraud solution providers keep their rules as a secret weapon and bargain with others to prove their superiority. We proposed a validation framework that acquires another model of the click data, one that is not rule-dependent but instead learns the inherent statistical regularities of the data; the outputs of the two models are then compared. Due to the uniqueness of the CCFDP system architecture, it is better than current commercial solutions and search engine/ISP solutions. The system protects pay-per-click advertisers from click fraud and improves their return on investment (ROI). It can also provide an arbitration mechanism for advertisers and PPC publishers whenever a click fraud dispute arises, so advertisers can gain confidence in PPC advertising by having a channel through which to contest traffic quality with big search engine publishers. The results of this system will bolster the Internet economy by eliminating a shortcoming of the PPC business model, and general consumers will gain confidence in Internet business models through the reduction of fraudulent activities, which are numerous in the current virtual Internet world.

    Probabilistic data types

    Integrated master's dissertation in Informatics Engineering. Conflict-Free Replicated Data Types (CRDTs) provide deterministic outcomes from concurrent executions. The conflict resolution mechanism uses information on the ordering of the last operations performed, which indicates whether a given operation is known by a replica, typically using some variant of version vectors. This thesis explores the construction of CRDTs that use a novel stochastic mechanism that can track knowledge of recently performed operations with high accuracy, and of older operations with less accuracy. The aim is to obtain better scaling properties and avoid the use of metadata that is linear in the number of replicas.
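
    The thesis's stochastic mechanism is not described beyond this summary. Purely as an illustration of "exact for recent operations, approximate for old ones", a hypothetical tracker can keep the newest operation IDs in a bounded exact buffer and demote evicted ones into a Bloom filter; every name below is our own, not the thesis's.

```python
import random
from collections import OrderedDict

class RecencyBiasedOpTracker:
    """Illustrative only (not the thesis's mechanism): recent op IDs are
    tracked exactly in a bounded buffer; evicted ones are demoted into a
    Bloom filter, so old operations are remembered only probabilistically."""
    def __init__(self, exact_capacity=128, filter_size=1 << 12, hashes=3):
        self.exact = OrderedDict()        # op_id -> None, newest last
        self.capacity = exact_capacity
        self.bits = [False] * filter_size
        self.salts = [random.Random(i).getrandbits(32) for i in range(hashes)]
        self.size = filter_size

    def _positions(self, op_id):
        return [hash((s, op_id)) % self.size for s in self.salts]

    def record(self, op_id):
        self.exact[op_id] = None
        self.exact.move_to_end(op_id)
        if len(self.exact) > self.capacity:
            old, _ = self.exact.popitem(last=False)
            for i in self._positions(old):   # demote: exact -> probabilistic
                self.bits[i] = True

    def knows(self, op_id):
        """Exact for recent ops; may return false positives for old ones."""
        if op_id in self.exact:
            return True
        return all(self.bits[i] for i in self._positions(op_id))

if __name__ == "__main__":
    t = RecencyBiasedOpTracker(exact_capacity=4)
    for op in ["a1", "a2", "a3", "a4", "a5", "a6"]:
        t.record(op)
    print(t.knows("a6"), t.knows("a1"), t.knows("zzz"))  # True, True, (likely) False
```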