13 research outputs found

    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs and social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). To begin, the proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms will run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
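
As a much-simplified illustration of single-pass edge sampling with unbiased estimators, the sketch below keeps each streamed edge independently with a fixed probability `p` and rescales the sample counts by the inverse inclusion probabilities (Horvitz-Thompson style). This is plain uniform edge sampling, not the gSH scheme itself; the function names and the parameter `p` are assumptions made for illustration, and the stream is assumed to contain no duplicate edges or self-loops.

```python
import random
from itertools import combinations


def sample_edges(edge_stream, p, seed=None):
    """Single pass over the stream; keep each edge independently with probability p."""
    rng = random.Random(seed)
    adjacency = {}  # sampled graph stored as node -> set of neighbours
    for u, v in edge_stream:
        if rng.random() < p:
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return adjacency


def estimate_edge_count(adjacency, p):
    """Each edge survives with probability p, so divide the sampled count by p."""
    sampled_edges = sum(len(nbrs) for nbrs in adjacency.values()) // 2
    return sampled_edges / p


def estimate_triangle_count(adjacency, p):
    """A triangle survives only if all three edges do, i.e. with probability p**3."""
    corner_hits = 0
    for nbrs in adjacency.values():
        for a, b in combinations(nbrs, 2):
            if b in adjacency.get(a, ()):
                corner_hits += 1
    # Every sampled triangle is found once from each of its three corners.
    return (corner_hits // 3) / p**3
```

With `p = 1.0` the estimates coincide with the exact counts; smaller values of `p` trade accuracy for a smaller state.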

    Edge-based mining of frequent subgraphs from graph streams

    In the current era of Big data, high volumes of valuable data can be generated at a high velocity from a wide variety of data sources in various real-life applications ranging from sensor networks to social networks, from bio-informatics to chemical informatics. In addition, Big data are also available in business, education, engineering, finance, healthcare, scientific, telecommunication, and transportation domains. A collection of these data can be viewed as a big dynamic graph structure. Embedded in them is implicit, previously unknown, and potentially useful knowledge. Consequently, efficient knowledge discovery algorithms for mining frequent subgraphs from these dynamic streaming graph-structured data are in demand. On the one hand, some existing algorithms discover collections of frequently co-occurring edges, which may be disjoint. On the other hand, some other existing algorithms discover frequent subgraphs but require very large memory space. With high volumes of Big data, available memory space may be limited. To discover collections of frequently co-occurring connected edges, we present in this paper two efficient algorithms that require small memory space. Evaluation results show the efficiency of our edge-based algorithms in mining frequent subgraphs from graph streams.

    Next Generation Machine Learning Based Real Time Fraud Detection

    Define a real-time monitoring architecture that can scale as the network of devices monitored grows. From the research work carried out and the knowledge about the nature of the business, it was possible to develop a clustering methodology over the data streams that allows patterns on entities to be detected. The methodology used is based on the concept of micro-cluster, which is a structure that maintains a summary of the patterns detected on entities. In telecommunications there are several schemes to defraud the telecommunications companies, causing great financial losses. We can consider three major categories in telecom fraud based on who the fraudsters are targeting. These categories are: Traffic Pumping Schemes, Defraud Telecom Service Providers, Conducted Over the Telephone. Traffic Pumping Schemes use "access stimulation" techniques to boost traffic to a high-cost destination, which then shares the revenue with the fraudster. Defraud Telecom Service Providers are the most complicated, and exploit telecom service providers using SIP trunking, regulatory loopholes, and more. Conducted Over the Telephone, also known as "Phone Fraud", covers all types of general fraud that are perpetrated over the telephone. Telecommunications fraud negatively impacts everyone, including good paying customers. The losses increase the companies' operating costs. While telecom companies take every measure to stop the fraud and reduce their losses, the criminals continue their attacks on companies with perceived weaknesses. The telecom business is facing a serious hazard that is growing as fast as the industry itself. The Communications Fraud Control Association (CFCA) stated that telecom fraud represented nearly $30 billion globally in 2017 \cite{telecomengine}. Another problem is to stay on top of the game with effective anti-fraud technologies.
    The need to ensure a secure and trustable Internet of Things (IoT) network brings the challenge of continuously monitoring massive volumes of streaming machine data. Therefore a different approach is required in the scope of Fraud Detection, where detection engines need to detect risk situations in real time and be able to adapt themselves to evolving behavior patterns. Machine learning based online anomaly detection can support this new approach. For applications involving several data streams, the challenge of detecting anomalies has become harder over time, as data can dynamically evolve in subtle ways following changes in the underlying infrastructure. The goal of this paper is to research existing online anomaly detection algorithms to select a set of candidates in order to test them in Fraud Detection scenarios.
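
The abstract does not spell out what its micro-cluster structure stores. A common choice in stream clustering is a cluster-feature style summary (point count, per-dimension sum, per-dimension sum of squares), and the sketch below assumes that layout. The class name `MicroCluster`, its fields, and the helper `distance_to` are hypothetical and are not taken from the work itself.

```python
import math


class MicroCluster:
    """Additive summary of the points absorbed so far (assumed layout, see above)."""

    def __init__(self, dim):
        self.n = 0                       # number of points absorbed
        self.linear_sum = [0.0] * dim    # per-dimension sum of values
        self.square_sum = [0.0] * dim    # per-dimension sum of squared values

    def insert(self, point):
        """Absorb one point in O(dim) time without storing the point itself."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
            self.square_sum[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Square root of the summed per-dimension variance of the absorbed points."""
        variance = sum(
            sq / self.n - (s / self.n) ** 2
            for s, sq in zip(self.linear_sum, self.square_sum)
        )
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding errors


def distance_to(cluster, point):
    """Euclidean distance from a point to the cluster centroid."""
    return math.dist(cluster.centroid(), point)
```

One possible use, again an assumption and not the method described above, is to flag an incoming record as suspicious when its distance to the nearest micro-cluster is large relative to that cluster's radius.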

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling, by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.
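
To make the idea of adapting a static edge sampler to a stream concrete, the sketch below uses classic reservoir sampling to keep a uniform sample of `k` edges from a stream of unknown length. It is a standard textbook technique shown for illustration only, not the induction-based methods proposed in the paper; the function name and parameters are assumptions.

```python
import random


def reservoir_sample_edges(edge_stream, k, seed=None):
    """Keep a uniform random sample of at most k edges in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # The first k edges fill the reservoir directly.
            reservoir.append(edge)
        else:
            # Edge number i (0-based) replaces a stored edge with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = edge
    return reservoir
```

Memory use is bounded by `k` regardless of stream length, which is the property a streaming computational model requires.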

    Discovering Dynamic Communities in Interaction Networks
