354 research outputs found

    Mining Frequent Item sets in Data Streams

    TAPER: query-aware, partition-enhancement for large, heterogenous, graphs

    Graph partitioning has long been seen as a viable approach to address Graph DBMS scalability. A partitioning, however, may introduce extra query processing latency unless it is sensitive to a specific query workload, and optimised to minimise inter-partition traversals for that workload. Additionally, it should also be possible to incrementally adjust the partitioning in reaction to changes in the graph topology, the query workload, or both. Because of their complexity, current partitioning algorithms fall short of one or both of these requirements, as they are designed for offline use and as one-off operations. The TAPER system aims to address both requirements, whilst leveraging existing partitioning algorithms. TAPER takes any given initial partitioning as a starting point, and iteratively adjusts it by swapping chosen vertices across partitions, heuristically reducing the probability of inter-partition traversals for a given workload of pattern matching queries. Iterations are inexpensive thanks to time and space optimisations in the underlying support data structures. We evaluate TAPER on two different large test graphs and over realistic query workloads. Our results indicate that, given a hash-based partitioning, TAPER reduces the number of inter-partition traversals by around 80%; given an unweighted METIS partitioning, by around 30%. These reductions are achieved within 8 iterations and with the additional advantage of being workload-aware and usable online. Comment: 12 pages, 11 figures, unpublished
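
The idea of iteratively swapping vertices to cut down inter-partition traversals can be illustrated with a toy sketch. This is not TAPER's actual heuristic or data structures: the names `inter_partition_traversals` and `improve_by_swaps`, the per-edge traversal frequencies, and the exhaustive pairwise search are all assumptions made for illustration, and the search is only practical on tiny graphs.

```python
from itertools import combinations


def inter_partition_traversals(traversals, part):
    """Sum the frequencies of traversed edges whose endpoints lie in different partitions."""
    return sum(freq for (u, v), freq in traversals.items() if part[u] != part[v])


def improve_by_swaps(traversals, part, max_iterations=8):
    """Greedily swap one cross-partition vertex pair per iteration while the cost drops.

    Swapping (rather than moving) vertices keeps every partition the same size.
    """
    part = dict(part)  # work on a copy so the caller's partitioning is untouched
    for _ in range(max_iterations):
        base = inter_partition_traversals(traversals, part)
        best_gain, best_pair = 0, None
        for u, v in combinations(sorted(part), 2):
            if part[u] == part[v]:
                continue
            part[u], part[v] = part[v], part[u]  # try the swap
            gain = base - inter_partition_traversals(traversals, part)
            part[u], part[v] = part[v], part[u]  # undo it
            if gain > best_gain:
                best_gain, best_pair = gain, (u, v)
        if best_pair is None:
            break  # no swap improves the cost any further
        u, v = best_pair
        part[u], part[v] = part[v], part[u]
    return part


# Hypothetical workload: how often each edge is traversed by the queries.
traversals = {("a", "b"): 5, ("b", "c"): 4, ("c", "d"): 1, ("a", "c"): 3}
hash_like = {"a": 0, "b": 1, "c": 0, "d": 1}  # stand-in for a hash-based partitioning
improved = improve_by_swaps(traversals, hash_like)
before = inter_partition_traversals(traversals, hash_like)
after = inter_partition_traversals(traversals, improved)
```

Because vertices are only ever swapped, the partition sizes of the starting point are preserved, which is one simple way to keep balance while reducing the workload-weighted cut.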

    Mining frequent sequential patterns in data streams using SSM-algorithm.

    Frequent sequential mining is the process of discovering frequent sequential patterns in data sequences, as found in applications like web log access sequences. In data stream applications, data arrive at high speed in a continuous flow. Data stream mining is an online process, different from traditional mining: traditional mining algorithms work on an entire static dataset in order to obtain results, while data stream mining algorithms work with continuously arriving data streams. With rapid change in technology, there are many applications that take data as continuous streams. Examples include stock tickers, network traffic measurements, click stream data, data feeds from sensor networks, and telecom call records. Mining frequent sequential patterns in data stream applications must contend with many challenges, such as limited memory for unbounded data, the inability of algorithms to scan the infinitely flowing original dataset more than once, and the need to deliver current and accurate results on demand. This thesis proposes the SSM-Algorithm (sequential stream mining algorithm), which delivers frequent sequential patterns in data streams. The concept of this work came from the FP-Stream algorithm, which delivers time-sensitive frequent patterns. The proposed SSM-Algorithm outperforms the FP-Stream algorithm through the use of one hash-based and two efficient tree-based data structures. All incoming streams are handled dynamically to improve memory usage. SSM-Algorithm maintains frequent sequences incrementally and delivers the most current result on demand. The introduced algorithm can be deployed to analyze e-commerce data where the primary source of the data is click stream data. (Abstract shortened by UMI.) Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .M668. Source: Masters Abstracts International, Volume: 44-03, page: 1409. Thesis (M.Sc.)--University of Windsor (Canada), 2005
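
The constraints described above (one pass over the data, bounded memory, results on demand) can be made concrete with a small sketch. This is not the SSM-Algorithm or FP-Stream: the class `StreamingPairMiner`, its parameters, and the pruning rule are assumptions for illustration. It only tracks ordered item pairs, and pruning can undercount a pattern that is rare early and frequent later.

```python
from collections import Counter
from itertools import combinations


class StreamingPairMiner:
    """Single-pass counter of ordered item pairs over batches of sequences (illustrative only)."""

    def __init__(self, min_support, prune_support):
        self.min_support = min_support      # fraction of sequences a reported pattern must reach
        self.prune_support = prune_support  # patterns below this fraction are forgotten after a batch
        self.counts = Counter()
        self.seen = 0

    def add_batch(self, sequences):
        for seq in sequences:
            # combinations() keeps stream order, so (x, y) means x occurs before y.
            # set() counts each pattern at most once per sequence.
            self.counts.update(set(combinations(seq, 2)))
            self.seen += 1
        # Bound memory by dropping patterns that are currently too rare.
        floor = self.prune_support * self.seen
        self.counts = Counter({p: c for p, c in self.counts.items() if c >= floor})

    def frequent(self):
        """Return the patterns whose count meets the support threshold right now."""
        threshold = self.min_support * self.seen
        return {p: c for p, c in self.counts.items() if c >= threshold}


# Hypothetical click-stream sessions arriving in two batches.
miner = StreamingPairMiner(min_support=0.5, prune_support=0.4)
miner.add_batch([["home", "cart", "pay"], ["home", "search", "cart"]])
miner.add_batch([["home", "cart"], ["search", "home"]])
result = miner.frequent()
```

Each batch is read once and then discarded; only the surviving counters are kept between batches, which is the trade-off between memory and accuracy that stream miners have to manage.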

    Modelling Web Usage in a Changing Environment

    Eiben, A.E. [Promotor]; Kowalczyk, W. [Copromotor]

    Rethinking Privacy and Security Mechanisms in Online Social Networks

    With billions of users, Online Social Networks (OSNs) are amongst the largest-scale communication applications on the Internet. OSNs enable users to easily access local and worldwide news, as well as share information publicly and interact with friends. On the negative side, OSNs are also abused by spammers to distribute ads or malicious information, such as scams and fraud, and even to manipulate public political opinion. Having achieved significant commercial success with large amounts of user information, OSNs do treat the security and privacy of their users seriously and provide several mechanisms to reinforce their account security and information privacy. However, the efficacy of those measures is either not thoroughly validated or in need of improvement. In light of cyber criminals and potential privacy threats on OSNs, we focus in this dissertation on the evaluation and improvement of OSN user privacy configurations, account security protection mechanisms, and trending topic security. We first examine the effectiveness of OSN privacy settings in protecting user privacy. Given each privacy configuration, we propose a corresponding scheme to reveal the target user's basic profile and connection information, starting from some leaked connections on the user's homepage. Based on the dataset we collected on Facebook, we calculate the privacy exposure in each privacy setting type and measure the accuracy of our privacy inference schemes with different amounts of public information. The evaluation results show that (1) a user's private basic profile can be inferred with high accuracy and (2) a significant portion of connections can be revealed based on even a small number of directly leaked connections. Secondly, we propose a behavioral-profile-based method to detect OSN user account compromisation in a timely manner. Specifically, we propose eight behavioral features to portray a user's social behavior. A user's statistical distributions of those feature values comprise its behavioral profile. Based on the sample data we collected from Facebook, we observe that each user's activities are highly likely to conform to its behavioral profile, while the profiles of two different users tend to diverge from each other, which can be employed for compromisation detection. The evaluation results show that the more complete and accurate a user's behavioral profile is, the more accurately compromisation can be detected. Finally, we investigate the manipulation of OSN trending topics. Based on the dataset we collected from Twitter, we demonstrate the manipulation of trending and a suspected spamming infrastructure. We then measure how accurately five factors (popularity, coverage, transmission, potential coverage, and reputation) can predict trending using an SVM classifier. We further study the interaction patterns between authenticated accounts and malicious accounts in trending. Lastly, we demonstrate the threats of compromised accounts and sybil accounts to trending through simulation and discuss countermeasures against trending manipulation

    Workload-sensitive approaches to improving graph data partitioning online

    PhD Thesis. Many modern applications, from social networks to network security tools, rely upon the graph data model, using it as part of an offline analytics pipeline or, increasingly, for storing and querying data online, e.g. in a graph database management system (GDBMS). Unfortunately, effective horizontal scaling of this graph data reduces to the NP-Hard problem of “k-way balanced graph partitioning”. Owing to the problem’s importance, several practical approaches exist, producing quality graph partitionings. However, these existing systems are unsuitable for partitioning online graphs, either introducing unnecessary network latency during query processing, being unable to efficiently adapt to changing data and query workloads, or both. In this thesis we propose partitioning techniques which are efficient and sensitive to given query workloads, suitable for application to online graphs and query workloads. To incrementally adapt partitionings in response to workload change, we propose TAPER: a graph repartitioner. TAPER uses novel data structures to compute the probability of expensive inter-partition traversals (ipt) from each vertex, given the current workload of path queries. Subsequently, it iteratively adjusts an initial partitioning by swapping selected vertices amongst partitions, heuristically maintaining low ipt and high partition quality with respect to that workload. Iterations are inexpensive thanks to time and space optimisations in the underlying data structures. To incrementally create partitionings in response to graph growth, we propose Loom: a streaming graph partitioner. Loom uses another novel data structure to detect common patterns of edge traversals when executing a given workload of pattern matching queries. Subsequently, it employs a probabilistic graph isomorphism method to incrementally and efficiently compare sub-graphs in the stream of graph updates to these common patterns. Matches are assigned within individual partitions if possible, thereby also reducing ipt and increasing partitioning quality with respect to the given workload. Both partitioner and repartitioner are extensively evaluated with real and synthetic graph datasets and query workloads. The headline results include that TAPER can reduce ipt by up to 80% over a naive existing partitioning and can maintain this reduction in the event of workload change, through additional iterations. Meanwhile, Loom reduces ipt by up to 40% over a state-of-the-art streaming graph partitioner

    Adaptive Learning and Mining for Data Streams and Frequent Patterns

    This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm, ADWIN, for detecting change and keeping updated statistics from a data stream, and use it as a black box in place of counters or accumulators in algorithms initially not designed for drifting data.
Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods, such as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. And finally, using these results on evolving data stream mining and closed frequent tree mining, we present high performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental algorithm, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams. Postprint (published version)
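
The adaptive sliding window idea can be sketched in a few lines: keep a window of recent values and drop its oldest part whenever two sub-windows have averages that differ by more than a statistical bound. The version below is a naive illustration, not the thesis's implementation. The class name `NaiveAdaptiveWindow` is hypothetical, inputs are assumed to lie in [0, 1], every split point is rescanned on each insertion (no compressed bucket structure), and the Hoeffding-style threshold is an assumption about the form of the cut condition.

```python
import math
from collections import deque


class NaiveAdaptiveWindow:
    """Simplified adaptive window: shrink when old and recent averages disagree."""

    def __init__(self, delta=0.01):
        self.delta = delta      # confidence parameter: smaller means fewer false alarms
        self.window = deque()

    def add(self, x):
        """Append x (assumed in [0, 1]); return True if old data was dropped."""
        self.window.append(x)
        changed = False
        while self._drop_stale_prefix():
            changed = True
        return changed

    def _drop_stale_prefix(self):
        values = list(self.window)  # snapshot, so the deque can be shrunk safely afterwards
        n = len(values)
        total = sum(values)
        head_sum = 0.0
        for n0 in range(1, n):
            head_sum += values[n0 - 1]
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-mean style effective sample size
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) > eps:
                for _ in range(n0):
                    self.window.popleft()  # forget the older sub-window
                return True
        return False

    def mean(self):
        return sum(self.window) / len(self.window) if self.window else 0.0


# Hypothetical stream whose average jumps from 0 to 1 halfway through.
detector = NaiveAdaptiveWindow(delta=0.01)
stream = [0.0] * 200 + [1.0] * 200
detections = [i for i, x in enumerate(stream) if detector.add(x)]
```

Used as a black box, such a window can replace a plain counter or running average: the estimate `detector.mean()` tracks the recent distribution, and the boolean returned by `add` signals that the underlying data has drifted.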