35 research outputs found

    Mining Frequent Patterns in Uncertain and Relational Data Streams using the Landmark Windows

    Get PDF
    Todays, in many modern applications, we search for frequent and repeating patterns in the analyzed data sets. In this search, we look for patterns that frequently appear in data set and mark them as frequent patterns to enable users to make decisions based on these discoveries. Most algorithms presented in the context of data stream mining and frequent pattern detection, work either on uncertain data, or use the sliding window model to assess data streams. Sliding window model uses a fixed-size window to only maintain the most recently inserted data and ignores all previous data (or those that are out of its window). Many real-world applications however require maintaining all inserted or obtained data. Therefore, the question arises that whether other window models can be used to find frequent patterns in dynamic streams of uncertain data.In this paper, we used landmark window model and time-fading model to answer that question. The method presented in the form of proposed algorithm, which uses the idea of landmark window model to find frequent patterns in the relational and uncertain data streams, shows a better performance in finding functional dependencies than other methods in this field. Another advantage of this method compared with other methods is that it shows tuples that do not follow a single dependency. This feature can be used to detect inconsistent data in a data set

    HUPSMT: AN EFFICIENT ALGORITHM FOR MINING HIGH UTILITY-PROBABILITY SEQUENCES IN UNCERTAIN DATABASES WITH MULTIPLE MINIMUM UTILITY THRESHOLDS

    Get PDF
    The problem of high utility sequence mining (HUSM) in quantitative se-quence databases (QSDBs) is more general than that of frequent sequence mining in se-quence databases. An important limitation of HUSM is that a user-predened minimum tility threshold is used commonly to decide if a sequence is high utility. However, this is not convincing in many real-life applications as sequences may have diferent importance. Another limitation of HUSM is that data in QSDBs are assumed to be precise. But in the real world, collected data such as by sensor maybe uncertain. Thus, this paper proposes a framework for mining high utility-probability sequences (HUPSs) in uncertain QSDBs (UQS-DBs) with multiple minimum utility thresholds using a minimum utility. Two new width and depth pruning strategies are also introduced to early eliminate low utility or low probability sequences as well as their extensions, and to reduce sets of candidate items for extensions during the mining process. Based on these strategies, a novel ecient algorithm named HUPSMT is designed for discovering HUPSs. Finally, an experimental study conducted in both real-life and synthetic UQSDBs shows the performance of HUPSMT in terms of time and memory consumption

    A Comprehensive Bibliometric Analysis on Social Network Anonymization: Current Approaches and Future Directions

    Full text link
    In recent decades, social network anonymization has become a crucial research field due to its pivotal role in preserving users' privacy. However, the high diversity of approaches introduced in relevant studies poses a challenge to gaining a profound understanding of the field. In response to this, the current study presents an exhaustive and well-structured bibliometric analysis of the social network anonymization field. To begin our research, related studies from the period of 2007-2022 were collected from the Scopus Database then pre-processed. Following this, the VOSviewer was used to visualize the network of authors' keywords. Subsequently, extensive statistical and network analyses were performed to identify the most prominent keywords and trending topics. Additionally, the application of co-word analysis through SciMAT and the Alluvial diagram allowed us to explore the themes of social network anonymization and scrutinize their evolution over time. These analyses culminated in an innovative taxonomy of the existing approaches and anticipation of potential trends in this domain. To the best of our knowledge, this is the first bibliometric analysis in the social network anonymization field, which offers a deeper understanding of the current state and an insightful roadmap for future research in this domain.Comment: 73 pages, 28 figure

    Bayesian Nonparametric Relational Topic Model through Dependent Gamma Processes

    Full text link
    © 2016 IEEE. Traditional relational topic models provide a successful way to discover the hidden topics from a document network. Many theoretical and practical tasks, such as dimensional reduction, document clustering, and link prediction, could benefit from this revealed knowledge. However, existing relational topic models are based on an assumption that the number of hidden topics is known a priori, which is impractical in many real-world applications. Therefore, in order to relax this assumption, we propose a nonparametric relational topic model using stochastic processes instead of fixed-dimensional probability distributions in this paper. Specifically, each document is assigned a Gamma process, which represents the topic interest of this document. Although this method provides an elegant solution, it brings additional challenges when mathematically modeling the inherent network structure of typical document network, i.e., two spatially closer documents tend to have more similar topics. Furthermore, we require that the topics are shared by all the documents. In order to resolve these challenges, we use a subsampling strategy to assign each document a different Gamma process from the global Gamma process, and the subsampling probabilities of documents are assigned with a Markov Random Field constraint that inherits the document network structure. Through the designed posterior inference algorithm, we can discover the hidden topics and its number simultaneously. Experimental results on both synthetic and real-world network datasets demonstrate the capabilities of learning the hidden topics and, more importantly, the number of topics

    Model-based probabilistic frequent itemset mining

    Get PDF
    Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel methods to capture the itemset mining process as a probability distribution function taking two models into account: the Poisson distribution and the normal distribution. These model-based approaches extract frequent itemsets with a high degree of accuracy and support large databases. We apply our techniques to improve the performance of the algorithms for (1) finding itemsets whose frequentness probabilities are larger than some threshold and (2) mining itemsets with the {Mathematical expression} highest frequentness probabilities. Our approaches support both tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate and four orders of magnitudes faster than previous approaches. In further theoretical and experimental studies, we give an intuition which model-based approach fits best to different types of data sets. © 2012 The Author(s).published_or_final_versio

    Event detection in high throughput social media

    Get PDF

    gPrune: A Constraint Pushing Framework for Graph Pattern Mining

    Get PDF
    Abstract. In graph mining applications, there has been an increasingly strong urge for imposing user-specified constraints on the mining results. However, unlike most traditional itemset constraints, structural constraints, such as density and diameter of a graph, are very hard to be pushed deep into the mining process. In this paper, we give the first comprehensive study on the pruning properties of both traditional and structural constraints aiming to reduce not only the pattern search space but the data search space as well. A new general framework, called gPrune, is proposed to incorporate all the constraints in such a way that they recursively reinforce each other through the entire mining process. A new concept, Pattern-inseparable Data-antimonotonicity, is proposed to handle the structural constraints unique in the context of graph, which, combined with known pruning properties, provides a comprehensive and unified classification framework for structural constraints. The exploration of these antimonotonicities in the context of graph pattern mining is a significant extension to the known classification of constraints, and deepens our understanding of the pruning properties of structural graph constraints.

    Event detection in high throughput social media

    Get PDF

    Advances in knowledge discovery and data mining Part II

    Get PDF
    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II</p

    Network monitoring system using machine learning comparative analysis of classification techniques for network traffic monitoring

    Get PDF
    Çevrimiçi ağ trafiği sınıflandırması, uzun vadeli ilginin odak noktası olmaya devam ediyor. Ağ trafiğini izleme ve Ağ trafiği analizi birçok farklı yoldan yapılabilir. Genellikle, ağ trafiğini izleme, hizmet kalitesi (QoS) ve izinsiz giriş tespiti için ham veri girişi sağla. Özellikle, ağ trafiğini izleme, ağ analistine ağ kaynaklarını nasıl kullandığını anlama ve ağ performansını belirleme olanağı sağlar. Bu bilgi ile ağ analisti, ağ kaynaklarını kontrol etmek ve yönetmek için QoS politikalarını ayarlayabilir. Bu amaca, ağdaki belirli veri tipleri için önceliklerin ayarlanması ve trafiğin yönetmeliklere uyması için günlüğe kaydedilmesi ile ulaşılmaktadır. Ağ trafiğinin izlenmesi akademik araştırma için modeller oluşturmak için kullanılabilir. Bu tezde, en yakın optimizasyona ulaşmak için Karar Ağacı Algoritmasını (DT) kullanarak ve Temel Bileşen Analizi (PCA) Algoritmasını kullanarak ağ trafiğini doğru şekilde sınıflandıran bir makine öğrenme yaklaşımı sunulmaktadır. Makine öğrenimi teknolojisi, yüksek doğrulukta veri madenciliği teknikleri ve ileri istatistiklerin bir sonucu olarak ağ trafiğini izlemek ve sınıflandırmak için daha iyi çözümler üretecektir. Bu tezin amacı, hem çevrimiçi hem de çevrimdışı olarak çalışan modern makine öğrenme teknolojilerini kullanarak bir Ağ İzleme Sistemi (NMS) inşa etmektir. DT algoritması (mevcut veri madenciliği algoritmalarından biri) ağın sınıflandırıcısını oluşturmak için kullanılır. Deney sonuçları, NMS tabanlı sistemin ağ trafiğini başarılı bir şekilde sınıflandırmada %97,7486 doğruluğa (ACC) sahip olduğunu göstermiştir
    corecore