104 research outputs found

    Knowledge discovery techniques for transactional data model

    In this work we give solutions to two key knowledge discovery problems for the transactional data model: cluster analysis and itemset mining. By knowledge discovery in the context of these two problems, we specifically mean novel and useful ways of extracting clusters and itemsets from transactional data. The transactional data model is widely used in a variety of applications. In cluster analysis the goal is to find clusters of similar transactions in the data, with the collective properties of each cluster being unique. We propose the first clustering algorithm for transactional data that uses the latest model definition; all previously proposed algorithms ignored the important utility information in the data, and our technique addresses this gap. We also propose two new cluster validation metrics based on the criterion of high-utility patterns. Compared with competing algorithms, our technique misses far fewer important high-utility patterns. Itemset mining is the problem of searching for repeating patterns of high importance in the data. We show that the current model for itemset mining leads to information loss because it ignores the presence of clusters in the data. We propose a new itemset mining model that incorporates the cluster structure, which allows the model to make predictions for future itemsets. Our model predicts accurately, discovering 30-40% of future itemsets in most experiments on two benchmark datasets with negligible inaccuracies; to our knowledge, no other itemset prediction model currently exists. We further extend the model so that it can make predictions for specific future windows by using time series forecasting. We also perform a detailed analysis of various clustering algorithms and study the effect of the Big Data phenomenon on them, which motivated a further refinement of our model based on a classification problem design. This addition allows itemsets to be mined so as to maximize a customizable objective function composed of different prediction metrics. The final framework is the first of its kind to make itemset predictions using the cluster structure; it can adapt its predictions to a specific future window and customize the mining process to any specified prediction criterion. We implement the framework on a Web analytics dataset and observe that it makes optimal prediction configuration choices with a high accuracy of 0.895.
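
    The utility-aware mining described above rests on the notion of itemset utility over a transactional database. As a rough, self-contained illustration of that notion (not the authors' algorithm), the following sketch computes the utility of candidate itemsets and keeps those above a minimum-utility threshold; the item names, per-unit utilities, and threshold are invented for the example.

```python
# Minimal sketch of itemset utility in a transactional database.
# Transactions map items to purchase quantities; each item has an
# external (per-unit) utility. All names and numbers are illustrative.

transactions = [
    {"bread": 2, "milk": 1, "butter": 1},
    {"bread": 1, "butter": 2},
    {"milk": 3, "butter": 1},
]
unit_utility = {"bread": 1.0, "milk": 2.0, "butter": 5.0}

def itemset_utility(itemset, transactions, unit_utility):
    """Sum the utility of `itemset` over all transactions containing it."""
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_utility[item] for item in itemset)
    return total

# Keep only candidate itemsets whose utility reaches a chosen threshold.
candidates = [{"bread", "butter"}, {"milk", "butter"}, {"bread", "milk"}]
min_utility = 10.0
high_utility = [s for s in candidates
                if itemset_utility(s, transactions, unit_utility) >= min_utility]
print(high_utility)  # the first two candidates reach the threshold
```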

    A study of two problems in data mining: projective clustering and multiple tables association rules mining.

    Ng Ka Ka. Thesis (M.Phil.)--Chinese University of Hong Kong, 2002. Includes bibliographical references (leaves 114-120). Abstracts in English and Chinese.
    Abstract --- p.ii
    Acknowledgement --- p.vii
    Chapter I --- Projective Clustering --- p.1
    Chapter 1 --- Introduction to Projective Clustering --- p.2
    Chapter 2 --- Related Work to Projective Clustering --- p.7
    Chapter 2.1 --- CLARANS - Graph Abstraction and Bounded Optimization --- p.8
    Chapter 2.1.1 --- Graph Abstraction --- p.8
    Chapter 2.1.2 --- Bounded Optimized Random Search --- p.9
    Chapter 2.2 --- OptiGrid - Grid Partitioning Approach and Density Estimation Function --- p.9
    Chapter 2.2.1 --- Empty Space Phenomenon --- p.10
    Chapter 2.2.2 --- Density Estimation Function --- p.11
    Chapter 2.2.3 --- Upper Bound Property --- p.12
    Chapter 2.3 --- CLIQUE and ENCLUS - Subspace Clustering --- p.13
    Chapter 2.3.1 --- Monotonicity Property of Subspaces --- p.14
    Chapter 2.4 --- PROCLUS - Projective Clustering --- p.15
    Chapter 2.5 --- ORCLUS - Generalized Projective Clustering --- p.16
    Chapter 2.5.1 --- Singular Value Decomposition (SVD) --- p.17
    Chapter 2.6 --- An "Optimal" Projective Clustering --- p.17
    Chapter 3 --- EPC: Efficient Projective Clustering --- p.19
    Chapter 3.1 --- Motivation --- p.19
    Chapter 3.2 --- Notations and Definitions --- p.21
    Chapter 3.2.1 --- Density Estimation Function --- p.22
    Chapter 3.2.2 --- 1-d Histogram --- p.23
    Chapter 3.2.3 --- 1-d Dense Region --- p.25
    Chapter 3.2.4 --- Signature Q --- p.26
    Chapter 3.3 --- The overall framework --- p.28
    Chapter 3.4 --- Major Steps --- p.30
    Chapter 3.4.1 --- Histogram Generation --- p.30
    Chapter 3.4.2 --- Adaptive discovery of dense regions --- p.31
    Chapter 3.4.3 --- Count the occurrences of signatures --- p.36
    Chapter 3.4.4 --- Find the most frequent signatures --- p.36
    Chapter 3.4.5 --- Refine the top 3m signatures --- p.37
    Chapter 3.5 --- Time and Space Complexity --- p.38
    Chapter 4 --- EPCH: An extension and generalization of EPC --- p.40
    Chapter 4.1 --- Motivation of the extension --- p.40
    Chapter 4.2 --- Distinguish clusters by their projections in different subspaces --- p.43
    Chapter 4.3 --- EPCH: a generalization of EPC by building histograms with higher dimensionality --- p.46
    Chapter 4.3.1 --- Multidimensional histogram construction and dense region detection --- p.46
    Chapter 4.3.2 --- Compressing data objects to signatures --- p.47
    Chapter 4.3.3 --- Merging Similar Signature Entries --- p.49
    Chapter 4.3.4 --- Associating membership degree --- p.51
    Chapter 4.3.5 --- The choice of Dimensionality d of the Histogram --- p.52
    Chapter 4.4 --- Implementation of EPC2 --- p.53
    Chapter 4.5 --- Time and Space Complexity of EPCH --- p.54
    Chapter 5 --- Experimental Results --- p.56
    Chapter 5.1 --- Clustering Quality Measurement --- p.56
    Chapter 5.2 --- Synthetic Data Generation --- p.58
    Chapter 5.3 --- Experimental setup --- p.59
    Chapter 5.4 --- Comparison between EPC and PROCLUS --- p.60
    Chapter 5.5 --- Comparison between EPCH and ORCLUS --- p.62
    Chapter 5.5.1 --- Dimensionality of the original space and the associated subspace --- p.65
    Chapter 5.5.2 --- Projection not parallel to original axes --- p.66
    Chapter 5.5.3 --- Data objects belong to more than one cluster under fuzzy clustering --- p.67
    Chapter 5.6 --- Scalability of EPC --- p.68
    Chapter 5.7 --- Scalability of EPC2 --- p.69
    Chapter 6 --- Conclusion --- p.71
    Chapter II --- Multiple Tables Association Rules Mining --- p.74
    Chapter 7 --- Introduction to Multiple Tables Association Rule Mining --- p.75
    Chapter 7.1 --- Problem Statement --- p.77
    Chapter 8 --- Related Work to Multiple Tables Association Rules Mining --- p.80
    Chapter 8.1 --- Apriori - A Bottom-up approach to generate candidate sets --- p.80
    Chapter 8.2 --- VIPER - Vertical Mining with various optimization techniques --- p.81
    Chapter 8.2.1 --- Vertical TID Representation and Mining --- p.82
    Chapter 8.2.2 --- FORC --- p.83
    Chapter 8.3 --- Frequent Itemset Counting across Multiple Tables --- p.84
    Chapter 9 --- The Proposed Method --- p.85
    Chapter 9.1 --- Notations --- p.85
    Chapter 9.2 --- Converting Dimension Tables to internal representation --- p.87
    Chapter 9.3 --- The idea of discovering frequent itemsets without joining --- p.89
    Chapter 9.4 --- Overall Steps --- p.91
    Chapter 9.5 --- Binding multiple Dimension Tables --- p.92
    Chapter 9.6 --- Prefix Tree for FT --- p.94
    Chapter 9.7 --- Maintaining frequent itemsets in FI-trees --- p.96
    Chapter 9.8 --- Frequency Counting --- p.99
    Chapter 10 --- Experiments --- p.102
    Chapter 10.1 --- Synthetic Data Generation --- p.102
    Chapter 10.2 --- Experimental Findings --- p.106
    Chapter 11 --- Conclusion and Future Works --- p.112
    Bibliography --- p.11

    Similarity processing in multi-observation data

    Many real-world application domains, such as sensor-monitoring systems for environmental research or medical diagnostic systems, deal with data that is represented by multiple observations. In contrast to single-observation data, where each object is assigned exactly one occurrence, multi-observation data is based on several occurrences that are subject to two key properties: temporal variability and uncertainty. When defining similarity between data objects, these properties play a significant role. In general, methods designed for single-observation data hardly apply to multi-observation data, as they are either not supported by the data models or do not provide sufficiently efficient or effective solutions. Prominent directions incorporating the key properties are the fields of time series, where data is created by temporally successive observations, and uncertain data, where observations are mutually exclusive. This thesis provides research contributions for similarity processing - similarity search and data mining - on time series and uncertain data. The first part of this thesis focuses on similarity processing in time series databases. A variety of similarity measures have recently been proposed that support similarity processing w.r.t. various aspects. In particular, this part deals with time series that consist of periodic occurrences of patterns. For an application scenario from the medical domain, a solution for activity recognition is presented. Finally, the extraction of feature vectors allows the application of spatial index structures, which accelerate search and mining tasks and yield a significant efficiency gain. As feature vectors are potentially of high dimensionality, this part introduces indexing approaches for the high-dimensional space, both for the full-dimensional case and for arbitrary subspaces. The second part of this thesis focuses on similarity processing in probabilistic databases. The presence of uncertainty is inherent in many applications dealing with data collected by sensing devices. Often, the collected information is noisy or incomplete due to measurement or transmission errors. Furthermore, data may be rendered uncertain for privacy-preserving reasons when confidential information is present. This creates a number of challenges in terms of effectively and efficiently querying and mining uncertain data. Existing work in this field either neglects the presence of dependencies or provides only approximate results while applying methods designed for certain data. Other approaches dealing with uncertain data are not able to provide efficient solutions. This part presents query processing approaches that outperform existing solutions for probabilistic similarity ranking. It finally leads to the application of the introduced techniques to data mining tasks, such as the prominent problem of probabilistic frequent itemset mining.

    Many application domains, such as environmental research or medical diagnostics, use sensor-monitoring systems. Such systems must often be able to handle data that is represented by multiple observations. In contrast to data with only one observation (single-observation data), data from multiple observations (multi-observation data) is based on a multitude of observations that are subject to two key properties: temporal variability and data uncertainty. These properties play an important role in similarity search and data mining. Common solutions in these areas that were developed for single-observation data are generally not applicable to objects with multiple observations, either because these approaches are not compatible with the data models or because they offer no solutions that satisfy current demands on quality or efficiency. Well-known research directions dealing with multi-observation data and its key properties are time series analysis and similarity search in probabilistic databases. While the former assumes a temporal ordering of an object's observations, uncertain data objects are based on observations that are mutually dependent or mutually exclusive. This dissertation comprises current research contributions from both of these areas, presenting methods for similarity search and for their application in data mining. The first part of this work deals with similarity search and data mining in time series databases. In particular, time series consisting of periodically occurring patterns are considered. In the context of a medical application scenario, an approach to activity recognition is presented; by means of feature extraction it enables efficient storage and analysis with the help of spatial index structures. For the case of high-dimensional feature vectors, this part introduces two indexing methods for accelerating similarity queries. The first method considers all attributes of the feature vectors, while the second allows the query to be projected onto a user-defined subspace of the vector space. The second part of this work addresses similarity search in the context of probabilistic databases. Data from sensor measurements often have properties that are subject to a certain degree of uncertainty. Due to measurement or transmission errors, measured values are often incomplete or noisy. In various scenarios, for example with personal or medically confidential data, data may also be deliberately perturbed afterwards so that an exact reconstruction of the original information is not possible. These circumstances pose a number of challenges for query techniques and data mining methods. Existing research on uncertain databases often ignores several of these problems: either the presence of dependencies is neglected, or only approximate solutions are offered which allow the application of methods designed for certain data. Other approaches compute exact solutions but do not return the answers within acceptable runtime. This part of the work presents efficient methods for answering similarity queries that return the results in descending order of relevance, i.e., as a ranked list. The presented techniques are finally transferred to problems in probabilistic data mining, for example to solve the problem of frequent itemset mining while taking the full uncertainty information into account.
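
    The probabilistic frequent itemset mining task mentioned above is commonly formalized over uncertain transactions in which each item carries an existential probability. The sketch below is a minimal illustration of that formulation under the usual item-independence assumption, not the thesis' own algorithms; the transactions and probabilities are made up.

```python
# Expected support of an itemset in an uncertain (probabilistic) database.
# Each transaction maps items to the probability that the item is present.
# Under item independence, the probability that a transaction contains the
# whole itemset is the product of the item probabilities, and the expected
# support is the sum of these probabilities over all transactions.
from math import prod

uncertain_db = [
    {"a": 0.9, "b": 0.7, "c": 0.2},
    {"a": 0.5, "b": 0.4},
    {"b": 0.8, "c": 0.6},
]

def expected_support(itemset, db):
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in db)

print(expected_support({"a", "b"}, uncertain_db))  # 0.9*0.7 + 0.5*0.4 + 0 = 0.83
```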

    Fuzzy-Granular Based Data Mining for Effective Decision Support in Biomedical Applications

    Due to the complexity of biomedical problems, adaptive and intelligent knowledge discovery and data mining systems are needed to help humans understand the inherent mechanisms of diseases. For biomedical classification problems it is typically impossible to build a perfect classifier with 100% prediction accuracy, so a more realistic target is to build an effective Decision Support System (DSS). In this dissertation, a novel adaptive Fuzzy Association Rules (FARs) mining algorithm, named FARM-DS, is proposed to build such a DSS for binary classification problems in the biomedical domain. Empirical studies show that FARM-DS is competitive with state-of-the-art classifiers in terms of prediction accuracy. More importantly, FARs can provide strong decision support for disease diagnosis due to their easy interpretability. This dissertation also proposes a fuzzy-granular method to select informative and discriminative genes from huge microarray gene expression data. With fuzzy granulation, information loss in the process of gene selection is decreased; as a result, more informative genes for cancer classification are selected and more accurate classifiers can be modeled. Empirical studies show that the proposed method is more accurate than traditional algorithms for cancer classification, and hence we expect the selected genes to be more helpful for further biological studies.
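
    Fuzzy granulation, the ingredient behind the gene selection method above, replaces hard discretization by graded memberships in overlapping granules. The sketch below illustrates the general idea with simple triangular membership functions; the granule breakpoints and the example value are assumptions, not the dissertation's actual parameters.

```python
# Triangular membership functions mapping a normalized expression value
# into three overlapping fuzzy granules: "low", "medium", "high".
# Breakpoints (0.0, 0.5, 1.0) are illustrative only.

def triangular(x, a, b, c):
    """Membership of x in a triangle with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzify(x):
    return {
        "low":    triangular(x, -0.5, 0.0, 0.5),
        "medium": triangular(x,  0.0, 0.5, 1.0),
        "high":   triangular(x,  0.5, 1.0, 1.5),
    }

# A value near a granule boundary keeps partial membership in both granules,
# which is what reduces information loss compared to hard discretization.
print(fuzzify(0.4))  # {'low': 0.2, 'medium': 0.8, 'high': 0.0}
```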

    State of the Art in Privacy Preserving Data Mining

    Privacy is one of the most important properties an information system must satisfy. A relatively new trend shows that classical access control techniques are not sufficient to guarantee privacy when data mining techniques are used. Such a trend, especially in the context of public databases or of sensitive information related to critical infrastructures, represents nowadays a non-negligible threat. Privacy Preserving Data Mining (PPDM) algorithms have recently been introduced with the aim of modifying the database in such a way as to prevent the discovery of sensitive information. This is a very complex task, and the scientific literature contains several different approaches to the problem. In this work we present a survey of the current PPDM methodologies that seem promising for the future.
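
    One family of PPDM methods covered by such surveys perturbs the data before release so that individual records are masked while aggregate patterns remain approximately intact. The sketch below shows the simplest variant, additive random noise on a numeric attribute, purely as an illustration; the data and the noise scale are invented and nothing here is specific to this survey.

```python
# Additive-noise perturbation of a numeric attribute: each released value
# is the original value plus independent Gaussian noise. Aggregates such
# as the mean stay close to the original, while single records are masked.
import random

random.seed(0)

ages = [23, 37, 41, 29, 52, 34]   # original (sensitive) values
noise_scale = 5.0                 # illustrative noise level

perturbed = [a + random.gauss(0.0, noise_scale) for a in ages]

print("original mean :", sum(ages) / len(ages))
print("perturbed mean:", sum(perturbed) / len(perturbed))
```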

    PGLCM: Efficient Parallel Mining of Closed Frequent Gradual Itemsets

    Numerical data (e.g., DNA microarray data, sensor data) pose a challenging problem for existing frequent pattern mining methods, which can hardly handle them. In this framework, gradual patterns have recently been proposed to extract covariations of attributes, such as: "when X increases, Y decreases". Some algorithms exist for mining frequent gradual patterns, but they cannot scale to real-world databases. We present in this paper GLCM, the first algorithm for mining closed frequent gradual patterns, which provides strong complexity guarantees: the mining time is linear in the number of closed frequent gradual itemsets. Our experimental study shows that GLCM is two orders of magnitude faster than the state of the art, with constant low memory usage. We also present PGLCM, a parallelization of GLCM capable of exploiting multicore processors, with good scale-up properties on complex datasets. These are the first algorithms capable of mining large real-world datasets to discover gradual patterns.
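
    A gradual pattern such as "when X increases, Y decreases" needs a support measure over ordered pairs of records. As an illustration only, and not necessarily the exact measure used by GLCM/PGLCM, the sketch below computes a pair-based support: the fraction of record pairs whose values respect every stated variation. The attribute names and data are invented.

```python
# Pair-based support of a gradual pattern over a small numeric dataset.
# The pattern ("age", "+"), ("risk", "-") reads: when age increases,
# risk decreases. A pair of rows (r1, r2) respects the pattern if every
# attribute varies strictly in the stated direction from r1 to r2.
from itertools import combinations

rows = [
    {"age": 25, "risk": 0.9},
    {"age": 32, "risk": 0.7},
    {"age": 41, "risk": 0.4},
    {"age": 50, "risk": 0.5},
]
pattern = [("age", "+"), ("risk", "-")]

def respects(r1, r2, pattern):
    for attr, direction in pattern:
        delta = r2[attr] - r1[attr]
        if direction == "+" and delta <= 0:
            return False
        if direction == "-" and delta >= 0:
            return False
    return True

def pair_support(rows, pattern):
    pairs = list(combinations(rows, 2))
    ordered = sum(1 for r1, r2 in pairs
                  if respects(r1, r2, pattern) or respects(r2, r1, pattern))
    return ordered / len(pairs)

print(pair_support(rows, pattern))  # 5 of the 6 pairs respect the pattern
```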

    MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

    Matrix factorizations, in which a given data matrix is approximated by a product of two or more factor matrices, are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the 'model order selection problem' of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF), where data, factors, and matrix product are Boolean, has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits: it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
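
    The MDL principle selects the model order under which the factorization compresses the data best: the total cost is the bits needed for the factor matrices plus the bits needed to describe where their Boolean product disagrees with the data. The sketch below uses a deliberately naive encoding (one bit per factor cell plus a list of error positions) just to make that trade-off concrete; it is not the data-to-model encoding developed in the paper.

```python
# Naive MDL score for a Boolean matrix factorization A ≈ B ∘ C, where ∘ is
# the Boolean matrix product. Model cost: one bit per cell of the factors.
# Error cost: the list of cells where B ∘ C disagrees with A.
from math import ceil, log2

def boolean_product(B, C):
    k = len(C)
    return [[int(any(B[i][h] and C[h][j] for h in range(k)))
             for j in range(len(C[0]))] for i in range(len(B))]

def description_length(A, B, C):
    n, m, k = len(A), len(A[0]), len(C)
    P = boolean_product(B, C)
    model_bits = k * (n + m)                                  # 1 bit per factor cell
    n_errors = sum(A[i][j] != P[i][j] for i in range(n) for j in range(m))
    error_bits = n_errors * (ceil(log2(n)) + ceil(log2(m)))   # (row, col) per error
    return model_bits + error_bits

A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]
print(description_length(A, B, C))  # rank-2 factors reproduce A exactly: 12 bits
```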