56 research outputs found

    Holistic Cube Analysis: A Query Framework for Data Insights

    We present Holistic Cube Analysis (HoCA), a framework that augments the capabilities of relational queries for data insights. We first define AbstractCube, a data type defined as a function from RegionFeatures space to relational tables. AbstractCube provides a logical form of data for HoCA operators and their compositions to operate on to analyze the data. This function-as-data modeling allows us to simultaneously capture a space of non-uniform tables on the co-domain of the function, and region space structure on the domain of the function. We describe two HoCA operators, cube crawling and cube join, which are cube-to-cube transformations (i.e., higher-order functions). Cube crawling explores a region subspace, and outputs a cube mapping regions to signal vectors. Cube join, in turn, allows users to meld information in different cubes, which is critical for composition. The cube crawling interface introduces two novel features: (1) Region Analysis Models (RAMs), which allow one to program and organize analysis on a set of data features into a module, and (2) Multi-Model Crawling, which allows one to apply multiple models, potentially on different feature sets, during crawling. These two features, together with cube join and a rich RAM library, allow us to construct succinct HoCA programs to capture a wide variety of data-insight problems in system monitoring, experimentation analysis, and business intelligence. HoCA poses a rich algorithmic design space, such as optimizing crawling performance leveraging region space structure, optimizing cube join performance, and physical designs of cubes. We describe several cube crawling implementations leveraging different foundations (an in-house relational query engine, and Apache Beam), and evaluate their performance characteristics. Finally, we discuss avenues for extending the framework, such as devising more useful HoCA operators. Comment: Establishing core concepts of HoC
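
    The function-as-data idea in this abstract can be illustrated with a minimal Python sketch: a cube is modelled as a function from regions to tables, and a crawl as a cube-to-cube transformation that attaches a signal vector to each region. All names here (`base_cube`, `crawl`, `mean_latency`, the sample rows) are hypothetical illustrations and are not taken from the HoCA paper or its implementations.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable, List, Tuple

Row = Dict[str, Any]
Table = List[Row]
Region = FrozenSet[Tuple[str, str]]   # a set of (feature, value) constraints
Cube = Callable[[Region], Table]      # function-as-data: region -> table

# Hypothetical sample data, used only for this illustration.
ROWS: Table = [
    {"country": "US", "device": "mobile", "latency_ms": 120.0},
    {"country": "US", "device": "desktop", "latency_ms": 80.0},
    {"country": "DE", "device": "mobile", "latency_ms": 100.0},
]


def base_cube(region: Region) -> Table:
    """Map a region to the rows that satisfy all of its constraints."""
    return [row for row in ROWS if all(row[k] == v for k, v in region)]


def mean_latency(table: Table) -> Row:
    """Hypothetical analysis model: turn one table into one signal vector."""
    values = [float(row["latency_ms"]) for row in table]
    mean = sum(values) / len(values) if values else 0.0
    return {"n": float(len(values)), "mean_latency_ms": mean}


def crawl(cube: Cube, regions: Iterable[Region],
          model: Callable[[Table], Row]) -> Cube:
    """Cube-to-cube transformation: the result maps each visited region
    to a one-row table holding the model's signal vector."""
    signals = {region: [model(cube(region))] for region in regions}
    return lambda region: signals.get(region, [])


regions = [
    frozenset({("country", "US")}),
    frozenset({("country", "DE")}),
    frozenset(),                      # the empty constraint set: all rows
]
signal_cube = crawl(base_cube, regions, mean_latency)
print(signal_cube(frozenset({("country", "US")})))
```

    Because `crawl` returns another cube, its output could in principle be fed to further cube-to-cube operators, which is the composition property the abstract emphasises.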

    Data mining by means of generalized patterns

    The thesis is mainly focused on the study and the application of pattern discovery algorithms that aggregate database knowledge to discover and exploit valuable correlations, hidden in the analyzed data, at different abstraction levels. The aim of the research effort described in this work is two-fold: the discovery of associations, in the form of generalized patterns, from large data collections and the inference of semantic models, i.e., taxonomies and ontologies, suitable for driving the mining proces

    Privacy by Design in Data Mining

    Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices is the basis of a diffusion that is often more limited than expected. Moreover, people feel reluctant to provide true personal data, unless it is absolutely necessary. Thus, privacy is becoming a fundamental aspect to take into account when one wants to use, publish and analyze data involving sensitive information. Many recent research works have focused on the study of privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store and analyze complex data which describes human activities in great detail and resolution. As a result, anonymization simply cannot be accomplished by de-identification. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common result obtained is that no general method exists which is capable of both dealing with “generic personal data” and preserving “generic analytical results”. In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. 
Therefore, we propose the privacy-by-design paradigm that sheds a new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework to: a) transform the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantee that the target mining queries can be answered correctly using the transformed data instead of the original ones. This thesis investigates two new research issues which arise in modern Data Mining and Data Privacy: individual privacy protection in data publishing while preserving specific data mining analysis, and corporate privacy protection in data mining outsourcing

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, a joint spatio-temporal modelling is needed, in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at fixed time, to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernov

    Quality and interestingness of association rules derived from data mining of relational and semi-structured data

    Deriving useful and interesting rules from a data mining system is an essential and important task. Problems such as the discovery of random and coincidental patterns or patterns with no significant values, and the generation of a large volume of rules from a database commonly occur. Work on sustaining the interestingness of rules generated by data mining algorithms is actively and constantly being examined and developed. As data mining techniques are data-driven, it is beneficial to affirm the rules using a statistical approach. It is important to establish the ways in which the existing statistical measures and constraint parameters can be effectively utilized and the sequence of their usage. In this thesis, a systematic way is presented to evaluate the association rules discovered from frequent, closed and maximal itemset mining algorithms, and from frequent subtree mining algorithms, including the rules based on induced, embedded and disconnected subtrees. With reference to frequent subtree mining, a new direction is also explored based on utilizing the DSM approach, which is capable of preserving all information from a tree-structured database in a flat data format, consequently enabling the direct application of a wider range of data mining analyses and techniques to tree-structured data. Implications of this approach were investigated and it was found that basing rules on disconnected subtrees can be useful in terms of increasing the accuracy and the coverage rate of the rule set. A strategy is developed that combines data mining and statistical measurement techniques such as sampling, redundancy and contradictive checks, correlation and regression analysis to evaluate the rules. This framework is then applied to real-world datasets that represent diverse characteristics of data/items. 
Empirical results show that with a proper combination of data mining and statistical analysis, the proposed framework is capable of eliminating a large number of non-significant, redundant and contradictive rules while preserving relatively valuable high accuracy rules. Moreover, the results reveal the important characteristics and differences between mining frequent, closed or maximal itemsets and mining frequent subtrees, including the rules based on induced, embedded and disconnected subtrees, as well as the impact of the confidence measure for the prediction and classification task
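
    To make the kind of rule evaluation discussed in this abstract concrete, the sketch below computes three standard interestingness measures (support, confidence and lift) for a single association rule. This is a generic textbook illustration, not the thesis's framework; the toy transactions and the function names `support` and `rule_metrics` are assumptions made for the example.

```python
# Hypothetical toy transaction database, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    s_ac = support(set(antecedent) | set(consequent), transactions)
    confidence = s_ac / s_a if s_a else 0.0
    lift = confidence / s_c if s_c else 0.0
    return {"support": s_ac, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"bread"}, {"milk"}, transactions)
print(metrics)
```

    A lift below 1, as for `bread -> milk` in this toy data, indicates that the antecedent and consequent co-occur less often than independence would predict. Rules like this are natural candidates for the kind of statistical filtering the abstract describes.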

    Event detection in high throughput social media

    Tree algorithms for mining association rules

    With the increasing reliability of digital communication, the falling cost of hardware and increased computational power, the gathering and storage of data has become easier than at any other time in history. Commercial and public agencies are able to hold extensive records about all aspects of their operations. Witness the proliferation of point of sale (POS) transaction recording within retailing, digital storage of census data and computerized hospital records. Whilst the gathering of such data has uses in terms of answering specific queries and allowing visualisation of certain trends, the volumes of data can hide significant patterns that would be impossible to locate manually. These patterns, once found, could provide an insight into customer behaviour, demographic shifts and patient diagnosis hitherto unseen and unexpected. Remaining competitive in a modern business environment, or delivering services in a timely and cost-effective manner for public services, is a crucial part of modern economics. Analysis of the data held by an organisation, by a system that "learns", can allow predictions to be made based on historical evidence. Users may guide the process but essentially the software is exploring the data unaided. The research described within this thesis develops current ideas regarding the exploration of large data volumes. Particular areas of research are the reduction of the search space within the dataset and the generation of rules which are deduced from the patterns within the data. These issues are discussed within an experimental framework which extracts information from binary data

    Front Matter - Soft Computing for Data Mining Applications

    Efficient tools and algorithms for knowledge discovery in large data sets have been devised in recent years. These methods exploit the capability of computers to search huge amounts of data in a fast and effective manner. However, the data to be analyzed is imprecise and afflicted with uncertainty. In the case of heterogeneous data sources such as text, audio and video, the data might moreover be ambiguous and partly conflicting. Besides, patterns and relationships of interest are usually vague and approximate. Thus, in order to make the information mining process more robust or, say, human-like, methods for searching and learning require tolerance towards imprecision, uncertainty and exceptions. Thus, they have approximate reasoning capabilities and are capable of handling partial truth. Properties of the aforementioned kind are typical of soft computing. Soft computing techniques like Genetic

    Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data. The core step of the KDD process is the application of Data Mining (DM) algorithms to efficiently find interesting patterns in large databases. This thesis concerns itself with three inter-related themes: Generalised interaction and rule mining; the incorporation of statistics into novel data mining approaches; and probabilistic frequent pattern mining in uncertain databases. An interaction describes an effect that variables have -- or appear to have -- on each other. Interaction mining is the process of mining structures on variables describing their interaction patterns -- usually represented as sets, graphs or rules. Interactions may be complex, represent both positive and negative relationships, and the presence of interactions can influence another interaction or variable in interesting ways. Finding interactions is useful in domains ranging from social network analysis, marketing, the sciences, e-commerce, to statistics and finance. Many data mining tasks may be considered as mining interactions, such as clustering; frequent itemset mining; association rule mining; classification rules; graph mining; flock mining; etc. Interaction mining problems can have very different semantics, pattern definitions, interestingness measures and data types. Solving a wide range of interaction mining problems at the abstract level, and doing so efficiently -- ideally more efficiently than with specialised approaches, is a challenging problem. This thesis introduces and solves the Generalised Interaction Mining (GIM) and Generalised Rule Mining (GRM) problems. GIM and GRM use an efficient and intuitive computational model based purely on vector valued functions. The semantics of the interactions, their interestingness measures and the type of data considered are flexible components of vectorised frameworks. 
By separating the semantics of a problem from the algorithm used to mine it, the frameworks allow both to vary independently of each other. This makes it easier to develop new methods by focusing purely on a problem's semantics and removing the burden of designing an efficient algorithm. By encoding interactions as vectors in the space (or a sub-space) of samples, they provide an intuitive geometric interpretation that inspires novel methods. By operating in time linear in the number of interesting interactions that need to be examined, the GIM and GRM algorithms are optimal. The use of GRM or GIM provides efficient solutions to a range of problems in this thesis, including graph mining, counting based methods, itemset mining, clique mining, a clustering problem, complex pattern mining, negative pattern mining, solving an optimisation problem, spatial data mining, probabilistic itemset mining, probabilistic association rule mining, feature selection and generation, classification and multiplication rule mining. Data mining is a hypothesis generating endeavour, examining large databases for patterns suggesting novel and useful knowledge to the user. Since the database is a sample, the patterns found should describe hypotheses about the underlying process generating the data. In searching for these patterns, a DM algorithm makes additional hypotheses when it prunes the search space. Natural questions to ask then, are: "Does the algorithm find patterns that are statistically significant?" and "Did the algorithm make significant decisions during its search?". Such questions address the quality of patterns found through data mining and the confidence that a user can have in utilising them. Finally, statistics has a range of useful tools and measures that are applicable in data mining. In this context, this thesis incorporates statistical techniques -- in particular, non-parametric significance tests and correlation -- directly into novel data mining approaches. 
This idea is applied to statistically significant and relatively class correlated rule based classification of imbalanced data sets; significant frequent itemset mining; mining complex correlation structures between variables for feature selection; mining correlated multiplication rules for interaction mining and feature generation; and conjunctive correlation rules for classification. The application of GIM or GRM to these problems leads to efficient and intuitive solutions. Frequent itemset mining (FIM) is a fundamental problem in data mining. While it is usually assumed that the items occurring in a transaction are known for certain, in many applications the data is inherently noisy or probabilistic; such as adding noise in privacy preserving data mining applications, aggregation or grouping of records leading to estimated purchase probabilities, and databases capturing naturally uncertain phenomena. The consideration of existential uncertainty of item(sets) makes traditional techniques inapplicable. Prior to the work in this thesis, itemsets were mined if their expected support was high. This returns only an estimate, ignores the probability distribution of support, provides no confidence in the results, and can lead to scenarios where itemsets are labeled frequent even if they are more likely to be infrequent. Clearly, this is undesirable. This thesis proposes and solves the Probabilistic Frequent Itemset Mining (PFIM) problem, where itemsets are considered interesting if the probability that they are frequent is high. The problem is solved under the possible worlds model and a proposed probabilistic framework for PFIM. Novel and efficient methods are developed for computing an itemset's exact support probability distribution and frequentness probability, using the Poisson binomial recurrence, generating functions, or a Normal approximation. Incremental methods are proposed to answer queries such as finding the top-k probabilistic frequent itemsets. 
A number of specialised PFIM algorithms are developed, with each being more efficient than the last: ProApriori is the first solution to PFIM and is based on candidate generation and testing. ProFP-Growth is the first probabilistic FP-Growth type algorithm and uses a proposed probabilistic frequent pattern tree (Pro-FPTree) to avoid candidate generation. Finally, the application of GIM leads to GIM-PFIM; the fastest known algorithm for solving the PFIM problem. It achieves orders of magnitude improvements in space and time usage, and leads to an intuitive subspace and probability-vector based interpretation of PFIM.
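
    The Poisson binomial recurrence mentioned in this abstract can be sketched in a few lines of Python. The sketch assumes that each transaction contains the itemset independently with a known probability. It shows the general recurrence only, not the thesis's ProApriori, ProFP-Growth or GIM-PFIM algorithms; the function names and the example probabilities are hypothetical.

```python
def support_distribution(probs):
    """Return [P(support = 0), ..., P(support = n)] for an itemset whose
    presence in transaction i is an independent event with probability
    probs[i] (Poisson binomial distribution, built one transaction at a time)."""
    dist = [1.0]                      # with no transactions, support is 0
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)    # itemset absent from this transaction
            nxt[k + 1] += mass * p        # itemset present in this transaction
        dist = nxt
    return dist


def frequentness_probability(probs, min_support):
    """P(support >= min_support), with min_support given as a count."""
    return sum(support_distribution(probs)[min_support:])


# Hypothetical per-transaction probabilities that the itemset is present.
probs = [0.9, 0.5, 0.2]
dist = support_distribution(probs)
expected_support = sum(probs)
print(dist, expected_support, frequentness_probability(probs, 2))
```

    The expected support is a single number, whereas the recurrence yields the whole support distribution, so a frequentness probability can be read off for any threshold. The recurrence takes time quadratic in the number of transactions, which is one reason the abstract also lists generating functions and a Normal approximation as alternatives.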

    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is mentioned in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web