15 research outputs found

    Pattern mining under different conditions

    New requirements and demands on pattern mining arise in modern applications that cannot be fulfilled using conventional methods. For example, in scientific research, scientists are more interested in unknown knowledge, which usually hides in significant but infrequent patterns. However, existing itemset mining algorithms are designed for very frequent patterns. Furthermore, scientists need to repeat an experiment many times to ensure reproducibility. A series of datasets is generated at once, waiting for clustering, and each can contain an unknown number of clusters with various densities and shapes. Using existing clustering algorithms is time-consuming because parameter tuning is necessary for each dataset. Many scientific datasets are extremely noisy: they contain considerably more noise points than in-cluster data points. Most existing clustering algorithms can only handle noise up to a moderate level. Temporal pattern mining is also important in scientific research. Existing temporal pattern mining algorithms only consider point-based events. However, most activities in the real world are interval-based, with a starting and an ending timestamp. This thesis develops novel pattern mining algorithms for various data mining tasks under different conditions. The first part of this thesis investigates the problem of mining less frequent itemsets in transactional datasets. In contrast to existing frequent itemset mining algorithms, this part focuses on itemsets that occur infrequently. The algorithms NIIMiner, RaCloMiner, and LSCMiner are proposed to identify such itemsets efficiently. NIIMiner utilizes the negative itemset tree to extract all patterns that occur less often than a given support threshold in a top-down depth-first manner. RaCloMiner combines existing bottom-up frequent itemset mining algorithms with a top-down itemset mining algorithm to achieve better performance in mining less frequent patterns. 
LSCMiner investigates the problem of mining less frequent closed patterns. The second part of this thesis studies the problem of interval-based temporal pattern mining in the stream environment. Interval-based temporal patterns are sequential patterns in which each event is associated with starting and ending temporal information. The ability to handle interval-based events and stream data is lacking in existing approaches. A novel interval-based temporal pattern mining algorithm for stream data is described in this part. The last part of this thesis studies new problems in clustering on numeric datasets. The first problem tackled in this part is shape alternation adaptivity in clustering. In applications such as scientific data analysis, scientists need to deal with a series of datasets generated from one experiment. Cluster sizes and shapes differ across those datasets. A kNN density-based clustering algorithm, kadaClus, is proposed to provide shape alternation adaptability so that users do not need to tune parameters for each dataset. The second problem studied in this part is clustering in an extremely noisy dataset. Many real-world datasets contain considerably more noise points than in-cluster data points. A novel clustering algorithm, kenClus, is proposed to identify clusters of arbitrary shape in extremely noisy datasets. Both clustering algorithms are kNN-based and require only one parameter, k. In each part, the efficiency and effectiveness of the presented techniques are thoroughly analyzed. Intensive experiments on synthetic and real-world datasets are conducted to show the benefits of the proposed algorithms over conventional approaches.
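    To make the first part's problem statement concrete, the following is a minimal brute-force sketch of what "less frequent itemsets" means: itemsets whose support is below a given threshold. It is not NIIMiner, RaCloMiner, or LSCMiner, and it uses no negative itemset tree; the function name `less_frequent_itemsets`, the parameter `max_support`, and the choice to ignore itemsets that never occur are all assumptions made for illustration.

```python
from itertools import combinations


def less_frequent_itemsets(transactions, max_support):
    """Return {itemset: support} for itemsets with 0 < support < max_support.

    Brute force: every non-empty subset of every transaction is counted,
    so this is only suitable for toy data (exponential in transaction size).
    """
    counts = {}
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                key = frozenset(subset)
                counts[key] = counts.get(key, 0) + 1
    # Keep only the itemsets that occur, but less often than the threshold.
    return {itemset: c for itemset, c in counts.items() if c < max_support}


if __name__ == "__main__":
    # Hypothetical toy dataset, not taken from the thesis.
    toy = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
    for itemset, support in less_frequent_itemsets(toy, max_support=2).items():
        print(sorted(itemset), support)
```

    On the toy data only {b, c} and {a, b, c} fall below the threshold, each with support 1. The search space of such itemsets grows quickly with the number of items, which is presumably why dedicated algorithms are proposed.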

    A Formal Concept Analysis Approach to Association Rule Mining: The QuICL Algorithms

    Association rule mining (ARM) is the task of identifying meaningful implication rules exhibited in a data set. Most research has focused on extracting frequent item (FI) sets and has thus fallen short of the overall ARM objective. FI miners fail to identify the upper covers that are needed to generate a set of association rules whose size can be exploited by an end user. An alternative to FI mining can be found in formal concept analysis (FCA), a branch of applied mathematics. FCA derives a concept lattice whose concepts identify closed FI sets and whose connections identify the upper covers. However, most FCA algorithms construct a complete lattice and therefore include item sets that are not frequent. An iceberg lattice, on the other hand, is a concept lattice whose concepts contain only FI sets. Only three algorithms to construct an iceberg lattice were found in the literature. Given that an iceberg concept lattice provides an analysis tool to succinctly identify association rules, this study investigated additional algorithms to construct an iceberg concept lattice. This report presents the development and analysis of the Quick Iceberg Concept Lattice (QuICL) algorithms. These algorithms provide incremental construction of an iceberg lattice. QuICL uses recursion instead of iteration to navigate the lattice and establish connections, thereby eliminating costly processing incurred by past algorithms. The QuICL algorithms were evaluated against leading FI miners and FCA construction algorithms using benchmarks cited in the literature. Results demonstrate that QuICL provides performance on the order of FI miners yet additionally derives the upper covers. QuICL, when combined with known algorithms to extract a basis of association rules from a lattice, offers a best-known ARM solution. Beyond this, the QuICL algorithms have proved to be very efficient, providing order-of-magnitude gains over other incremental lattice construction algorithms. 
For example, on the Mushroom data set, QuICL completes in less than 3 seconds, whereas past algorithms exceed 200 seconds. On T10I4D100k, QuICL completes in less than 120 seconds, whereas past algorithms approach 10,000 seconds. QuICL is shown to be the best-known all-around incremental lattice construction algorithm. Runtime complexity is shown to be O(l d i), where l is the cardinality of the lattice, d is the average degree of the lattice, and i is a mean function on the frequent item extents.
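    As an illustration of the objects involved, the sketch below enumerates the concepts of an iceberg lattice for a toy data set: each concept's intent is a closed frequent item set, obtained as the intersection of all transactions that contain a candidate. This is a brute-force illustration under stated assumptions, not QuICL; it does not build the lattice connections (upper covers), and the function name `iceberg_concepts` and the requirement `min_support >= 1` are hypothetical choices.

```python
from itertools import combinations


def iceberg_concepts(transactions, min_support):
    """Return {closed_itemset: support} for all frequent closed item sets.

    Assumes min_support >= 1. Exponential in the number of distinct items,
    so only meant for small examples.
    """
    items = sorted(set().union(*transactions))
    concepts = {}
    for size in range(len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            # Extent: the transactions that contain every item of the candidate.
            extent = [t for t in transactions if candidate <= t]
            if len(extent) < min_support:
                continue
            # Intent (closure): the items shared by all transactions in the extent.
            intent = frozenset.intersection(*map(frozenset, extent))
            concepts[intent] = len(extent)
    return concepts


if __name__ == "__main__":
    # Hypothetical toy data set, not one of the benchmarks named above.
    toy = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
    for intent, support in iceberg_concepts(toy, min_support=2).items():
        print(sorted(intent), support)
```

    With a minimum support of 2, the toy data yields three concepts: {a} with support 3, and {a, b} and {a, c} with support 2 each. The item set {b} is frequent but not closed, because every transaction containing b also contains a.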

    Prediction assisted fast handovers for seamless IP mobility

    Word processed copy. Includes bibliographical references (leaves 94-98). This research investigates the techniques used to improve the standard Mobile IP handover process and provide proactivity in network mobility management. Numerous fast handover proposals in the literature have recently adopted a cross-layer approach to enhance movement detection functionality and make terminal mobility more seamless. Such fast handover protocols are dependent on an anticipated link-layer trigger or pre-trigger to perform pre-handover service establishment operations. This research identifies the practical difficulties involved in implementing this type of trigger and proposes an alternative solution that integrates the concept of mobility prediction into a reactive fast handover scheme.

    Improving the Timeliness, Accuracy, and Completeness of Mortality Reporting Using FHIR Apps and Machine Learning

    There are approximately 56 million deaths per year worldwide, with millions happening in the United States. Accurate and timely mortality reporting is essential for gathering this important public health data in order to formulate emergency responses to epidemics and new disease threats, to prevent communicable diseases such as flu, and to determine vital statistics such as life expectancy and mortality trends. However, accurate collection and aggregation of high-quality mortality data remains an ongoing challenge due to issues such as the low average frequency with which physicians perform death certification, inconsistent training in determining the causes of death, complex data flow between the funeral home, the certifying physician and the registrar, and non-standard practices of data acquisition and transmission. We propose a smart application for medical providers at the point of care which will use Fast Healthcare Interoperability Resources (FHIR) to integrate directly with the medical record, provide the practitioner with context for the death, and use machine learning techniques to enable the reporting of an accurate and complete causal chain of events leading to the death. Ph.D.

    Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data. The core step of the KDD process is the application of Data Mining (DM) algorithms to efficiently find interesting patterns in large databases. This thesis concerns itself with three inter-related themes: generalised interaction and rule mining; the incorporation of statistics into novel data mining approaches; and probabilistic frequent pattern mining in uncertain databases. An interaction describes an effect that variables have -- or appear to have -- on each other. Interaction mining is the process of mining structures on variables describing their interaction patterns -- usually represented as sets, graphs or rules. Interactions may be complex and may represent both positive and negative relationships, and the presence of interactions can influence another interaction or variable in interesting ways. Finding interactions is useful in domains ranging from social network analysis, marketing, the sciences and e-commerce to statistics and finance. Many data mining tasks may be considered as mining interactions, such as clustering; frequent itemset mining; association rule mining; classification rules; graph mining; flock mining; etc. Interaction mining problems can have very different semantics, pattern definitions, interestingness measures and data types. Solving a wide range of interaction mining problems at the abstract level, and doing so efficiently -- ideally more efficiently than with specialised approaches -- is a challenging problem. This thesis introduces and solves the Generalised Interaction Mining (GIM) and Generalised Rule Mining (GRM) problems. GIM and GRM use an efficient and intuitive computational model based purely on vector valued functions. The semantics of the interactions, their interestingness measures and the type of data considered are flexible components of vectorised frameworks. 
By separating the semantics of a problem from the algorithm used to mine it, the frameworks allow both to vary independently of each other. This makes it easier to develop new methods by focusing purely on a problem's semantics and removing the burden of designing an efficient algorithm. By encoding interactions as vectors in the space (or a sub-space) of samples, they provide an intuitive geometric interpretation that inspires novel methods. By operating in time linear in the number of interesting interactions that need to be examined, the GIM and GRM algorithms are optimal. The use of GRM or GIM provides efficient solutions to a range of problems in this thesis, including graph mining, counting based methods, itemset mining, clique mining, a clustering problem, complex pattern mining, negative pattern mining, solving an optimisation problem, spatial data mining, probabilistic itemset mining, probabilistic association rule mining, feature selection and generation, classification and multiplication rule mining. Data mining is a hypothesis generating endeavour, examining large databases for patterns suggesting novel and useful knowledge to the user. Since the database is a sample, the patterns found should describe hypotheses about the underlying process generating the data. In searching for these patterns, a DM algorithm makes additional hypotheses when it prunes the search space. Natural questions to ask, then, are: "Does the algorithm find patterns that are statistically significant?" and "Did the algorithm make significant decisions during its search?". Such questions address the quality of patterns found through data mining and the confidence that a user can have in utilising them. Finally, statistics has a range of useful tools and measures that are applicable in data mining. In this context, this thesis incorporates statistical techniques -- in particular, non-parametric significance tests and correlation -- directly into novel data mining approaches. 
This idea is applied to statistically significant and relatively class correlated rule based classification of imbalanced data sets; significant frequent itemset mining; mining complex correlation structures between variables for feature selection; mining correlated multiplication rules for interaction mining and feature generation; and conjunctive correlation rules for classification. The application of GIM or GRM to these problems leads to efficient and intuitive solutions. Frequent itemset mining (FIM) is a fundamental problem in data mining. While it is usually assumed that the items occurring in a transaction are known for certain, in many applications the data is inherently noisy or probabilistic; examples include noise added in privacy preserving data mining applications, aggregation or grouping of records leading to estimated purchase probabilities, and databases capturing naturally uncertain phenomena. The consideration of existential uncertainty of item(sets) makes traditional techniques inapplicable. Prior to the work in this thesis, itemsets were mined if their expected support was high. This returns only an estimate, ignores the probability distribution of support, provides no confidence in the results, and can lead to scenarios where itemsets are labeled frequent even if they are more likely to be infrequent. Clearly, this is undesirable. This thesis proposes and solves the Probabilistic Frequent Itemset Mining (PFIM) problem, where itemsets are considered interesting if the probability that they are frequent is high. The problem is solved under the possible worlds model and a proposed probabilistic framework for PFIM. Novel and efficient methods are developed for computing an itemset's exact support probability distribution and frequentness probability, using the Poisson binomial recurrence, generating functions, or a Normal approximation. Incremental methods are proposed to answer queries such as finding the top-k probabilistic frequent itemsets. 
A number of specialised PFIM algorithms are developed, with each being more efficient than the last: ProApriori is the first solution to PFIM and is based on candidate generation and testing. ProFP-Growth is the first probabilistic FP-Growth type algorithm and uses a proposed probabilistic frequent pattern tree (Pro-FPTree) to avoid candidate generation. Finally, the application of GIM leads to GIM-PFIM, the fastest known algorithm for solving the PFIM problem. It achieves orders of magnitude improvements in space and time usage, and leads to an intuitive subspace and probability-vector based interpretation of PFIM.
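    The frequentness probability mentioned in this abstract can be illustrated with the textbook Poisson binomial recurrence. The sketch below assumes that each transaction contains the itemset independently with a known probability; that independence assumption, the function names, and the toy probabilities are illustrative choices and are not taken from the thesis, whose own algorithms (ProApriori, ProFP-Growth, GIM-PFIM) are not reproduced here.

```python
def support_distribution(probs):
    """Exact distribution of an itemset's support.

    probs[t] is the probability that transaction t contains the itemset,
    with transactions assumed independent. Returns a list dist where
    dist[j] = P(support == j).
    """
    dist = [1.0]  # before any transaction is seen, support is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)  # this transaction lacks the itemset
            nxt[j + 1] += mass * p      # this transaction contains the itemset
        dist = nxt
    return dist


def frequentness_probability(probs, min_support):
    """P(support >= min_support) under the same independence assumption."""
    return sum(support_distribution(probs)[min_support:])


if __name__ == "__main__":
    # Hypothetical per-transaction containment probabilities.
    probs = [0.5, 0.5]
    print(support_distribution(probs))         # [0.25, 0.5, 0.25]
    print(frequentness_probability(probs, 1))  # 0.75
```

    The recurrence takes time quadratic in the number of transactions. Unlike the expected support (here 1.0), the result is a probability that can be compared against a user-chosen confidence level.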

    Generalised Interaction Mining: Probabilistic, Statistical and Vectorised Methods in High Dimensional or Uncertain Databases

    Get PDF
    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, useful and ultimately understandable patterns in data. The core step of the KDD process is the application of Data Mining (DM) algorithms to efficiently find interesting patterns in large databases. This thesis concerns itself with three inter-related themes: Generalised interaction and rule mining; the incorporation of statistics into novel data mining approaches; and probabilistic frequent pattern mining in uncertain databases. An interaction describes an effect that variables have -- or appear to have -- on each other. Interaction mining is the process of mining structures on variables describing their interaction patterns -- usually represented as sets, graphs or rules. Interactions may be complex, represent both positive and negative relationships, and the presence of interactions can influence another interaction or variable in interesting ways. Finding interactions is useful in domains ranging from social network analysis, marketing, the sciences, e-commerce, to statistics and finance. Many data mining tasks may be considered as mining interactions, such as clustering; frequent itemset mining; association rule mining; classification rules; graph mining; flock mining; etc. Interaction mining problems can have very different semantics, pattern definitions, interestingness measures and data types. Solving a wide range of interaction mining problems at the abstract level, and doing so efficiently -- ideally more efficiently than with specialised approaches, is a challenging problem. This thesis introduces and solves the Generalised Interaction Mining (GIM) and Generalised Rule Mining (GRM) problems. GIM and GRM use an efficient and intuitive computational model based purely on vector valued functions. The semantics of the interactions, their interestingness measures and the type of data considered are flexible components of vectorised frameworks. 
By separating the semantics of a problem from the algorithm used to mine it, the frameworks allow both to vary independently of each other. This makes it easier to develop new methods by focusing purely on a problem's semantics and removing the burden of designing an efficient algorithm. By encoding interactions as vectors in the space (or a sub-space) of samples, they provide an intuitive geometric interpretation that inspires novel methods. By operating in time linear in the number of interesting interactions that need to be examined, the GIM and GRM algorithms are optimal. The use of GRM or GIM provides efficient solutions to a range of problems in this thesis, including graph mining, counting based methods, itemset mining, clique mining, a clustering problem, complex pattern mining, negative pattern mining, solving an optimisation problem, spatial data mining, probabilistic itemset mining, probabilistic association rule mining, feature selection and generation, classification and multiplication rule mining. Data mining is a hypothesis generating endeavour, examining large databases for patterns suggesting novel and useful knowledge to the user. Since the database is a sample, the patterns found should describe hypotheses about the underlying process generating the data. In searching for these patterns, a DM algorithm makes additional hypothesis when it prunes the search space. Natural questions to ask then, are: "Does the algorithm find patterns that are statistically significant?" and "Did the algorithm make significant decisions during its search?". Such questions address the quality of patterns found though data mining and the confidence that a user can have in utilising them. Finally, statistics has a range of useful tools and measures that are applicable in data mining. In this context, this thesis incorporates statistical techniques -- in particular, non-parametric significance tests and correlation -- directly into novel data mining approaches. 
This idea is applied to statistically significant and relatively class correlated rule based classification of imbalanced data sets; significant frequent itemset mining; mining complex correlation structures between variables for feature selection; mining correlated multiplication rules for interaction mining and feature generation; and conjunctive correlation rules for classification. The application of GIM or GRM to these problems lead to efficient and intuitive solutions. Frequent itemset mining (FIM) is a fundamental problem in data mining. While it is usually assumed that the items occurring in a transaction are known for certain, in many applications the data is inherently noisy or probabilistic; such as adding noise in privacy preserving data mining applications, aggregation or grouping of records leading to estimated purchase probabilities, and databases capturing naturally uncertain phenomena. The consideration of existential uncertainty of item(sets) makes traditional techniques inapplicable. Prior to the work in this thesis, itemsets were mined if their expected support is high. This returns only an estimate, ignores the probability distribution of support, provides no confidence in the results, and can lead to scenarios where itemsets are labeled frequent even if they are more likely to be infrequent. Clearly, this is undesirable. This thesis proposes and solves the Probabilistic Frequent Itemset Mining (PFIM) problem, where itemsets are considered interesting if the probability that they are frequent is high. The problem is solved under the possible worlds model and a proposed probabilistic framework for PFIM. Novel and efficient methods are developed for computing an itemset's exact support probability distribution and frequentness probability, using the Poisson binomial recurrence, generating functions, or a Normal approximation. Incremental methods are proposed to answer queries such as finding the top-k probabilistic frequent itemsets. 
A number of specialised PFIM algorithms are developed, with each being more efficient than the last: ProApriori is the first solution to PFIM and is based on candidate generation and testing. ProFP-Growth is the first probabilistic FP-Growth type algorithm and uses a proposed probabilistic frequent pattern tree (Pro-FPTree) to avoid candidate generation. Finally, the application of GIM leads to GIM-PFIM; the fastest known algorithm for solving the PFIM problem. It achieves orders of magnitude improvements in space and time usage, and leads to an intuitive subspace and probability-vector based interpretation of PFIM.Knowledge Discovery in Datenbanken (KDD) ist der nicht-triviale Prozess, gültiges, neues, potentiell nützliches und letztendlich verständliches Wissen aus großen Datensätzen zu extrahieren. Der wichtigste Schritt im KDD Prozess ist die Anwendung effizienter Data Mining (DM) Algorithmen um interessante Muster ("Patterns") in Datensätzen zu finden. Diese Dissertation beschäftigt sich mit drei verwandten Themen: Generalised Interaction und Rule Mining, die Einbindung von statistischen Methoden in neue DM Algorithmen und Probabilistic Frequent Itemset Mining (PFIM) in unsicheren Daten. Eine Interaktion ("Interaction") beschreibt den Einfluss, den Variablen aufeinander haben. Interaktionsmining ist der Prozess, Strukturen zwischen Variablen zu finden, die Interaktionsmuster beschreiben. Diese werden gewöhnlicherweise als Mengen, Graphen oder Regeln repräsentiert. Interaktionen können komplex sein und sowohl positive als auch negative Beziehungen repräsentieren. Außerdem kann das Vorhandensein von Interaktionen andere Interaktionen oder Variablen beeinflussen. Interaktionen stellen in Bereichen wie Soziale Netzwerk Analyse, Marketing, Wissenschaft, E-commerce, Statistik und Finanz wertvolle Information dar. 
Viele DM Methoden können als Interaktionsmining betrachtet werden: Zum Beispiel Clustering, Frequent Itemset Mining, Assoziationsregeln, Klassifikationsregeln, Graph Mining, Flock Mining, usw. Interaktionsmining-Probleme können sehr unterschiedliche Semantik, Musterdefinitionen, Interessantheitsmaße und Datentypen erfordern. Interaktionsmining-Probleme auf breiter und abstrakter Basis effizient -- und im Idealfall effizienter als mit spezialisierten Methoden -- zu lösen, ist ein herausforderndes Problem. Diese Dissertation führt das Generalised Interaction Mining (GIM) und das Generalised Rule Mining (GRM) Problem ein und beschreibt Lösungen für diese. GIM und GRM benutzen ein effizientes und intuitives Berechnungsmodell, das einzig und allein auf vektorbasierten Funktionen beruht. Die Semantik der Interaktionen, ihre Interessantheitsmaße und die Datenarten, sind Komponenten in vektorisierten Frameworks. Die Frameworks ermöglichen die Trennung der Problemsemantik vom Algorithmus, so dass beide unabhängig voneinander geändert werden können. Die Entwicklung neuer Methoden wird dadurch erleichtert, da man sich völlig auf die Problemsemantik fokussieren kann und sich nicht mit der Entwicklung problemspezifischer Algorithmen befassen muss. Die Kodierung der Interaktionen als Vektoren im gesamten Raum (oder Teilraum) der Stichproben stellt eine intuitive geometrische Interpretation dar, die neuartige Methoden inspiriert. Die GRM- und GIM- Algorithmen haben lineare Laufzeit in der Anzahl der Interaktionen die geprüft werden müssen und sind somit optimal. 
Applying GRM or GIM in this dissertation yields efficient solutions to a range of problems, including graph mining, enumeration methods, itemset mining, clique mining, a clustering problem, finding complex and negative patterns, solving optimisation problems, spatial data mining, probabilistic itemset mining, probabilistic association rule mining, feature selection and generation, mining of classification and multiplication rules, and many more. Data mining is a procedure that produces hypotheses by finding patterns in large datasets, thereby suggesting new and useful knowledge to the user. Since the database under examination is a result of the data-generating process, the patterns found should yield insights into that process. In searching for these patterns, a DM algorithm makes additional hypotheses whenever parts of the search space are excluded. Two questions are important here: "Does the algorithm find statistically significant patterns?" and "Did the algorithm make significant decisions during the search process?". These questions affect the quality of the patterns and the confidence the user can have in using them. Since statistics also provides a range of useful methods that are applicable to DM, this dissertation combines several statistical methods with new DM algorithms, in particular non-parametric significance tests and correlation. This idea is applied to the following problems: significant and "relatively class correlated" rule-based classification in imbalanced datasets, significant frequent itemset mining, mining of complicated correlation structures between variables for the purpose of feature selection, mining of correlated multiplication rules for the purpose of interaction mining and feature generation, and conjunctive correlation rules for classification. 
Applying GIM and GRM to these problems leads to efficient and intuitive solutions. Frequent itemset mining (FIM) is a fundamental problem in data mining. Although it is generally assumed that the items contained in a transaction are known, in many applications the data are uncertain or probabilistic. Examples include noise added for privacy purposes, the grouping of records that leads to estimated purchase probabilities, and datasets whose provenance is inherently uncertain. Taking uncertain datasets into account prevents the use of traditional methods. Prior to the work in this dissertation, itemsets were sought whose expected number of occurrences is high. However, this method produces only estimates, neglects the probability distribution of the occurrence count, offers no guarantee of the accuracy of the results, and can lead to scenarios in which an itemset is classified as frequent even though it is more probable that it occurs only rarely. Such results are of course undesirable. This dissertation introduces probabilistic frequent itemset mining (PFIM). This approach considers itemsets interesting if there is a high probability that they occur frequently. The solution consists of applying the Possible Worlds Model and the proposed probabilistic framework for PFIM. New and efficient methods are developed to compute the probability distribution of the occurrence count and the frequency distribution of an itemset. These use the Poisson binomial recurrence, generating functions, or a normal approximation. Incremental methods are proposed to answer queries such as "Find the top-k probabilistic frequent itemsets". 
Several PFIM algorithms are developed, with efficiency increasing from one algorithm to the next: ProApriori is the first solution to PFIM and is based on candidate generation and testing. ProFP-Growth is the first probabilistic FP-Growth algorithm. It proposes a Probabilistic Frequent Pattern Tree (Pro-FPTree), which makes candidate generation unnecessary. Finally, applying GIM leads to GIM-PFIM, the fastest known algorithm for solving the PFIM problem. This algorithm achieves time and memory requirements that are better by orders of magnitude, and leads to an intuitive interpretation of PFIM based on subspaces and probability vectors.
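
The difference between the expected occurrence count and the probability of being frequent, as described in the abstract above, can be illustrated with a small sketch. It is not the thesis's ProApriori, ProFP-Growth or GIM-PFIM implementation. It assumes that each transaction contains the itemset independently with a known probability, and it computes the distribution of the occurrence count with a simple Poisson binomial recurrence. All function names, the minimum count `min_count` and the threshold `tau` are hypothetical names chosen for illustration.

```python
from typing import List


def occurrence_distribution(probs: List[float]) -> List[float]:
    """Return [P(count = 0), ..., P(count = n)] for one itemset.

    probs[i] is the (assumed) probability that transaction i contains the
    itemset. Transactions are assumed independent; this is an illustrative
    assumption, not a statement about the thesis's framework.
    """
    dist = [1.0]  # with zero transactions the count is 0 with certainty
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # transaction does not contain the itemset
            nxt[k + 1] += mass * p      # transaction contains the itemset
        dist = nxt
    return dist


def expected_count(probs: List[float]) -> float:
    """Expected number of occurrences (the quantity used by earlier methods)."""
    return sum(probs)


def frequentness_probability(probs: List[float], min_count: int) -> float:
    """P(count >= min_count), summed from the occurrence distribution."""
    return sum(occurrence_distribution(probs)[min_count:])


def is_probabilistically_frequent(probs: List[float], min_count: int, tau: float) -> bool:
    """Hypothetical decision rule: frequent if P(count >= min_count) >= tau."""
    return frequentness_probability(probs, min_count) >= tau


if __name__ == "__main__":
    probs = [0.5, 0.5, 0.5, 0.5]  # made-up per-transaction probabilities
    print(expected_count(probs))               # 2.0
    print(frequentness_probability(probs, 2))  # 0.6875
    print(is_probabilistically_frequent(probs, 2, tau=0.9))  # False
```

In this made-up example the expected count equals the minimum count of 2, yet the itemset reaches that count with probability 0.6875 only, so a rule requiring 90% certainty would reject it. The recurrence takes quadratic time in the number of transactions; the abstract also mentions generating functions and a normal approximation as alternatives.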

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of the manifolds has helped in describing their essential properties and how they vary in space. However, when the manifold is evolving through time, joint spatio-temporal modelling is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernova

    HIDE: User centred Domotic evolution toward Ambient Intelligence

    The visions of Pervasive Computing and Ambient Intelligence (AmI) are still far from being achieved, especially with regard to Domotics and home applications. According to the AmI vision, the most advanced technologies are those that disappear: at maturity, computer technology should become invisible. All the objects surrounding us must possess sufficient computing capacity to interact with users, the surroundings and each other. The entire physical environment in which users are immersed should thus be a hidden computer system equipped with the appropriate software in order to exhibit intelligent behavior. Even though many implementations have started to appear in several contexts, few applications have been made available for the home environment and the general public. This is mainly due to the segmentation of standards and proprietary solutions, which are currently confusing the market with a sparse offer of non-interoperable devices and systems. Although modern houses are equipped with smart technological appliances, very few of these appliances can be seamlessly connected to each other. The objective of this research work is to take steps in these directions by proposing, on the one hand, a software system designed to make today’s heterogeneous, mostly incompatible domotic systems fully interoperable and, on the other hand, a feasible software application able to learn the behavior and habits of home inhabitants in order to actively contribute to anticipating user needs and preventing situations that endanger their health. By applying machine learning techniques, the system offers a complete, ready-to-use practical application that learns through interaction with the user in order to improve quality of life in a technological living environment, such as a house, a smart city and so on. 
The proposed solution, besides making life more comfortable for users without particular needs, represents an opportunity to provide greater autonomy and safety to disabled and elderly occupants, especially those who are critically ill. The prototype has been developed and is currently running at the Pisa CNR laboratory, where a home environment has been faithfully recreated.

    Mobile Ad-Hoc Networks

    Being infrastructure-less and without central administration control, wireless ad-hoc networking is playing an increasingly important role in extending the coverage of traditional wireless infrastructure (cellular networks, wireless LAN, etc.). This book includes state-of-the-art techniques and solutions for wireless ad-hoc networks. It focuses on the following topics in ad-hoc networks: vehicular ad-hoc networks, security and caching, TCP in ad-hoc networks and emerging applications. It aims to provide network engineers and researchers with design guidelines for large-scale wireless ad-hoc networks.

    From Data to Knowledge in Secondary Health Care Databases

    The advent of big data in health care is a topic receiving increasing attention worldwide. In the UK, over the last decade, the National Health Service (NHS) programme for Information Technology has boosted big data by introducing electronic infrastructures in hospitals and GP practices across the country. This ever-growing amount of data promises to expand our understanding of the services, processes and research. Potential benefits include reducing costs, optimisation of services, knowledge discovery, and patient-centred predictive modelling. This thesis will explore the above by studying over ten years' worth of electronic data and systems in a hospital treating over 750 thousand patients a year. The hospital's information systems store routinely collected data, used primarily by health practitioners to support and improve patient care. This raw data is recorded on several different systems but rarely linked or analysed. This thesis explores the secondary uses of such data by undertaking two case studies, one on prostate cancer and another on stroke. The journey from data to knowledge is made in each of the studies by traversing critical steps: data retrieval, linkage, integration, preparation, mining and analysis. Throughout, novel methods and computational techniques are introduced and the value of routinely collected data is assessed. In particular, this thesis discusses in detail the methodological aspects of developing clinical data warehouses from routine heterogeneous data and it introduces methods to model, visualise and analyse the journeys that patients take through care. This work has provided lessons in hospital IT provision, integration, visualisation and analytics of complex electronic patient records and databases and has enabled the use of raw routine data for management decision making and clinical research in both case studies.