
    More than the sum of its parts – pattern mining, neural networks, and how they complement each other

    In this thesis we explore pattern mining and deep learning. Often seen as orthogonal, we show that these fields complement each other and propose to combine them to gain from each other's strengths. We first show how to efficiently discover succinct and non-redundant sets of patterns that provide insight into data beyond conjunctive statements. We leverage the interpretability of such patterns to unveil how and which information flows through neural networks, as well as what characterizes their decisions. Conversely, we show how to combine continuous optimization with pattern discovery, proposing a neural network that directly encodes discrete patterns, which allows us to apply pattern mining at a scale orders of magnitude larger than previously possible. Large neural networks are, however, exceedingly expensive to train, for which ‘lottery tickets’ – small, well-trainable sub-networks in randomly initialized neural networks – offer a remedy. We identify theoretical limitations of strong tickets and overcome them by equipping these tickets with the property of universal approximation. To analyze whether limitations in ticket sparsity are algorithmic or fundamental, we propose a framework to plant and hide lottery tickets. With novel ticket benchmarks we then conclude that the limitation is likely algorithmic, encouraging further developments for which our framework offers the means to measure progress.
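
    The thesis's own planting framework and strong-ticket constructions are not spelled out in this abstract, but the lottery-ticket idea it builds on can be illustrated with the classic iterative magnitude pruning recipe: train, prune the smallest weights, rewind the survivors to their initial values, and repeat. The sketch below is a minimal, generic version of that recipe; the toy PyTorch MLP, the random data, and all hyperparameters are illustrative assumptions, not the setup used in the thesis.

```python
# Minimal sketch of iterative magnitude pruning (IMP) for finding a "lottery
# ticket": train, prune the smallest surviving weights, rewind the survivors to
# their initial values, and repeat. Toy model, data, and hyperparameters are
# illustrative assumptions only.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())  # theta_0, kept for rewinding
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

def train(model, masks, steps=300):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        with torch.no_grad():  # keep pruned weights at exactly zero
            for n, p in model.named_parameters():
                if n in masks:
                    p.mul_(masks[n])

for _ in range(3):  # three rounds of 20% pruning -> about half the weights left
    train(model, masks)
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                alive = p[masks[n].bool()].abs()
                threshold = alive.quantile(0.2)  # drop the smallest 20% of survivors
                masks[n] = (p.abs() > threshold).float() * masks[n]
        model.load_state_dict(init_state)        # rewind to the original initialization
        for n, p in model.named_parameters():
            if n in masks:
                p.mul_(masks[n])                 # re-apply the sparsity mask

train(model, masks)  # the "ticket": a sparse sub-network trained from its original init
print({n: round(m.mean().item(), 2) for n, m in masks.items()})  # fraction of weights kept
```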

    On-Demand Data Integration

    Companies and organizations depend heavily on their data to make informed business decisions. Therefore, guaranteeing high data quality is critical to ensure the reliability of data analysis. Data integration, which aims to combine data acquired from several heterogeneous sources to provide users with a unified and consistent view, plays a fundamental role in enhancing the value of the data at hand. In the past, when data integration involved a limited number of sources, ETL (extract, transform, load) established itself as the most popular paradigm: once collected, raw data is cleaned and then stored in a data warehouse for analysis. Nowadays, big data integration needs to deal with millions of sources; thus, the paradigm is increasingly moving towards ELT (extract, load, transform). A huge amount of raw data is collected and stored directly (e.g., in a data lake), and different users can then transform portions of it according to the task at hand. Hence, novel approaches to data integration need to be explored to address the challenges raised by this paradigm. One of the fundamental building blocks of data integration is entity resolution (ER), which aims at detecting profiles that describe the same real-world entity in order to consolidate them into a single consistent representation. ER is typically employed as an expensive offline cleaning step over the entire dataset before consuming it. Yet, which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing dataset. Similarly, when querying data lakes, we want to transform data on demand and return results in a timely manner. Hence, we propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER focuses the cleaning effort on one entity at a time, according to the priority defined by the user through the ORDER BY clause. For a wide range of applications (e.g., data exploration), a significant amount of time and resources can therefore be saved. Further, duplicates exist not only at the profile level, as in the case of ER, but also at the dataset level. In the ELT scenario, it is common for data scientists to retrieve datasets from the enterprise's data lake, perform transformations for their analysis, and then store the new datasets back into the data lake. Similarly, in Web contexts such as Wikipedia, a table can be duplicated at a given time, with the different copies evolving independently and possibly becoming inconsistent. Automatically detecting duplicate tables would make it possible to keep them consistent through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. While dataset discovery research has developed efficient tools to retrieve unionable or joinable tables, the problem of detecting duplicate tables has been mostly overlooked in the existing literature. To fill this gap, we present Sloth, a framework to efficiently determine the largest overlap (i.e., the largest common subtable) between two tables. The largest overlap makes it possible to quantify the similarity between the two tables and to spot their inconsistencies. BrewER and Sloth represent novel solutions for big data integration in the ELT scenario, fostering on-demand use of available resources and shifting this fundamental task towards a task-driven paradigm.
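
    As a rough illustration of the on-demand idea described above (and not of BrewER's actual algorithm), the sketch below answers a query over dirty records by cleaning one candidate group of duplicates at a time, in the order induced by an optimistic bound on the ORDER BY attribute, and emitting each entity as soon as it is resolved. The toy records, the blocking-by-brand-and-model heuristic, and the minimum-price consolidation rule are assumptions made purely for the example.

```python
# Simplified sketch of on-demand, priority-driven entity resolution: candidate
# duplicate groups are cleaned one at a time, ordered by an optimistic bound on
# the ORDER BY attribute, and results are emitted progressively. Toy data and
# rules are illustrative assumptions, not BrewER's actual algorithm.
import heapq
from itertools import groupby

dirty = [  # (id, brand, model, price) records gathered from several sources
    (1, "acme", "x1", 199.0), (2, "acme", "x1 ", 189.0),
    (3, "acme", "z9", 399.0), (4, "bolt", "k2", 149.0), (5, "bolt", "k2", 151.0),
]

def blocks(records):
    """Group records into candidate entities (same brand + normalized model)."""
    key = lambda r: (r[1], r[2].strip().lower())
    return [list(g) for _, g in groupby(sorted(records, key=key), key=key)]

def resolve(block):
    """Consolidate a block of duplicates into one clean profile (keep the minimum price)."""
    return {"brand": block[0][1], "model": block[0][2].strip().lower(),
            "price": min(r[3] for r in block)}

def query_order_by_price_asc(records, predicate):
    """Emit clean entities satisfying `predicate` in ascending price order, cleaning lazily."""
    # Priority = lower bound on the entity's final price (its cheapest dirty record).
    heap = [(min(r[3] for r in b), i, b) for i, b in enumerate(blocks(records))]
    heapq.heapify(heap)
    while heap:
        _, _, block = heapq.heappop(heap)  # cheapest unresolved entity first
        entity = resolve(block)            # clean it only when it reaches the head
        if predicate(entity):
            yield entity                   # progressive emission, no full cleaning pass

for e in query_order_by_price_asc(dirty, lambda e: e["price"] < 300.0):
    print(e)
```

    Because the consolidation rule here is a minimum, the lower bound used as the priority is exact and entities come out already sorted; with other aggregation functions a real system would have to re-check the bound before emitting.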

    Rule Learning over Knowledge Graphs: A Review

    Compared to black-box neural networks, logic rules express explicit knowledge, can provide human-understandable explanations for reasoning processes, and have found wide application in knowledge graphs and other downstream tasks. As extracting rules manually from large knowledge graphs is labour-intensive and often infeasible, automated rule learning has recently attracted significant interest, and a number of approaches to rule learning for knowledge graphs have been proposed. This survey provides a review of approaches and a classification of state-of-the-art systems for learning first-order logic rules over knowledge graphs. A comparative analysis of the various approaches to rule learning is conducted based on rule language biases, underlying methods, and evaluation metrics. The approaches considered include inductive logic programming (ILP)-based, statistical path generalisation, and neuro-symbolic methods. Moreover, we highlight important and promising application scenarios of rule learning, such as rule-based knowledge graph completion, fact checking, and applications in other research areas.
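
    To make the object of study concrete, the toy sketch below evaluates a single Horn rule over a handful of triples using the standard support and confidence metrics mentioned above; the rule, facts, and entity names are invented for the example and are not taken from any specific system covered by the survey.

```python
# Toy illustration of a first-order Horn rule over a knowledge graph and of the
# standard support/confidence metrics used to score it. The rule and facts are
# generic textbook examples, not drawn from any surveyed system.
kg = {  # (subject, predicate, object) facts
    ("alice", "worksAt", "acmeCorp"), ("acmeCorp", "locatedIn", "berlin"),
    ("alice", "livesIn", "berlin"),
    ("bob", "worksAt", "dataInc"), ("dataInc", "locatedIn", "paris"),
    ("bob", "livesIn", "lyon"),
}

def score_rule(kg):
    """Score the rule  livesIn(X, Y) <- worksAt(X, Z), locatedIn(Z, Y)."""
    support, body = 0, 0
    for (x, p1, z) in kg:
        if p1 != "worksAt":
            continue
        for (z2, p2, y) in kg:
            if p2 == "locatedIn" and z2 == z:
                body += 1                            # body holds for (X=x, Y=y)
                support += (x, "livesIn", y) in kg   # head also holds
    confidence = support / body if body else 0.0
    return support, body, confidence

print(score_rule(kg))  # -> (1, 2, 0.5): the rule fires twice and is correct once
```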

    Association Pattern Analysis for Pattern Pruning, Clustering and Summarization

    Automatic pattern mining from databases and the analysis of the discovered patterns for useful information are important and in great demand in science, engineering and business. Today, effective pattern mining methods, such as association rule mining and pattern discovery, have been developed and widely used in challenging industrial and business applications. These methods attempt to uncover the valuable information trapped in large collections of raw data, and the patterns revealed provide significant and useful information for decision makers. Paradoxically, pattern mining itself can produce such huge amounts of output that it poses a new knowledge management problem: handling the thousands or more patterns discovered from a data set. Unlike raw data, patterns often overlap, entangle and interrelate with each other; their relationships are usually complex and the notion of distance between them is difficult to qualify and quantify. Such phenomena pose great challenges to the existing data mining discipline. In this thesis, the analysis of patterns after their discovery by existing pattern mining methods is referred to as pattern post-analysis, since the patterns to be analyzed have already been discovered. Because of the overwhelmingly huge volume of discovered patterns, it is virtually impossible for a human user to analyze them manually; the valuable information trapped in the data is merely shifted to a large collection of patterns. Hence, methods that automatically analyze the discovered patterns and present the results in a user-friendly manner, that is, pattern post-analysis, are badly needed. This thesis addresses 1) the important factors contributing to the interrelationships among patterns, and hence more accurate measurements of the distances between them; 2) the objective pruning of redundant patterns from the discovered patterns; 3) the objective clustering of the patterns into coherent pattern clusters for better organization; 4) the automatic summarization of each pattern cluster for human interpretation; and 5) the application of pattern post-analysis to large database analysis and data mining. The thesis presents the conceptualization, theoretical formulation, algorithm design and system development of pattern post-analysis for categorical or discrete-valued data. It starts by presenting a natural dual relationship between patterns and data. This relationship furnishes an explicit one-to-one correspondence between a pattern and its associated data and provides a basis for an effective analysis of patterns by relating them back to the data. It then discusses the important factors that differentiate patterns and formulates the notion of distance among patterns using a formal graphical approach. To accurately measure the distances between patterns and their associated data, both the samples and the attributes matched by the patterns are considered; the distance measure between patterns therefore has to account for the differences between their associated data clusters at the attribute-value (i.e. item) level. Furthermore, to capture the degree of variation of the items matched by patterns, entropy-based distance measures are developed that quantify the uncertainty of the matched items. Such distances render an accurate and robust distance measurement between patterns and their associated data. To understand the properties and behaviors of the new distance measures, the mathematical relation between the new distances and the existing sample-matching distances is analytically derived. The new pattern distances based on the dual pattern-data relationship, together with their related concepts, are adapted to pattern pruning, pattern clustering and pattern summarization to furnish an integrated, flexible and generic framework for pattern post-analysis that is able to meet the challenges of today's complex real-world problems. In pattern pruning, the amount of redundancy of a pattern with respect to another pattern is defined at the item level. This definition generalizes classical closed-itemset and maximal-itemset pruning, which define redundancy at the sample level. A new generalized itemset pruning method is developed using the new definition. It includes the closed and maximal itemsets as two extreme special cases and provides a control parameter that lets the user adjust the trade-off between the number of patterns pruned and the amount of information lost after pruning. The mathematical relation between the proposed generalized itemsets and the existing closed and maximal itemsets is also given. In pattern clustering, a dual clustering method, known as simultaneous pattern and data clustering, is developed using two common yet very different types of clustering algorithms: hierarchical clustering and k-means clustering. Hierarchical clustering generates the entire clustering hierarchy but is slow and not scalable, whereas k-means clustering produces only a partition and is therefore fast and scalable; together they cover most real-world requirements in terms of speed and clustering quality. The new clustering method simultaneously clusters patterns as well as their associated data while maintaining an explicit pattern-data relationship, which enables subsequent analysis of individual pattern clusters through their associated data clusters. One important analysis of a pattern cluster is pattern summarization. To summarize each pattern cluster, a subset of representative patterns is selected for the cluster. Again, the representativeness of a pattern is measured at the item level, taking into account how the patterns overlap each other. The proposed method, called AreaCover, extends the well-known RuleCover algorithm, and the relationship between the two methods is given. AreaCover is less prone to yielding large, trivial patterns (which can lead to summaries that are too general and not informative enough), and the resulting summaries are more concise (with fewer duplicated attribute values among summary patterns) and more informative (describing more attribute values in the cluster with longer summary patterns). The thesis also covers the implementation of the major ideas of the pattern post-analysis framework in an integrated software system. It ends with a discussion of experimental results of pattern post-analysis on both synthetic and real-world benchmark data. Compared with existing systems, the new methodology presented in this thesis possesses significant and superior characteristics in pattern post-analysis and decision support.
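
    The thesis's exact entropy-based formulation is not reproduced in this abstract, but the underlying idea of relating patterns back to their matched data and comparing them at both the sample and the item level can be sketched generically. In the illustration below, the particular blend of a Jaccard distance over matched transactions with an item-level entropy gap, as well as the toy transactions and the weighting parameter, are assumptions made for the example rather than the thesis's actual measure.

```python
# Generic sketch of a pattern distance that relates patterns back to the data
# they match, combining a sample-level Jaccard distance with an item-level
# entropy term. Illustrative assumption, not the thesis's actual formulation.
from collections import Counter
from math import log2

transactions = [
    {"milk", "bread", "butter"}, {"milk", "bread"}, {"bread", "butter"},
    {"milk", "butter"}, {"milk", "bread", "jam"},
]

def matched(pattern, transactions):
    """IDs of transactions containing every item of the pattern (its associated data)."""
    return {i for i, t in enumerate(transactions) if pattern <= t}

def item_entropy(ids, transactions):
    """Entropy of the item distribution inside the matched transactions."""
    counts = Counter(item for i in ids for item in transactions[i])
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values()) if total else 0.0

def pattern_distance(p, q, transactions, alpha=0.5):
    """Blend sample-level Jaccard distance with the item-level entropy gap."""
    tp, tq = matched(p, transactions), matched(q, transactions)
    jaccard = 1 - len(tp & tq) / len(tp | tq) if (tp | tq) else 0.0
    entropy_gap = abs(item_entropy(tp, transactions) - item_entropy(tq, transactions))
    return alpha * jaccard + (1 - alpha) * entropy_gap

p, q = {"milk", "bread"}, {"bread", "butter"}
print(pattern_distance(p, q, transactions))
```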

    Learning lost temporal fuzzy association rules

    Fuzzy association rule mining discovers patterns in transactions, such as shopping baskets in a supermarket or Web page accesses by a visitor to a Web site. Temporal patterns can be present in fuzzy association rules because the underlying process generating the data can be dynamic. However, existing solutions may not discover all interesting patterns because of a previously unrecognised problem that is revealed in this thesis: the contextual meaning of fuzzy association rules changes because of the dynamic nature of the data, so a static fuzzy representation and a traditional search method are inadequate. The Genetic Iterative Temporal Fuzzy Association Rule Mining (GITFARM) framework solves the problem by utilising flexible fuzzy representations from a fuzzy rule-based system (FRBS). The combined temporal, fuzzy and itemset space is searched simultaneously with a genetic algorithm (GA) to overcome the problem. The framework transforms the dataset into a graph for efficient searching. The choice of model for the fuzzy representation provides a trade-off between an approximate and a descriptive model. A method for verifying the solution to the hypothesised problem is presented, and the proposed GA-based solution is compared with a traditional approach that uses an exhaustive search method. It is shown how the GA-based solution discovers rules that the traditional approach does not, demonstrating that simultaneously searching for rules and membership functions with a GA is a suitable solution for mining temporal fuzzy association rules. In practice, more knowledge can thus be discovered for making well-informed decisions, knowledge that would otherwise be lost with a traditional approach.
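
    To make the notion of a temporal fuzzy pattern concrete, the sketch below computes the fuzzy support of a two-item pattern with triangular membership functions inside a time window, showing how a pattern that is strong in one period is diluted when support is computed over the whole dataset. The membership parameters, windows, and toy transactions are invented for the illustration; this is not the GITFARM algorithm itself.

```python
# Toy computation of temporal fuzzy support: quantities are fuzzified with
# triangular membership functions and support is averaged over transactions
# inside a time window. All parameters and data are illustrative assumptions.
def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# (timestamp, {item: quantity}) transactions
transactions = [
    (1, {"beer": 6, "crisps": 3}), (2, {"beer": 5, "crisps": 2}),
    (3, {"beer": 1}), (4, {"milk": 2}), (5, {"beer": 6, "crisps": 4}),
]

def fuzzy_support(items_to_sets, window, transactions):
    """Average min-membership of the fuzzy itemset over transactions inside the window."""
    in_window = [(t, basket) for t, basket in transactions if window[0] <= t <= window[1]]
    if not in_window:
        return 0.0
    total = 0.0
    for _, basket in in_window:
        # t-norm (min) over the memberships of each item's quantity in its fuzzy set
        total += min(triangular(basket.get(item, 0), *abc) for item, abc in items_to_sets.items())
    return total / len(in_window)

# "large amount of beer AND medium amount of crisps"
itemset = {"beer": (3, 6, 9), "crisps": (1, 3, 5)}
print(fuzzy_support(itemset, window=(1, 2), transactions=transactions))  # strong early on
print(fuzzy_support(itemset, window=(1, 5), transactions=transactions))  # diluted over all data
```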

    A Survey on Explainable Anomaly Detection

    In the past two decades, most research on anomaly detection has focused on improving the accuracy of the detection, while largely ignoring the explainability of the corresponding methods and thus leaving the explanation of outcomes to practitioners. As anomaly detection algorithms are increasingly used in safety-critical domains, providing explanations for the high-stakes decisions made in those domains has become an ethical and regulatory requirement. Therefore, this work provides a comprehensive and structured survey on state-of-the-art explainable anomaly detection techniques. We propose a taxonomy based on the main aspects that characterize each explainable anomaly detection technique, aiming to help practitioners and researchers find the explainable anomaly detection method that best suits their needs. Comment: Paper accepted by the ACM Transactions on Knowledge Discovery from Data (TKDD) for publication (preprint version).