13 research outputs found

    Mining Top-K Patterns from Binary Datasets in presence of Noise


    From-Below Boolean Matrix Factorization Algorithm Based on MDL

    In recent years, Boolean matrix factorization (BMF) has become an important direction in data analysis. The minimum description length (MDL) principle has been successfully adapted to BMF for model order selection. Nevertheless, a BMF algorithm that achieves good results with respect to the standard BMF quality measures is still missing. In this paper, we propose a novel from-below Boolean matrix factorization algorithm based on formal concept analysis. The algorithm uses the MDL principle as the criterion for factor selection. In various experiments we show that the proposed algorithm outperforms existing state-of-the-art BMF algorithms from several standpoints.

    Low Rank Approximation of Binary Matrices: Column Subset Selection and Generalizations

    Low rank matrix approximation is an important tool in machine learning. Given a data matrix, low rank approximation helps to find factors and patterns and provides concise representations of the data. Research on low rank approximation usually focuses on real matrices. However, in many applications the data are binary (categorical) rather than continuous. This leads to the problem of low rank approximation of a binary matrix. Here we are given a $d \times n$ binary matrix $A$ and a small integer $k$. The goal is to find two binary matrices $U$ and $V$ of sizes $d \times k$ and $k \times n$ respectively, so that the Frobenius norm of $A - UV$ is minimized. There are two models of this problem, depending on the definition of the dot product of binary vectors: the $\mathrm{GF}(2)$ model and the Boolean semiring model. Unlike low rank approximation of a real matrix, which can be solved efficiently by Singular Value Decomposition, approximation of a binary matrix is NP-hard even for $k=1$. In this paper, we consider the problem of Column Subset Selection (CSS), in which one low rank matrix must be formed by $k$ columns of the data matrix. We characterize the approximation ratio of CSS for binary matrices. For the $\mathrm{GF}(2)$ model, we show that the approximation ratio of CSS is bounded by $\frac{k}{2}+1+\frac{k}{2(2^k-1)}$ and that this bound is asymptotically tight. For the Boolean model, it turns out that CSS is no longer sufficient to obtain a bound. We then develop a Generalized CSS (GCSS) procedure in which the columns of one low rank matrix are generated from Boolean formulas operating bitwise on columns of the data matrix. We show that the approximation ratio of GCSS is bounded by $2^{k-1}+1$, and that the exponential dependency on $k$ is inherent. Comment: 38 pages.
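The difference between the two models in the abstract above comes down to how 1 + 1 is evaluated inside the dot product. The following is a minimal Python sketch, not code from the paper: the function names (`binary_matmul`, `mismatches`, `css_ratio_bound_gf2`, `gcss_ratio_bound_boolean`) and the example matrices are illustrative assumptions. It multiplies the same pair of binary factors under both models, counts the entries in which a product differs from a target matrix (for binary matrices this count equals the squared Frobenius norm of the difference), and evaluates the two bound formulas quoted in the abstract.

```python
from fractions import Fraction


def binary_matmul(U, V, model):
    """Multiply binary matrices U (d x k) and V (k x n).

    model="boolean": 1 + 1 = 1 (OR of ANDs, Boolean semiring).
    model="gf2":     1 + 1 = 0 (XOR of ANDs, arithmetic in GF(2)).
    """
    if model not in ("boolean", "gf2"):
        raise ValueError(f"unknown model: {model}")
    d, k, n = len(U), len(V), len(V[0])
    out = [[0] * n for _ in range(d)]
    for i in range(d):
        for j in range(n):
            terms = [U[i][t] & V[t][j] for t in range(k)]
            out[i][j] = int(any(terms)) if model == "boolean" else sum(terms) % 2
    return out


def mismatches(A, B):
    """Number of entries where binary matrices A and B differ."""
    return sum(a != b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def css_ratio_bound_gf2(k):
    """The CSS bound quoted for the GF(2) model: k/2 + 1 + k / (2 (2^k - 1))."""
    return Fraction(k, 2) + 1 + Fraction(k, 2 * (2 ** k - 1))


def gcss_ratio_bound_boolean(k):
    """The GCSS bound quoted for the Boolean model: 2^(k-1) + 1."""
    return 2 ** (k - 1) + 1


# Illustrative rank-2 factors (hypothetical data): the first row of U uses both factors.
U = [[1, 1],
     [1, 0],
     [0, 1]]
V = [[1, 1, 0],
     [0, 1, 1]]

boolean_product = binary_matmul(U, V, "boolean")
gf2_product = binary_matmul(U, V, "gf2")
A = boolean_product  # take the Boolean product as the target matrix

print(boolean_product)             # [[1, 1, 1], [1, 1, 0], [0, 1, 1]]
print(gf2_product)                 # [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
print(mismatches(A, gf2_product))  # 1
print(css_ratio_bound_gf2(2))      # 7/3
print(gcss_ratio_bound_boolean(3)) # 5
```

The two products differ only where the factors overlap: the overlapping entry stays 1 in the Boolean model and cancels to 0 over GF(2).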

    Mining Top-K Patterns from Binary Datasets in presence of Noise

    The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, Web usage logging, etc. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of the most important patterns only. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in presence of noise as the minimization of a novel cost function. In accordance with the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
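The abstract above does not spell out its cost function, so the following Python sketch only illustrates the general MDL-style trade-off it describes. Everything in it is an assumption made for illustration: a pattern is modeled as a pair (row set, column set), its description cost is taken to be the number of rows plus the number of columns, and every cell where the patterns' cover disagrees with the data counts as one unit of noise. The names `covered_cells` and `description_cost` are hypothetical.

```python
def covered_cells(patterns):
    """All (row, column) cells set to 1 by at least one pattern."""
    cells = set()
    for rows, cols in patterns:
        cells.update((r, c) for r in rows for c in cols)
    return cells


def description_cost(data_cells, patterns):
    """Assumed cost: size of the pattern descriptions plus the number of noisy cells.

    data_cells is the set of (row, column) positions holding a 1 in the dataset.
    Noise is the symmetric difference between the data and the patterns' cover.
    """
    pattern_cost = sum(len(rows) + len(cols) for rows, cols in patterns)
    noise_cost = len(data_cells ^ covered_cells(patterns))
    return pattern_cost + noise_cost


# Hypothetical dataset: a 4 x 4 block of ones with a single missing cell at (3, 3).
data = {(r, c) for r in range(4) for c in range(4)} - {(3, 3)}

no_patterns = []
one_noisy_pattern = [({0, 1, 2, 3}, {0, 1, 2, 3})]
two_exact_patterns = [({0, 1, 2}, {0, 1, 2, 3}), ({3}, {0, 1, 2})]

print(description_cost(data, no_patterns))         # 15
print(description_cost(data, one_noisy_pattern))   # 9
print(description_cost(data, two_exact_patterns))  # 11
```

Under these assumptions the single approximate pattern is cheapest: tolerating one noisy cell costs less than describing the data exactly with two patterns.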

    Matrix factorization over dioids and its applications in data mining

    Matrix factorizations are an important tool in data mining, and they have been used extensively for finding latent patterns in data. They often make it possible to separate structure from noise, as well as to considerably reduce the dimensionality of the input matrix. While classical matrix decomposition methods, such as nonnegative matrix factorization (NMF) and singular value decomposition (SVD), have proved to be very useful in data analysis, they are limited by the underlying algebraic structure. NMF, in particular, tends to break patterns into smaller pieces, often mixing them with each other. This happens because overlapping patterns interfere with each other, making it harder to tell them apart. In this thesis we study matrix factorization over algebraic structures known as dioids, which are characterized by the lack of additive inverses (“negative numbers”) and the idempotency of addition (a + a = a). Using dioids makes it easier to separate overlapping features and, in particular, helps to deal with the above-mentioned pattern breaking problem. We consider different types of dioids, ranging from continuous (subtropical and tropical algebras) to discrete (Boolean algebra). Among these, the Boolean algebra is perhaps the best known, and there exist methods that obtain high-quality Boolean matrix factorizations in terms of the reconstruction error. In this work, however, a different objective function is used: the description length of the data, which enables us to obtain compact and highly interpretable results. The tropical and subtropical algebras, on the other hand, are much less known in the data mining field. While they find applications in areas such as job scheduling and discrete event systems, they are virtually unknown in the context of data analysis. We will use them to obtain idempotent nonnegative factorizations that are similar to NMF, but are better at separating the most prominent features of the data.
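The role of idempotent addition can be made concrete with a small Python sketch. It is an illustration, not code from the thesis: the helper `dioid_matmul` and the example matrices are hypothetical, and the sketch assumes the common conventions in which the subtropical algebra is max-times over the nonnegative reals and the tropical algebra is max-plus (some texts use min-plus instead). The same generic matrix product is evaluated with ordinary arithmetic, as in NMF, and with the two dioids.

```python
import operator


def dioid_matmul(B, C, add, mul, zero):
    """Matrix product of B (n x k) and C (k x m) with pluggable addition and multiplication.

    zero is the identity element of add and is used as the initial accumulator.
    """
    n, k, m = len(B), len(C), len(C[0])
    out = []
    for i in range(n):
        row = []
        for j in range(m):
            acc = zero
            for t in range(k):
                acc = add(acc, mul(B[i][t], C[t][j]))
            row.append(acc)
        out.append(row)
    return out


# Hypothetical nonnegative factors whose contributions overlap in the top-left entry.
B = [[1.0, 0.5],
     [0.0, 2.0]]
C = [[2.0, 0.0],
     [1.0, 3.0]]

standard = dioid_matmul(B, C, operator.add, operator.mul, 0.0)        # ordinary (+, *)
subtropical = dioid_matmul(B, C, max, operator.mul, 0.0)              # assumed (max, *)
tropical = dioid_matmul(B, C, max, operator.add, float("-inf"))       # assumed (max, +)

print(standard)     # [[2.5, 1.5], [2.0, 6.0]]
print(subtropical)  # [[2.0, 1.5], [2.0, 6.0]]
print(tropical)     # [[3.0, 3.5], [3.0, 5.0]]

# Idempotency of the dioid addition: a + a = a holds for max, not for ordinary addition.
print(max(2.0, 2.0) == 2.0)  # True
```

In the top-left entry the two overlapping contributions (2.0 and 0.5) are summed by the ordinary product, whereas the max-times product keeps only the dominant one, which is the behavior the abstract credits with separating overlapping features.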

    A Novel Data-Driven Fault Tree Methodology for Fault Diagnosis and Prognosis

    The thesis develops a new methodology for the diagnosis and prognosis of faults in a complex system, called Interpretable Logic Tree Analysis (ILTA), which combines knowledge extraction techniques from knowledge discovery in databases (KDD) with fault tree analysis (FTA). The methodology combines the advantages of both techniques to address the problem of fault diagnosis and prognosis. Although fault trees provide interpretable models for determining the possible causes of a fault, their use for fault diagnosis in an industrial system is limited by the need for expert knowledge to describe the cause-and-effect relationships between internal system processes. It is therefore attractive to exploit the analytical power of fault trees while building them from explicit and unbiased knowledge about fault causality extracted directly from databases. Accordingly, the ILTA methodology works analogously to the logic of the fault tree analysis (FTA) model, but with minimal involvement of experts. This modeling approach follows the logic of experts in representing the hierarchical structure of faults in a complex system. The ILTA methodology is applied to failure risk management by providing two interpretable advanced logic models: a multi-level tree (MILTA) and a multi-level tree over time (ITCA). The MILTA model is designed to accomplish the task of diagnosing failure in complex systems. It is able to decompose a complex fault and graphically model its causal structure in a tree with several levels. As a result, an expert is able to visualize the influence of hierarchical cause-and-effect relationships leading to the main failure. In addition, quantifying these causes by assigning probabilities helps to understand their contribution to the occurrence of system failure. The second model, ITCA, is a logic tree interpretable over time, designed to perform the task of failure prognosis in complex systems. Based on a partition of the data over time, the ITCA model captures the effect of system aging through the evolution of the fault causality structure. Thus, it describes the causal changes resulting from deterioration and aging over the life of the system.