11 research outputs found

    A Study of Boolean Matrix Factorization Under Supervised Settings

    Boolean matrix factorization is a generally accepted approach used in data analysis to explain data. It is commonly used in unsupervised settings, or for data preprocessing in supervised settings. In this paper we study factors under supervised settings. We provide experimental evidence that factors are able to explain not only the data as a whole but also the classes in the data.

    Matrix factorization over dioids and its applications in data mining

    Matrix factorizations are an important tool in data mining, and they have been used extensively for finding latent patterns in data. They often make it possible to separate structure from noise, as well as to considerably reduce the dimensionality of the input matrix. While classical matrix decomposition methods, such as nonnegative matrix factorization (NMF) and singular value decomposition (SVD), have proved very useful in data analysis, they are limited by the underlying algebraic structure. NMF, in particular, tends to break patterns into smaller pieces, often mixing them with each other. This happens because overlapping patterns interfere with each other, making it harder to tell them apart. In this thesis we study matrix factorization over algebraic structures known as dioids, which are characterized by the lack of additive inverses ("negative numbers") and the idempotency of addition (a + a = a). Using dioids makes it easier to separate overlapping features and, in particular, to deal with the pattern-breaking problem mentioned above. We consider different types of dioids, ranging from continuous (subtropical and tropical algebras) to discrete (Boolean algebra). Among these, the Boolean algebra is perhaps the best known, and there exist methods that obtain high-quality Boolean matrix factorizations in terms of the reconstruction error. In this work, however, a different objective function is used: the description length of the data, which enables us to obtain compact and highly interpretable results. The tropical and subtropical algebras, on the other hand, are much less known in the data mining field. While they find applications in areas such as job scheduling and discrete event systems, they are virtually unknown in the context of data analysis. We use them to obtain idempotent nonnegative factorizations that are similar to NMF but are better at separating the most prominent features of the data.
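
    To illustrate the idempotent addition described in this abstract, here is a minimal sketch (not code from the thesis) of a matrix product in which + and * are swapped for a dioid's operations. The function name dioid_matmul and the example matrices are hypothetical. Boolean algebra is modeled as (max, min) on {0, 1}, and the subtropical algebra is assumed here to mean max-times on nonnegative reals.

```python
import operator
from typing import Callable, List

Matrix = List[List[float]]


def dioid_matmul(B: Matrix, C: Matrix,
                 add: Callable[[float, float], float],
                 mul: Callable[[float, float], float],
                 zero: float) -> Matrix:
    """Multiply B (n x k) by C (k x m), using `add` and `mul` in place of + and *."""
    k, m = len(C), len(C[0])
    result = []
    for row in B:
        out_row = []
        for j in range(m):
            acc = zero  # neutral element of `add`
            for l in range(k):
                acc = add(acc, mul(row[l], C[l][j]))
            out_row.append(acc)
        result.append(out_row)
    return result


# Two overlapping binary patterns: the cell at row 1, column 1 is covered by both factors.
B = [[1, 0], [1, 1], [0, 1]]
C = [[1, 1, 0], [0, 1, 1]]

standard = dioid_matmul(B, C, operator.add, operator.mul, 0)
boolean = dioid_matmul(B, C, max, min, 0)  # OR as max, AND as min on {0, 1}

# Max-times product on nonnegative reals (assumed meaning of "subtropical").
subtropical = dioid_matmul([[2.0, 0.5]], [[1.0, 3.0], [4.0, 1.0]],
                           max, operator.mul, 0.0)

print(standard)     # the overlap cell adds up to 2 under ordinary arithmetic
print(boolean)      # the overlap cell stays 1, because 1 OR 1 = 1
print(subtropical)
```

    Under ordinary arithmetic the overlapping cell accumulates to 2, so the product no longer matches the binary data; with idempotent addition the two patterns overlap without interfering.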

    Mobile analytics database summarization using rough set

    Mobile devices support activities on the move and are highly portable, but they have limited resources and storage capacity. This limitation must be taken into account in order to make the most of a mobile device's functionality. Hence, this study provides a data management formulation that supports storing large-scale data by using the Rough Set approach to select the data that carry relevant and useful information. Additionally, the approach is combined with an analytics method to complete the analysis of the stored data, making it easier for users to read the analysis results. Testing uses data from Malaysia's Open Government Data on the Air Pollutant Index (API) to determine how the level of air pollution affects the health and safety of the population. The testing successfully created a summary of the API data, with the Rough Set approach selecting significant data from the main database based on the generated rules. The analysis results for the selected API data are stored as a mobile database and presented in charts, intended to make the data meaningful and the analysis of API conditions easier to understand on the mobile device.

    Generalized Matrix Factorizations as a Unifying Framework for Pattern Set Mining: Complexity Beyond Blocks

    Matrix factorizations are a popular tool to mine regularities from data. There are many ways to interpret the factorizations, but one particularly suited for data mining utilizes the fact that a matrix product can be interpreted as a sum of rank-1 matrices. The factorization of a matrix then becomes the task of finding a small number of rank-1 matrices whose sum is a good representation of the original matrix. Seen this way, it becomes obvious that many problems in data mining can be expressed as matrix factorizations, given suitable definitions of what a rank-1 matrix and a sum of rank-1 matrices mean. This paper develops a unified theory, based on generalized outer product operators, that encompasses many pattern set mining tasks. The focus is on the computational aspects of the theory, studying the computational complexity and approximability of many problems related to generalized matrix factorizations. The results immediately apply to a large number of data mining problems, and hopefully allow future results and algorithms to be generalized as well.
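
    The rank-1 view in this abstract can be sketched as follows. This is an illustrative example, not the paper's formalism: the helper names outer, combine, rank1_terms and sum_of_rank1 are hypothetical, and the two instantiations shown (ordinary arithmetic, and Boolean AND/OR written as min/max on {0, 1}) are only two of the possible choices of outer product and sum.

```python
import operator
from functools import reduce


def outer(u, v, mul):
    """Generalized outer product: a rank-1 matrix built from vectors u and v."""
    return [[mul(a, b) for b in v] for a in u]


def combine(X, Y, add):
    """Element-wise generalized sum of two equally sized matrices."""
    return [[add(x, y) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(X, Y)]


def rank1_terms(B, C, mul):
    """One rank-1 matrix per factor: column l of B combined with row l of C."""
    return [outer([row[l] for row in B], C[l], mul) for l in range(len(C))]


def sum_of_rank1(B, C, mul, add):
    """Rebuild the product of B and C as a generalized sum of rank-1 matrices."""
    return reduce(lambda X, Y: combine(X, Y, add), rank1_terms(B, C, mul))


B = [[1, 0], [1, 1], [0, 1]]
C = [[1, 1, 0], [0, 1, 1]]

terms = rank1_terms(B, C, min)                               # two Boolean rank-1 blocks
boolean_sum = sum_of_rank1(B, C, min, max)                   # AND as min, OR as max
numeric_sum = sum_of_rank1(B, C, operator.mul, operator.add)  # ordinary product

print(terms)
print(boolean_sum)
print(numeric_sum)
```

    Changing only the mul and add arguments changes which kind of factorization is being described, which is the sense in which different pattern mining tasks can share one framework.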

    Multi-purpose exploratory mining of complex data

    Due to the increasing power of data acquisition and data storage technologies, a large number of data sets with complex structure are being collected in the era of data explosion. Rather than being simple representations with low-dimensional numerical features, such data sources range from high-dimensional feature spaces to graph data describing relationships among objects. Many techniques exist in the literature for mining simple numerical data, but only a few approaches address the growing challenge of mining complex data, such as high-dimensional vectors of non-numerical data type, time series data, graphs, and multi-instance data where each object is represented by a finite set of feature vectors. In addition, there are many important data mining tasks for high-dimensional data, such as clustering, outlier detection, dimensionality reduction, similarity search, classification, prediction and result interpretation. Many algorithms have been proposed to solve these tasks separately, although in some cases the tasks are closely related. Detecting and exploiting the relationships among them is another important challenge. This thesis aims to address these challenges in order to gain new knowledge from complex high-dimensional data. We propose several new algorithms that combine different data mining tasks to acquire novel knowledge from complex high-dimensional data. ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data) automatically detects the most relevant overlapping subspace clusters on categorical data. It integrates clustering, feature selection and pattern mining without any input parameters in an information-theoretic way. The next algorithm, MSS (Multiple Subspace Selection), finds multiple low-dimensional subspaces for moderately high-dimensional data, each exhibiting an interesting cluster structure. For better interpretation of the results, MSS visualizes the clusters in multiple low-dimensional subspaces in a hierarchical way. SCMiner (Summarization-Compression Miner) focuses on bipartite graph data and integrates co-clustering, graph summarization, link prediction, and the discovery of the hidden structure of a bipartite graph on the basis of data compression. Finally, we propose a novel similarity measure for multi-instance data. The Probabilistic Integral Metric (PIM) is based on a probabilistic generative model requiring few assumptions. Experiments demonstrate the effectiveness and efficiency of PIM for similarity search (multi-instance data indexing with an M-tree), explorative data analysis and data mining (multi-instance classification). To sum up, we propose algorithms that combine different data mining tasks for complex data with various data types and data structures, in order to discover the novel knowledge hidden in the data.

    A mathematical theory of making hard decisions: model selection and robustness of matrix factorization with binary constraints

    One of the first and most fundamental tasks in machine learning is to group observations within a dataset. Given a notion of similarity, finding those instances which are outstandingly similar to each other has manifold applications. Recommender systems and topic analysis in text data are among the most intuitive examples. The interpretation of the groups, called clusters, is facilitated if the assignment of samples is definite. Especially in high-dimensional data, denoting a degree to which an observation belongs to a specified cluster requires subsequent processing of the model to filter out the most important information. We argue that a good summary of the data provides hard decisions on the following question: how many groups are there, and which observations belong to which clusters? In this work, we contribute to the theoretical and practical background of clustering tasks, addressing one or both aspects of this question. Our overview of state-of-the-art clustering approaches details the challenges of our ambition to provide hard decisions. Based on this overview, we develop new methodologies for two branches of clustering: one concerns the derivation of nonconvex clusters, known as spectral clustering; the other addresses the identification of biclusters, each a set of samples together with similarity-defining features, via Boolean matrix factorization. One of the main challenges in both settings is robustness to noise. Assuming that the issue of robustness is controllable by means of theoretical insights, we take a closer look at those aspects of established clustering methods which lack a theoretical foundation. In the scope of Boolean matrix factorization, we propose a versatile framework for the optimization of matrix factorizations subject to binary constraints. Boolean factorizations in particular have so far been computed by intuitive methods, implementing greedy heuristics which lack quality guarantees for the solutions obtained. In contrast, we propose to build upon recent advances in nonconvex optimization theory. This enables us to provide convergence guarantees to local optima of a relaxed objective, requiring only approximately binary factor matrices. By means of this new optimization scheme, PAL-Tiling, we propose two approaches to automatically determine the number of clusters. One is based on information theory, employing the minimum description length principle, and the other is a novel statistical approach, controlling the false discovery rate. The flexibility of our framework PAL-Tiling enables the optimization of novel factorization schemes. In a different context, where every data point belongs to a pre-defined class, a characterization of the classes may be obtained by Boolean factorizations. However, there are cases where this traditional factorization scheme is not sufficient. Therefore, we propose the integration of another factor matrix, reflecting class-specific differences within a cluster. Our theoretical considerations are complemented by empirical evaluations, showing how our methods combine theoretical soundness with practical advantages.