5,667 research outputs found

    Analyze Large Multidimensional Datasets Using Algebraic Topology

    This paper presents an efficient algorithm for extracting knowledge from high-dimensional, high-complexity datasets using algebraic topology, namely simplicial complexes. Based on the concept of isomorphism of relations, our method turns a relational table into a geometric object (a simplicial complex is a polyhedron). Conceptually, association rule searching thus becomes a geometric traversal problem. By leveraging the core concepts behind simplicial complexes, we use a technique (new to computer science) that improves performance over existing methods and uses far less memory. It was designed and developed with a strong emphasis on scalability, reliability, and extensibility. This paper also investigates the possibility of Hadoop integration and the challenges that come with the framework.
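
The abstract does not spell out how a relational table becomes a simplicial complex, so the following is only a minimal sketch under an assumed encoding: each (column, value) pair is treated as a vertex, each row as a simplex, and the complex is closed under taking faces. The function names `row_to_simplex`, `faces`, and `table_to_complex` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def row_to_simplex(row):
    """Treat each (column, value) pair of a row as a vertex (assumed encoding)."""
    return frozenset(row.items())


def faces(simplex):
    """Yield every non-empty face (subset of vertices) of a simplex."""
    vertices = sorted(simplex)
    for size in range(1, len(vertices) + 1):
        for combo in combinations(vertices, size):
            yield frozenset(combo)


def table_to_complex(rows):
    """Build the set of all simplices generated by the rows of a table."""
    simplices = set()
    for row in rows:
        simplices.update(faces(row_to_simplex(row)))
    return simplices


# Hypothetical two-row table used only for illustration.
rows = [
    {"color": "red", "size": "L"},
    {"color": "red", "size": "M"},
]
complex_ = table_to_complex(rows)
```

Enumerating all faces grows exponentially with the number of columns, so a practical implementation would presumably store only the maximal simplices; the explicit closure is shown here only to make the geometric object concrete.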

    MINING CONCEPT IN BIG DATA

    To make fruitful use of big data, data mining is necessary. There are two well-known methods: one is based on the apriori principle, and the other on the FP-tree. In this project we explore a new approach based on the simplicial complex, a combinatorial form of polyhedron used in algebraic topology. Our approach, like the FP-tree, is top-down; at the same time, it is based on the apriori principle in geometric form, called the closed condition in a simplicial complex. Our method is almost 300 times faster than FP-growth on a real-world database using an SJSU laptop. The database was provided by the hospital of National Taiwan University. It has 65536 transactions and 1257 columns in bit form. Our major work is mining concepts from big text data; this project is the core engine of the concept-based semantic search engine.
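
The apriori principle mentioned above (every subset of a frequent itemset is itself frequent) is exactly the closed condition of a simplicial complex: every face of a simplex belongs to the complex. The sketch below illustrates that principle with a classic bottom-up, level-wise search over columns stored as bitmaps. It is not the authors' top-down method, and the names `column_bitmaps`, `support`, and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations


def column_bitmaps(transactions, items):
    """Store each item as an integer bitmap with one bit per transaction."""
    bitmaps = {item: 0 for item in items}
    for row_index, transaction in enumerate(transactions):
        for item in transaction:
            bitmaps[item] |= 1 << row_index
    return bitmaps


def support(itemset, bitmaps, n_rows):
    """Count the transactions containing every item: AND the bitmaps, count bits."""
    mask = (1 << n_rows) - 1
    for item in itemset:
        mask &= bitmaps[item]
    return bin(mask).count("1")


def frequent_itemsets(transactions, min_support):
    """Level-wise search; a candidate is kept only if all its faces are frequent."""
    items = sorted({item for t in transactions for item in t})
    bitmaps = column_bitmaps(transactions, items)
    n_rows = len(transactions)
    level = [frozenset([i]) for i in items
             if support([i], bitmaps, n_rows) >= min_support]
    result = {}
    while level:
        for itemset in level:
            result[itemset] = support(itemset, bitmaps, n_rows)
        current = set(level)
        # Join pairs of frequent itemsets that differ in exactly one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates
                 if all(frozenset(f) in current
                        for f in combinations(c, len(c) - 1))
                 and support(c, bitmaps, n_rows) >= min_support]
    return result


# Hypothetical toy transactions used only for illustration.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
frequent = frequent_itemsets(transactions, min_support=2)
```

The resulting dictionary maps each frequent itemset to its support; by construction, the keys form a simplicial complex over the items.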

    Clustering Web Concepts Using Algebraic Topology

    In this world of the Internet, data is growing rapidly in both size and dimension. It consists of web pages that represent human thoughts. These thoughts involve concepts and associations, which we can capture. Using mathematics, we can perform meaningful clustering of these pages. This project aims to provide a new problem-solving paradigm in data science, known as algebraic topology. Professor Vasant Dhar, Editor-in-Chief of Big Data (Professor at NYU), defines data science as a generalizable extraction of knowledge from data. The core concept of the semantic-based search engine project developed by my team is to extract high-frequency finite sequences of keywords by association mining. Each frequent finite keyword sequence represents a human concept in a document set. The collective view of such a collection of concepts represents a piece of human knowledge. So this MS project is a data science project. By regarding each keyword as an abstract vertex, a finite sequence of keywords becomes a simplex, and the collection becomes a simplicial complex. Based on this geometric view, a new type of clustering can be performed. If two concepts are connected by an n-simplex, we say that these two simplices are connected. The connected components will be captured by the homology theory of simplicial complexes. The input data for this project are ten thousand files about data mining, downloaded from the IEEE Xplore library. Search engines nowadays deal with large amounts of high-dimensional data. Applying mathematical concepts and measuring the connectivity for ten thousand files will be a real challenge. Since using algebraic topology is a completely new approach, extensive testing has to be performed to verify the results obtained for the homology groups.
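
The number of connected components of a simplicial complex equals the rank of its zeroth homology group, so the clustering described above can be illustrated without computing full homology. The sketch below uses a union-find structure and assumes, as a simplification, that two concepts are connected whenever their keyword sets share a vertex. The function name `connected_components` and the sample keywords are hypothetical.

```python
def connected_components(simplices):
    """Group vertices into connected components using union-find."""
    parent = {}

    def find(vertex):
        parent.setdefault(vertex, vertex)
        while parent[vertex] != vertex:
            # Path halving keeps the trees shallow.
            parent[vertex] = parent[parent[vertex]]
            vertex = parent[vertex]
        return vertex

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    for simplex in simplices:
        vertices = list(simplex)
        if not vertices:
            continue
        find(vertices[0])  # register isolated vertices too
        for vertex in vertices[1:]:
            union(vertices[0], vertex)

    groups = {}
    for vertex in list(parent):
        groups.setdefault(find(vertex), set()).add(vertex)
    return list(groups.values())


# Hypothetical keyword concepts used only for illustration.
concepts = [
    ("data", "mining"),
    ("mining", "association", "rule"),
    ("neural", "network"),
]
components = connected_components(concepts)
```

Here `len(components)` plays the role of the zeroth Betti number; higher homology groups would need boundary matrices and are outside this sketch.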

    Concept Based Document Clustering using a Simplicial Complex, a Hypergraph

    This thesis evaluates the effectiveness of using a combinatorial topology structure (a simplicial complex) for document clustering. It is believed that a simplicial complex better identifies the latent concept space defined by a collection of documents than the use of hypergraphs or human categorization. The complex is constructed using groups of co-occurring words (term associations) identified using traditional data mining methods. Disjoint subsections of the complex (connected components) represent general concepts within the documents’ concept space. Documents clustered to these connected components will produce meaningful groupings. Instead, the most specific concepts (maximal simplices) are used as representative connected components to demonstrate this technique’s effectiveness. Each document in a cluster is compared against its human-assigned category to determine the cluster’s precision. It is shown that this technique is better able to cluster documents than human classifiers.
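
The abstract does not give the assignment rule or the exact precision formula, so the following is a minimal sketch under two assumptions: a document joins the cluster of a maximal simplex when it contains every term of that simplex, and a cluster's precision is the share of its documents that carry the most common human-assigned category. All function names and the sample data are hypothetical.

```python
from collections import Counter


def maximal_simplices(simplices):
    """Keep only the simplices that are not a proper face of another simplex."""
    unique = {frozenset(s) for s in simplices}
    return [s for s in unique if not any(s < other for other in unique)]


def cluster_documents(documents, simplices):
    """Assign each document to every maximal simplex whose terms it contains."""
    clusters = {s: [] for s in maximal_simplices(simplices)}
    for doc_id, words in documents.items():
        for simplex in clusters:
            if simplex <= words:
                clusters[simplex].append(doc_id)
    return clusters


def cluster_precision(doc_ids, categories):
    """Share of documents in the cluster that carry its most common category."""
    if not doc_ids:
        return 0.0
    counts = Counter(categories[d] for d in doc_ids)
    return counts.most_common(1)[0][1] / len(doc_ids)


# Hypothetical documents, term associations, and categories.
documents = {
    "d1": {"data", "mining", "rule"},
    "d2": {"data", "mining"},
    "d3": {"neural", "network"},
}
associations = [{"data", "mining"}, {"data"}, {"neural", "network"}]
categories = {"d1": "A", "d2": "B", "d3": "C"}

clusters = cluster_documents(documents, associations)
precision = cluster_precision(clusters[frozenset({"data", "mining"})], categories)
```

A document may fall into several clusters under this rule; whether the thesis allows overlapping clusters is not stated in the abstract.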

    Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

    In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples---one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network)---that drew on ideas from both areas and that may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems. Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201