Analyze Large Multidimensional Datasets Using Algebraic Topology
This paper presents an efficient algorithm for extracting knowledge from high-dimensionality, high-complexity datasets using algebraic topology, namely simplicial complexes. Based on the concept of isomorphism of relations, our method turns a relational table into a geometric object (a simplicial complex is a polyhedron), so that, conceptually, association rule searching becomes a geometric traversal problem. By leveraging the core concepts behind simplicial complexes, we use a technique that is new in computer science, improves performance over existing methods, and uses far less memory. It was designed and developed with a strong emphasis on scalability, reliability, and extensibility. This paper also investigates the possibility of Hadoop integration and the challenges that come with the framework.
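The idea of reading a relational table as a simplicial complex can be sketched roughly as follows. This is an illustrative sketch only, not the paper's algorithm: it assumes each row becomes a simplex whose vertices are (column, value) pairs, and that the support of an itemset is the number of rows whose simplex contains the corresponding face. The names `row_to_simplex`, `face_support`, and `confidence`, and the toy table, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def row_to_simplex(row):
    """Map one table row to a simplex: a frozenset of (column, value) vertices."""
    return frozenset(row.items())


def face_support(rows, max_vertices=2):
    """Count how many rows contain each face with at most max_vertices vertices."""
    support = Counter()
    for row in rows:
        # Sorting assumes column names and values are comparable (strings here).
        vertices = sorted(row_to_simplex(row))
        for k in range(1, max_vertices + 1):
            for face in combinations(vertices, k):
                support[frozenset(face)] += 1
    return support


def confidence(support, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent, read off face counts."""
    return support[antecedent | consequent] / support[antecedent]


# Hypothetical toy table, used only for illustration.
rows = [
    {"colour": "red", "size": "L"},
    {"colour": "red", "size": "L"},
    {"colour": "blue", "size": "L"},
]
support = face_support(rows)
red = frozenset({("colour", "red")})
large = frozenset({("size", "L")})
rule_confidence = confidence(support, red, large)
```

Enumerating every face is exponential in the number of columns, so a practical method would need to traverse the complex more selectively; the sketch only shows how rule support maps onto shared faces.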
Incremental Entity Resolution from Linked Documents
In many government applications we often find that information about
entities, such as persons, is available in disparate data sources such as
passports, driving licences, bank accounts, and income tax records. Similar
scenarios are commonplace in large enterprises having multiple customer,
supplier, or partner databases. Each data source maintains different aspects of
an entity, and resolving entities based on these attributes is a well-studied
problem. However, in many cases documents in one source reference those in
others; e.g., a person may provide his driving-licence number while applying
for a passport, or vice-versa. These links define relationships between
documents of the same entity (as opposed to inter-entity relationships, which
are also often used for resolution). In this paper we describe an algorithm to
cluster documents that are highly likely to belong to the same entity by
exploiting inter-document references in addition to attribute similarity. Our
technique uses a combination of iterative graph-traversal, locality-sensitive
hashing, iterative match-merge, and graph-clustering to discover unique
entities based on a document corpus. A unique feature of our technique is that
new sets of documents can be added incrementally while having to re-resolve
only a small subset of a previously resolved entity-document collection. We
present performance and quality results on two data-sets: a real-world database
of companies and a large synthetically generated `population' database. We also
demonstrate the benefit of using inter-document references for clustering in the
form of enhanced recall of documents for resolution.
Comment: 15 pages, 8 figures, patented wor
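The role of inter-document references in incremental clustering can be sketched as follows. This is a simplified illustration, not the patented technique: it swaps in a union-find structure for the graph-traversal step and leaves out locality-sensitive hashing, attribute similarity, and match-merge entirely. The class `DocumentClusters` and the document identifiers are hypothetical.

```python
class DocumentClusters:
    """Union-find over document ids; linked documents end up in one cluster."""

    def __init__(self):
        self.parent = {}

    def find(self, doc):
        """Return the representative of doc's cluster, registering doc if new."""
        self.parent.setdefault(doc, doc)
        while self.parent[doc] != doc:
            # Path halving keeps later lookups short.
            self.parent[doc] = self.parent[self.parent[doc]]
            doc = self.parent[doc]
        return doc

    def union(self, a, b):
        """Merge the clusters containing documents a and b."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def add_batch(self, references):
        """Add a batch of (source, target) references; only touched clusters change."""
        for source, target in references:
            self.union(source, target)

    def clusters(self):
        """Return all clusters as sorted lists, in sorted order."""
        groups = {}
        for doc in list(self.parent):
            groups.setdefault(self.find(doc), set()).add(doc)
        return sorted(sorted(group) for group in groups.values())


# Hypothetical identifiers: a passport application that cites a licence number.
resolved = DocumentClusters()
resolved.add_batch([("passport:1", "licence:7")])
# A later batch arrives; earlier clusters are extended, not rebuilt.
resolved.add_batch([("licence:7", "tax:3"), ("passport:2", "bank:9")])
result = resolved.clusters()
```

In this sketch a new batch touches only the clusters its references reach, which loosely mirrors the claim that only a small subset of previously resolved documents needs re-resolving.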
FP-tree and COFI Based Approach for Mining of Multiple Level Association Rules in Large Databases
In recent years, discovery of association rules among itemsets in a large
database has been described as an important database-mining problem. The
problem of discovering association rules has received considerable research
attention and several algorithms for mining frequent itemsets have been
developed. Many algorithms have been proposed to discover rules at a single
concept level. However, mining association rules at multiple concept levels may
lead to the discovery of more specific and concrete knowledge from data. The
discovery of multiple level association rules is very useful in many
applications. In most studies of multiple level association rule
mining, the database is scanned repeatedly, which reduces the efficiency of
the mining process. In this research paper, a new method for discovering multilevel
association rules is proposed. It is based on the FP-tree structure and uses a
co-occurrence frequent item (COFI) tree to find frequent items in a multilevel concept
hierarchy.
Comment: Pages IEEE format, International Journal of Computer Science and
Information Security, IJCSIS, Vol. 7 No. 2, February 2010, USA. ISSN 1947-5500,
http://sites.google.com/site/ijcsis
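To illustrate why an FP-tree avoids repeated database scans, here is a minimal sketch of standard single-level FP-tree construction. It is not the paper's multilevel method and does not build COFI trees; it assumes all items sit at one concept level. The names `FPNode` and `build_fp_tree` and the sample transactions are hypothetical.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, its count, and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree with exactly two passes over the transactions."""
    # Pass 1: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}

    # Pass 2: insert each transaction, frequent items first, ties by name.
    root = FPNode(None)
    for t in transactions:
        ordered = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-counts[item], item),
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, counts


# Hypothetical transactions, used only for illustration.
transactions = [
    ["milk", "bread"],
    ["milk", "bread", "eggs"],
    ["milk"],
    ["bread", "eggs"],
]
root, counts = build_fp_tree(transactions, min_support=2)
```

Because transactions that share a prefix share a path, later mining can work on the tree rather than rescanning the database.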