268,115 research outputs found
Conjoint data mining of structured and semi-structured data
As knowledge management requirements grow, enterprises are becoming increasingly aware of the significance of interlinking business information across structured and semi-structured data sources. This problem has become more important with the growing amount of semi-structured data often found in XML repositories, web logs, biological databases, etc. Effectively creating links between semi-structured and structured data is a challenging and unresolved problem. Once an optimized method has been formulated, the process of data mining can be implemented in a conjoint manner. This paper investigates a way in which this challenging problem can be tackled. The proposed method is experimentally evaluated using a real-world database, and its effectiveness and potential for discovering collective information are demonstrated.
Structured low complexity data mining
Due to the rapidly increasing dimensionality of modern datasets, many classical approximation algorithms have run into severe computational bottlenecks. This has often been referred to as the “curse of dimensionality.” To combat this, low complexity priors have been used, as they enable us to design efficient approximation algorithms which are capable of scaling up to these modern datasets. Typically the reduction in computational complexity comes at the expense of accuracy. However, the tradeoffs have been relatively advantageous to the computational scientist. This is typically referred to as the “blessings of dimensionality.” Solving large underdetermined systems of linear equations has benefited greatly from the sparsity low complexity prior. A priori, solving a large underdetermined system of linear equations is severely ill-posed. However, using a relatively generic class of sampling matrices, assuming a sparsity prior can yield a well-posed linear system of equations. In particular, various greedy iterative approximation algorithms have been developed which can recover and accurately approximate the k most significant atoms in our signal. For many engineering applications, the distribution of the top k atoms is not arbitrary and itself has some further structure. In the first half of the thesis we will be concerned with incorporating some a priori designed weights to allow for structured sparse approximation. We provide performance guarantees and numerically demonstrate how the appropriate use of weights can yield a simultaneous reduction in sample complexity and an improvement in approximation accuracy. In the second half of the thesis we will consider the collaborative filtering problem, specifically the task of matrix completion. The matrix completion problem is likewise severely ill-posed, but with a low rank prior it admits, with high probability, a unique and robust solution via a cadre of convex optimization solvers.
The drawback here is that the solvers enjoy strong theoretical guarantees only in the uniform sampling regime. Building upon recent work on non-uniform matrix completion, we propose a completely expert-free empirical procedure to design optimization parameters in the form of positive weights which allow for the recovery of arbitrarily sampled low rank matrices. We provide theoretical guarantees for these empirically learned weights and present numerical simulations which again show that encoding prior knowledge in the form of weights for optimization problems can yield a simultaneous reduction in sample complexity and an improvement in approximation accuracy.
Ontology of core data mining entities
In this article, we present OntoDM-core, an ontology of core data mining
entities. OntoDM-core defines the most essential data mining entities in a three-layered
ontological structure comprising a specification, an implementation and an application
layer. It provides a representational framework for the description of mining
structured data, and in addition provides taxonomies of datasets, data mining tasks,
generalizations, data mining algorithms and constraints, based on the type of data.
OntoDM-core is designed to support a wide range of applications/use cases, such as
semantic annotation of data mining algorithms, datasets and results; annotation of
QSAR studies in the context of drug discovery investigations; and disambiguation of
terms in text mining. The ontology has been thoroughly assessed following the practices
in ontology engineering, is fully interoperable with many domain resources and
is easy to extend.
Mining Projects from Structured and Unstructured Data
Companies working on safety-critical projects must adhere to strict rules imposed by
the domain, especially when human safety is involved. These projects need to comply with
standard norms and regulations. Thus, all the process steps must be clearly documented in order
to be verifiable for compliance at a later stage by an auditor. Nevertheless, documentation often
comes in the form of manually written textual documents in different formats. Moreover, the project
members use diverse proprietary tools. This makes it difficult for auditors to understand how the
actual project was conducted. My research addresses the project mining problem by exploiting logs
from project-generated artifacts, which come from software repositories used by the project team.
Efficient Mining of Heterogeneous Star-Structured Data
Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph theoretical framework for addressing star-structured co-clustering problems in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star structure. Although a graph partitioning approach has been adopted before to address star-structured heterogeneous complex problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very quick, as it requires a simple solution to a sparse system of overdetermined linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency and stability of the proposed algorithm.