
    Tree mining application to matching of heterogeneous knowledge

    Matching of heterogeneous knowledge sources is of increasing importance in areas such as scientific knowledge management, e-commerce, enterprise application integration, and many emerging Semantic Web applications. Given the desire for knowledge sharing and reuse in these fields, knowledge from different organizations in the same domain often needs to be matched. We propose a knowledge matching method based on our previously developed tree mining algorithms for extracting frequently occurring subtrees from a tree-structured database such as XML. Using this method, the common structure among the different representations can be extracted automatically. Our focus is on knowledge matching at the structural level, and we use a set of example XML schema documents from the same domain to evaluate the method. We discuss some important issues that arise when applying tree mining algorithms to the detection of common document structures. The experiments demonstrate the usefulness of the approach.

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains. Comment: Knowledge-based System

    An approximate search engine for structure

    As the size of structural databases grows, the need for efficiently searching these databases arises. Thanks to previous and ongoing research, searching by attribute-value and by text has become commonplace in these databases. However, searching by topological or physical structure, especially for large databases and especially for approximate matches, is still an art. In this dissertation, efficient search techniques are presented for retrieving trees from a database that are similar to a given query tree. Rooted ordered labeled trees, rooted unordered labeled trees and free trees are considered. Ordered labeled trees are trees in which each node has a label and the left-to-right order among siblings matters. Unordered labeled trees are trees in which the parent-child relationship is significant, but the order among siblings is unimportant. Free trees (unrooted unordered trees) are acyclic graphs. These trees find many applications in bioinformatics, Web log analysis, phyloinformatics, XML processing, etc. Two types of similarity measures are investigated: (i) counting the mismatching paths in the query tree and a data tree, and (ii) measuring the topological relationship between the trees. The proposed approaches include storing the paths of trees in a suffix array, employing hashing techniques to speed up retrieval, and counting the number of up-down operations to move a token from one node to another node in a tree. Various filters for accelerating a search, different strategies for parallelizing these search algorithms and applications of these algorithms to XML and phylogenetic data management are discussed. The proposed techniques have been implemented into a phylogenetic search engine which is fully operational and is available on the World Wide Web. Experimental results on comparing the similarity measures with existing tree metrics and on evaluating the efficiency of the search techniques demonstrate the effectiveness of the search engine. 
Future work includes extending the techniques to other structural data, as well as developing new filters and algorithms for speeding up searching and mining in complex structures
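
    The first similarity measure named in this abstract, counting mismatching paths between a query tree and a data tree, can be illustrated with a minimal sketch. The abstract does not give the exact definition, so the sketch assumes that a "path" is a root-to-leaf sequence of labels and that a mismatch is a query path with no counterpart in the data tree. The function names and the sample trees are hypothetical, and the suffix-array and hashing optimizations mentioned above are not shown.

```python
from collections import Counter

# A tree is a (label, children) pair, where children is a list of trees.
# Sibling order is ignored here, so the sketch treats trees as unordered.


def root_to_leaf_paths(tree, prefix=()):
    """Return a Counter (multiset) of root-to-leaf label paths."""
    label, children = tree
    path = prefix + (label,)
    if not children:
        return Counter([path])
    paths = Counter()
    for child in children:
        paths += root_to_leaf_paths(child, path)
    return paths


def path_mismatch(query, data):
    """Count query paths that have no matching path in the data tree.

    Assumed definition: multiset difference of root-to-leaf paths.
    """
    missing = root_to_leaf_paths(query) - root_to_leaf_paths(data)
    return sum(missing.values())


# Hypothetical example trees: they differ only in one leaf label.
query = ("a", [("b", []), ("c", [("d", [])])])
data = ("a", [("b", []), ("c", [("e", [])])])
mismatches = path_mismatch(query, data)
```

    A lower count means the data tree covers more of the query; ranking database trees by this count gives a simple approximate search.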

    An algorithm for finding the largest approximately common substructures of two trees

    Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing. We consider a substructure of an ordered labeled tree T to be a connected subgraph of T. Given two ordered labeled trees T1 and T2 and an integer d, the largest approximately common substructure problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within edit distance d of U2 and where there does not exist any other substructure V1 of T1 and V2 of T2 such that V1 and V2 satisfy the distance constraint and the sum of the sizes of V1 and V2 is greater than the sum of the sizes of U1 and U2. We present a dynamic programming algorithm to solve this problem, which runs as fast as the fastest known algorithm for computing the edit distance of two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithm, we discuss its application to discovering motifs in multiple RNA secondary structures (which are ordered labeled trees).
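
    The distance constraint in this problem relies on the edit distance between ordered labeled trees. The sketch below is not the paper's algorithm: it is a plain memoized recursion over rightmost roots, assuming unit costs for inserting, deleting and relabeling a node, and it is only practical for small trees. The names `forest_distance`, `tree_distance` and `leaf` are hypothetical.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep the arguments hashable
# so that lru_cache can memoize sub-forest distances.


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered labeled forests."""
    if not f and not g:
        return 0
    if not f:
        # Insert the root of the rightmost tree of g.
        return forest_distance(f, g[:-1] + g[-1][1]) + 1
    if not g:
        # Delete the root of the rightmost tree of f.
        return forest_distance(f[:-1] + f[-1][1], g) + 1
    (v_label, v_children), (w_label, w_children) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + v_children, g) + 1
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two rightmost roots, paying 1 if their labels differ.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (1 if v_label != w_label else 0))
    return min(delete, insert, match)


def tree_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_distance((t1,), (t2,))


def leaf(label):
    return (label, ())


# Hypothetical substructures: u2 is u1 with the inner node "b" deleted.
u1 = ("a", (("b", (leaf("c"),)),))
u2 = ("a", (leaf("c"),))
within_d = tree_distance(u1, u2) <= 1  # distance constraint with d = 1
```

    A pair of substructures U1, U2 is admissible when `tree_distance(U1, U2) <= d`; the problem then asks for the admissible pair with the largest combined size.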

    A data science approach to pattern discovery in complex structures with applications in bioinformatics

    Pattern discovery aims to find interesting, non-trivial, implicit, previously unknown and potentially useful patterns in data. This dissertation presents a data science approach for discovering patterns or motifs from complex structures, particularly complex RNA structures. RNA secondary and tertiary structure motifs are very important in biological molecules, which play multiple vital roles in cells. A lot of work has been done on RNA motif annotation. However, pattern discovery in RNA structure is less studied. In the first part of this dissertation, an ab initio algorithm, named DiscoverR, is introduced for pattern discovery in RNA secondary structures. This algorithm works by representing RNA secondary structures as ordered labeled trees and performs tree pattern discovery using a quadratic time dynamic programming algorithm. The algorithm is able to identify and extract the largest common substructures from two RNA molecules of different sizes, without prior knowledge of locations and topologies of these substructures. One application of DiscoverR is to locate the RNA structural elements in genomes. Experimental results show that this tool complements the currently used approaches for mining conserved structural RNAs in the human genome. DiscoverR can also be extended to find repeated regions in an RNA secondary structure. Specifically, this extended method is used to detect structural repeats in the 3'-untranslated region of a protein kinase gene.

    Situational Assessment using graph comparison

    In strategic operations, the assessment of any given situation is very important and may trigger the development of a mission plan. The mission plan consists of various actions that should be executed in order to successfully mitigate the situation. Before a new mission plan is designed or implemented, the effect of the previous mission plan should be assessed. These mission plans use various sensors to collect data, which can be very large, and aggregate them to obtain detailed information about the situation. In order to implement an effective mission plan, the current situation has to be assessed effectively. Situational assessment for a given situation consists of identifying the current participants and the relationships between them. We propose to model the situation as a graph in which the vertices denote the participants and weighted arcs denote the relationships between participants. As events happen, the situation changes, and so does the graph. Changes in the graph can be dramatic or negligible. We derive the similarity between the two graphs at different moments in time. By doing so, we are able to see the effect of the event that caused the change in the graph structure. We compare the graphs using the concept of the minimum spanning tree. The minimum spanning tree of a graph is a rough estimate of the details of the nodes and the edges of the graph. We therefore propose a new way of assessing a situation and a new way of analyzing the differences between the same set of participants at various intervals of time.
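
    The comparison described in this abstract can be sketched as follows. The abstract does not say how the two minimum spanning trees are compared, so the Jaccard overlap of their edge sets used here is an assumption, as are the function names and the sample participants. The spanning trees are built with Kruskal's algorithm.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm; weighted_edges holds (u, v, weight) triples.

    Returns the tree as a set of undirected edges (frozensets).
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, halving the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = set()
    for u, v, _weight in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components
            parent[root_u] = root_v
            tree.add(frozenset((u, v)))
    return tree


def mst_similarity(nodes, edges_before, edges_after):
    """Jaccard overlap of the two MST edge sets (assumed measure)."""
    before = minimum_spanning_tree(nodes, edges_before)
    after = minimum_spanning_tree(nodes, edges_after)
    if not before and not after:
        return 1.0
    return len(before & after) / len(before | after)


# Hypothetical situation: three participants, one relationship shifts.
participants = ["A", "B", "C"]
snapshot_1 = [("A", "B", 1), ("B", "C", 2), ("A", "C", 3)]
snapshot_2 = [("A", "B", 1), ("B", "C", 5), ("A", "C", 2)]
similarity = mst_similarity(participants, snapshot_1, snapshot_2)
```

    A value near 1 suggests a negligible change between the two moments, and a value near 0 a dramatic one, under the assumed measure.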

    Design and implementation of a cyberinfrastructure for RNA motif search, prediction and analysis

    RNA secondary and tertiary structure motifs play important roles in cells. However, very few web servers are available for RNA motif search and prediction. In this dissertation, a cyberinfrastructure, named RNAcyber, capable of performing RNA motif search and prediction, is proposed, designed and implemented. The first component of RNAcyber is a web-based search engine, named RmotifDB. This web-based tool integrates an RNA secondary structure comparison algorithm with the secondary structure motifs stored in the Rfam database. With a user-friendly interface, RmotifDB provides the ability to search for ncRNA structure motifs in both structural and sequential ways. The second component of RNAcyber is an enhanced version of RmotifDB. This enhanced version combines data from multiple sources, incorporates a variety of well-established structure-based search methods, and is integrated with the Gene Ontology. To display RmotifDB’s search results, a software tool, called RSview, is developed. RSview is able to display the search results in a graphical manner. Finally, RNAcyber contains a web-based tool called Junction-Explorer, which employs a data mining method for predicting tertiary motifs in RNA junctions. Specifically, the tool is trained on solved RNA tertiary structures obtained from the Protein Data Bank, and is able to predict the configuration of coaxial helical stacks and families (topologies) in RNA junctions at the secondary structure level. Junction-Explorer employs several algorithms for motif prediction, including a random forest classification algorithm, a pseudoknot removal algorithm, and a feature ranking algorithm based on the Gini impurity measure. A series of experiments including 10-fold cross-validation has been conducted to evaluate the performance of the Junction-Explorer tool. Experimental results demonstrate the effectiveness of the proposed algorithms and the superiority of the tool over existing methods. 
The RNAcyber infrastructure is fully operational, with all of its components accessible on the Internet
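
    The Gini impurity measure mentioned in this abstract has a standard definition: one minus the sum of squared class proportions. The sketch below shows only that formula. How Junction-Explorer turns it into a feature ranking is not described here, and the function name and class labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels.

    Computed as 1 - sum(p_k ** 2), where p_k is the share of class k.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Hypothetical class labels for a handful of samples.
mixed = gini_impurity(["A", "A", "B", "B"])
pure = gini_impurity(["A", "A", "A", "A"])
```

    In a decision tree or random forest, a split is usually considered useful when it lowers the size-weighted impurity of the child nodes relative to the parent.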

    Efficient Algorithms for Local Forest Similarity

    An ordered labelled tree is a tree where the left-to-right order among siblings is significant. Ordered labelled forests are sequences of ordered labelled trees. Given two ordered labelled forests F and G, the local forest similarity problem is to find two subforests F' and G' of F and G, respectively, such that they are the most similar over all possible F' and G'. In this thesis, we present efficient algorithms for the local forest similarity problem for two types of subforests: sibling subforests and closed subforests. Our algorithms can be used to locate the structural regions in RNA secondary structures, since the secondary structures of RNA molecules can be represented as ordered labelled forests.

    Matching XML schemas by a new tree matching algorithm.

    Schema matching is one of the key operations in XML-based information integration and exchange applications. Automatically or semi-automatically matching XML schemas has attracted considerable attention in academia and industry due to the extensive adoption of XML techniques. One of the most difficult tasks in this problem is to identify structural relations between two XML schemas. This thesis builds an automatic XML schema matching system which generates two types of output: the element mappings and the schema similarity. In this system, an XML schema is modeled as a tree. We propose a new tree matching algorithm to compute the structural relation by extracting the most similar common substructures. The algorithm has been designed to achieve a trade-off between matching optimality and time complexity. The experimental results show that this system can be used to match large XML schemas. Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .W366. Source: Masters Abstracts International, Volume: 43-05, page: 1758. Adviser: Jianguo Lu. Thesis (M.Sc.)--University of Windsor (Canada), 2004

    3D Shape Similarity Through Structural Descriptors

    Due to recent improvements in 3D object acquisition, visualization and modeling techniques, the number of available 3D models is growing steadily, and there is an increasing demand for tools supporting the automatic search for 3D objects and their sub-parts in digital archives. Whilst there are already techniques for rapidly extracting knowledge from massive volumes of texts (like Google [htt]), it is harder to structure, filter, organize, retrieve and maintain archives of digital shapes like images, 3D objects, 3D animations and virtual or augmented reality. This situation suggests that in the future a primary challenge in computer graphics will be how to find models having a similar global and/or local appearance. Shape descriptors, and the methodologies used to compare them, play an important role in achieving this task. For this reason, the first contribution of this thesis is to provide a critical analysis of the most representative geometric and structural shape descriptors with respect to a set of properties that shape descriptors should have. This analysis is targeted at highlighting the differences between descriptors in order to better understand where one descriptor fails and another succeeds. As a second contribution, the thesis investigates the problem of using a structural descriptor for shape comparison purposes. A large class of structural shape descriptors can be easily encoded as directed, acyclic and attributed graphs, so the problem of comparing structural descriptors is approached as a graph matching problem. The techniques used for graph comparison have an exponential computational complexity, and it is therefore necessary to define an algorithmic approximation of the optimal solution. 
The methods for comparing structural descriptors commonly used in the computer graphics community consist of heuristic graph matching algorithms for specific application tasks, while a general approach suitable for incorporating different heuristics across different application tasks is lacking. The second contribution presented in this thesis therefore aims to define a framework that expresses the optimal algorithm for computing the maximal common subgraph in a formalization into which heuristics can be plugged straightforwardly, in order to achieve different approximations of the optimal solution according to the specific case. Implemented heuristics for graph matching that is robust to structural noise in the graphs are discussed and evaluated on sub-part correspondence between similar 3D objects and on shape retrieval with respect to different structural graph descriptors.