Pattern discovery in trees: algorithms and applications to document and scientific data management
Ordered, labeled trees are trees in which each node has a label and the left-to-right order of its children (if it has any) is fixed. Such trees have many applications in vision, pattern recognition, molecular biology and natural language processing.
In this dissertation we present algorithms for finding patterns in ordered labeled trees. Specifically, we study the largest approximately common substructure (LACS) problem for such trees. We consider a substructure of a tree T to be a connected subgraph of T. Given two trees T1 and T2 and an integer d, the LACS problem is to find a substructure U1 of T1 and a substructure U2 of T2 such that U1 is within distance d of U2, and no other pair of substructures V1 of T1 and V2 of T2 satisfies the distance constraint with a combined size greater than that of U1 and U2. The LACS problem is motivated by studies of document and RNA comparison.
We consider two types of distance measures: the general edit distance and a restricted edit distance originating from Selkow. We present dynamic programming algorithms that solve the LACS problem under both distance measures. The algorithms run as fast as the best known algorithms for computing the distance between two trees when the distance allowed in the common substructures is a constant independent of the input trees. To demonstrate the utility of our algorithms, we discuss their application to discovering motifs in multiple RNA secondary structures.
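To make the restricted (Selkow-style) edit distance concrete, the sketch below computes it for small ordered labeled trees represented as (label, children) tuples. This is an illustrative reconstruction under the usual Selkow formulation (deleting or inserting a node removes or adds its whole subtree, and children are aligned by a sequence edit distance), not the dissertation's actual implementation; all names are ours.

```python
def size(t):
    """Number of nodes in a (label, children) tree."""
    _, children = t
    return 1 + sum(size(c) for c in children)

def selkow(t1, t2):
    """Selkow-style restricted edit distance with unit costs.

    Roots are always matched (substitution cost 0 or 1); the child
    subtree sequences are aligned by a string-edit-style DP in which
    deleting/inserting a child costs the size of its whole subtree.
    """
    (l1, c1), (l2, c2) = t1, t2
    m, n = len(c1), len(c2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(c1[i - 1])   # delete subtree
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(c2[j - 1])   # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + size(c1[i - 1]),
                           dp[i][j - 1] + size(c2[j - 1]),
                           dp[i - 1][j - 1] + selkow(c1[i - 1], c2[j - 1]))
    return (0 if l1 == l2 else 1) + dp[m][n]
```

For example, removing one leaf child yields distance 1, and relabeling a root yields distance 1.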
Such an application is an example of scientific data mining. We represent an RNA secondary structure by an ordered labeled tree based on a previously proposed scheme. The patterns in the trees are substructures that can differ in both substitutions and deletions/insertions of nodes. Our techniques incorporate approximate tree matching algorithms and novel heuristics for discovery and optimization. Experimental results obtained by running these algorithms on both generated data and RNA secondary structures demonstrate their good performance: the optimization heuristics speed up the discovery algorithm by a factor of 10, and the optimized approach is 100,000 times faster than the brute-force method.
Finally, we implement our techniques in a graphical toolbox that enables users to find repeated substructures within an RNA secondary structure as well as frequently occurring patterns across multiple RNA secondary structures pertaining to rhinoviruses obtained from the National Cancer Institute. The system is implemented in the C programming language and X Windows, and is fully operational on Sun workstations.
Local Similarity Between Quotiented Ordered Trees
In this paper we propose a dynamic programming algorithm to evaluate local similarity between ordered quotiented trees using a constrained edit scoring scheme. A quotiented tree is a tree defined with an additional equivalence relation on vertices, such that the quotient graph is also a tree. The core of the method relies on two adaptations of an algorithm proposed by Zhang and Shasha [K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems (1989) 1245-1262] for comparing ordered rooted trees. After some preliminary definitions and a description of this tree edit algorithm, we propose extensions to compare two quotiented trees both globally and locally. The latter makes it possible to find the region of highest similarity in each tree. The algorithms are currently being used in genomic analysis to evaluate variability between RNA secondary structures.
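The local-similarity idea that this paper lifts from sequences to trees is easiest to see in its classic sequence form (Smith-Waterman): scores are clamped at zero so an alignment can start and end anywhere, and the best cell gives the most similar local region. The sketch below is that sequence analog under assumed unit-style scores, not the paper's tree algorithm.

```python
def local_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two sequences.

    H[i][j] holds the best score of a local alignment ending at
    a[i-1], b[j-1]; the max(0, ...) reset is what makes it *local*.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match/mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best
```

For instance, "xxabcx" and "yyabcy" share the local region "abc", scoring 6 with these parameters; the tree version replaces sequence positions with subtree regions.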
A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed: convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied, and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. Nevertheless, our study shows that three kernels are clearly superior to the other methods.
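The abstract's observation that a good AUC can coexist with a poor F-score is easy to reproduce on a toy example: AUC only measures ranking, while F-score depends on a decision threshold. The sketch below (illustrative data of our own, not the benchmark's) shows a classifier that ranks perfectly yet scores zero F1 at the default 0.5 threshold.

```python
def auc(y_true, scores):
    """Probability that a random positive is ranked above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(y_true, scores, threshold=0.5):
    """F1 at a fixed decision threshold; defined as 0 when there are no true positives."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(pred, y_true))
    fp = sum(p and not y for p, y in zip(pred, y_true))
    fn = sum(not p and y for p, y in zip(pred, y_true))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# All positive scores exceed all negative scores (perfect ranking, AUC = 1.0),
# but none of them crosses the 0.5 threshold, so F1 = 0.0.
y      = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05, 0.01]
```

This is why the benchmark's corpus-specific threshold/parameter tuning matters so much for the reported F-scores.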
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.
This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, offering unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.
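A common building block in the wrapper-based extraction techniques such surveys cover is pulling field values out of repeated, templated markup. The sketch below shows that idea with Python's standard-library HTML parser; the field name, class attribute convention, and sample markup are hypothetical, and real pages need more robust handling of nesting than this simplified parser does.

```python
from html.parser import HTMLParser

class FieldExtractor(HTMLParser):
    """Collects the text of every element whose class attribute equals `field`.

    Simplified wrapper: assumes target elements are not nested inside
    each other, as in flat templated listings.
    """
    def __init__(self, field):
        super().__init__()
        self.field = field
        self.capture = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("class") == self.field:
            self.capture = True

    def handle_endtag(self, tag):
        self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.values.append(data.strip())

# Hypothetical templated listing:
html = '<ul><li class="title">Graph Mining</li><li class="title">Tree Kernels</li></ul>'
parser = FieldExtractor("title")
parser.feed(html)
# parser.values == ["Graph Mining", "Tree Kernels"]
```

Enterprise-level systems build far more elaborate wrappers, but the template-matching core is the same.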
Online Analysis of Dynamic Streaming Data
This work on "Online Analysis of Dynamic Streaming Data" addresses distance measurement for dynamic, semi-structured data in continuous data streams, with the goal of enabling analyses of these data structures already at runtime. To this end, a formalization of distance computation for static and dynamic trees is introduced and complemented by an explicit treatment of the dynamics of attributes of individual tree nodes. The real-time analysis based on this distance measurement is complemented by density-based clustering, demonstrating applications in clustering, classification, and anomaly detection.
The results of this work are based on a theoretical analysis of the introduced formalization of distance measures for dynamic trees. These analyses are supported by empirical measurements on monitoring data of batch jobs from the batch system of the GridKa data and computing center. The evaluation of the proposed formalization, and of the real-time analysis methods built on it, demonstrates the efficiency and scalability of the approach. It is further shown that considering attributes and attribute statistics is of particular importance for the quality of analysis results on dynamic, semi-structured data. Moreover, the evaluation shows that result quality can be further improved by independently combining several distances. In particular, the results of this work enable the analysis of data that change over time.
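Since the thesis pairs its tree distance with density-based clustering, a generic DBSCAN-style sketch over an arbitrary distance function illustrates how any such distance (including a tree distance) plugs into the clustering step. This is a textbook formulation, not the thesis's implementation; the example uses 1-D points with absolute difference as a stand-in distance.

```python
def dbscan(points, dist, eps, min_pts):
    """Density-based clustering over an arbitrary pairwise distance.

    Returns a label per point: 0, 1, ... for clusters, -1 for noise.
    A point is a core point if it has at least min_pts neighbors
    (including itself) within radius eps.
    """
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cid
        seeds = list(nbrs)
        while seeds:                   # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cid        # border point: attach, don't expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cid
            jn = neighbors(j)
            if len(jn) >= min_pts:
                seeds.extend(jn)
        cid += 1
    return labels

pts = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2, 50.0]
labels = dbscan(pts, lambda a, b: abs(a - b), eps=0.5, min_pts=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

Swapping `dist` for a tree distance is all it takes to cluster streamed tree snapshots the same way.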
Structural Pattern Recognition for Chemical-Compound Virtual Screening
Molecules are naturally shaped as networks, making them ideal for studying by employing their graph representations, where nodes represent atoms and edges represent the chemical bonds. An alternative for this straightforward representation is the extended reduced graph, which summarizes the chemical structures using pharmacophore-type node descriptions to encode the relevant molecular properties. Once we have a suitable way to represent molecules as graphs, we need to choose the right tool to compare and analyze them. Graph edit distance is used to solve the error-tolerant graph matching; this methodology estimates a distance between two graphs by determining the minimum number of modifications required to transform one graph into the other. These modifications (known as edit operations) have an associated edit cost (also known as transformation cost), which must be determined depending on the problem.
This study investigates the effectiveness of a purely graph-based molecular comparison, employing extended reduced graphs and graph edit distance, as a tool for ligand-based virtual screening applications. These applications estimate the bioactivity of a chemical compound using the bioactivity of similar compounds. An essential part of this study focuses on using machine learning and natural language processing techniques to optimize the transformation costs used in molecular comparisons with the graph edit distance.
Overall, this work presents a framework that combines graph reduction and comparison with optimization tools and natural language processing to identify bioactivity similarities in a structurally diverse group of molecules. We confirm the efficiency of this framework with several chemoinformatic tests applied to regression and classification problems over different publicly available datasets.
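Graph edit distance as described above can be made concrete on tiny labeled graphs by exhaustive search: pad both graphs with dummy (epsilon) nodes so deletions and insertions become assignments, then take the cheapest node assignment plus induced edge mismatches. This brute force is only feasible for a handful of nodes and uses illustrative unit costs, not the learned transformation costs the thesis optimizes.

```python
from itertools import permutations

def ged(g1, g2, c_node=1, c_edge=1):
    """Exact graph edit distance between tiny labeled graphs.

    A graph is (labels, edges) with labels a list and edges a set of
    frozenset node-index pairs. Both graphs are padded with epsilon
    nodes; every permutation of the padded node set is scored.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    n1, n2 = len(labels1), len(labels2)
    n = n1 + n2                                   # padded size
    best = float("inf")
    for perm in permutations(range(n)):
        cost = 0
        for i in range(n):
            a = labels1[i] if i < n1 else None        # None = epsilon node
            b = labels2[perm[i]] if perm[i] < n2 else None
            if a is not None and b is not None:
                cost += 0 if a == b else c_node       # substitution
            elif a is not None or b is not None:
                cost += c_node                        # deletion / insertion
        for i in range(n):
            for k in range(i + 1, n):
                e1 = frozenset({i, k}) in edges1
                e2 = frozenset({perm[i], perm[k]}) in edges2
                if e1 != e2:
                    cost += c_edge                    # edge mismatch
        best = min(best, cost)
    return best
```

For example, the paths C-C-O and C-C-N differ by a single node substitution, so their distance is 1; practical screening replaces this enumeration with approximations and problem-specific costs.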
Developing Guidelines for Two-Dimensional Model Review and Acceptance
Two independent modelers ran two hydraulic models, SRH-2D and HEC-RAS 2D. The models were applied to the Lakina River (MP 44 McCarthy Road) and to Quartz Creek (MP 0.7 Quartz Creek Road), which approximately represent straight and bend flow conditions, respectively. We compared the results from the two models for both modelers, including water depth, depth-averaged velocity, and bed shear stress.
We found that the extent and density of survey data were insufficient for Quartz Creek. Neither model was calibrated due to the lack of basic field data (i.e., discharge, water surface elevation, and sediment characteristics). Consequently, we were unable to draw any conclusion about the accuracy of the models.
Concerning the time step and the equation set (simplified or full) used to solve the momentum equation in HEC-RAS 2D, we found that the minimum time step allowed by the model must be used when the diffusion-wave equation is selected; a larger time step can be used when the full momentum equation is used.
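Time-step limits like those above are commonly reasoned about through the Courant condition for shallow-water flow, where the stable step scales with cell size over wave celerity. The sketch below is that general guidance only, with hypothetical numbers; it is not the report's specific rule for either equation set.

```python
import math

def max_time_step(dx, velocity, depth, courant=1.0, g=9.81):
    """Largest stable time step from the Courant condition.

    dx: cell size (m); velocity: depth-averaged velocity (m/s);
    depth: water depth (m); courant: target Courant number.
    Celerity is |v| + sqrt(g*h) for shallow-water flow.
    """
    celerity = abs(velocity) + math.sqrt(g * depth)
    return courant * dx / celerity

# Hypothetical cell: 10 m cells, 2 m/s flow, 4 m deep, Courant number 1.
dt = max_time_step(10.0, 2.0, 4.0)
# dt is about 1.21 s
```

Solvers that tolerate larger Courant numbers (as the full momentum solver does here relative to diffusion wave) correspondingly tolerate larger steps.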
We developed a set of guidelines for reviewing model results, and developed and delivered a two-day training workshop on the two models for ADOT&PF hydraulic engineers.
Geometric, Feature-based and Graph-based Approaches for the Structural Analysis of Protein Binding Sites: Novel Methods and Computational Analysis
This thesis considers protein binding sites. To extract information from the space of protein binding sites, these sites must first be mapped onto a mathematical space, for example by representing them as vectors, graphs, or point clouds. To impose structure on this mathematical space, a distance measure is required, which is introduced in this thesis. This distance measure can then be used to extract information by means of data mining techniques.
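One standard way to put a distance on point-cloud representations like those mentioned above is the Hausdorff distance: each cloud's worst-case nearest-neighbor distance to the other. The sketch below is that textbook measure as an illustration, not the specific distance the thesis introduces.

```python
def hausdorff(A, B):
    """Hausdorff distance between two point clouds (tuples of coordinates)."""
    def d(p, q):
        # Euclidean distance between two points
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) ** 0.5

    def directed(X, Y):
        # worst-case distance from a point of X to its nearest point in Y
        return max(min(d(x, y) for y in Y) for x in X)

    return max(directed(A, B), directed(B, A))

# Two small 2-D clouds: the farthest mismatched point (3, 0) is
# 2 units from its nearest counterpart, so the distance is 2.0.
print(hausdorff([(0, 0), (1, 0)], [(0, 0), (3, 0)]))
```

Any such distance turns the space of binding-site representations into a metric space on which clustering and other data mining techniques can operate.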