426 research outputs found

    Effective Math-Aware Ad-Hoc Retrieval based on Structure Search and Semantic Similarities

    Get PDF
    Despite the prevalence of digital scientific and educational contents on the Internet, only a few search engines are capable to retrieve them efficiently and effectively. The main challenge in freely searching scientific literature arises from the presence of structured math formulas and their heterogeneous and contextually important surrounding words. This thesis introduces an effective math-aware, ad-hoc retrieval model that incorporates structure search and semantic similarities. Transformer-based neural retrievers have been adopted to capture additional semantics using domain-adapted supervised retrieval. To enable structure search, I suggest an unsupervised retrieval model that can filter potential mathematical formulas based on structure similarity. This similarity is determined by measuring the largest common substructure(s) in a formula tree representation, known as the Operator Tree (OPT). The structure matching is approximated by employing maximum matching of path-based structure features. The proposed structure similarity measurement can be tailored based on the desired effectiveness and efficiency trade-offs. It may consider various node types, such as operators and operands, and accommodate different numbers of common subtrees with varying weights. In addition to structure similarity, this unsupervised model also captures symbol substitutions through a greedy matching algorithm applied to the matched substructure(s). To achieve efficient structure search, I introduce a dynamic pruning algorithm to the problem of structure retrieval. The proposed retrieval algorithm efficiently identifies the maximum common subtree among formula candidates and safely eliminates potential structure matches that exceed a dynamic threshold. To accomplish this, three rank-safe pruning strategies are suggested and compared against exhaustive search baselines. Additionally, more aggressive thresholding policies are proposed to balance effectiveness with further speed improvements. A novel hierarchical inverted index has been implemented. This index is designed to be compatible with traditional information retrieval (IR) infrastructure and optimization techniques. To capture other semantic similarities, I have incorporated neural retrievers into a hybrid setting with structure search. This approach has achieved the state-of-the-art effectiveness in recent math information retrieval tasks. In comparison to strict and unsupervised matching, I have found that supervised neural retrievers are able to capture additional semantic similarities in a highly complementary manner. In order to learn effective representations in heterogeneous math contents, I have proposed a novel pretraining architecture that can improve the contextual awareness between math and its surrounding texts. This pretraining scheme generates effective downstream single-vector representations, eliminating the efficiency bottleneck from using multi-vector dense representations. In the end, the thesis examines future directions, specifically the integration of recent advancements in language modeling. This includes incorporating ongoing exciting developments of large language models for improved math information retrieval. A preliminary evaluation has been conducted to assess the impact of these advancements

    Visual Structure Editing of Math Formulas

    Get PDF
    Math formulas can be large and complex resulting in correspondingly large and complex LaTeX math strings for expressing them. We design operations to visually edit the typeset LaTeX formulas. The operations are invoked via the formula\u27s control points, which are created as a way to specify an operation associated with the point\u27s location relative to a symbol in the formula. At the control points, formulas can be extended in multiple ways, LaTeX can be inserted locally by typing, an existing formula can be inserted, or part of the formula itself can be moved to that point. Parts of formulas can be selected by clicking on a symbol or dragging a rectangle over an area in the formula, and the subtree for the selection can be replaced, deleted, moved to another point in the formula, or lifted out of the formula into a chip floating above the canvas. Formula chips can be used as arguments to operations, including a set of existing formulas provided in a symbol palette. Operations can be performed either by making a selection, selecting a control point operation, and then specifying an argument, or by dragging an argument to one of the control points in the formula. We perform an online formula editing experiment to examine if these visual editing operations can be used to reduce the time and actions spent in order to make edits to formulas. With 35 participants completing 18 formula editing tasks split between 3 input conditions of LaTeX only, Visual only, or LaTeX and Visual, we find that on average participants spend the least amount of time on the editing tasks when both editing capabilities are available

    Formal methods applied to the analysis of phylogenies: Phylogenetic model checking

    Get PDF
    Los árboles filogenéticos son abstracciones útiles para modelar y caracterizar la evolución de un conjunto de especies o poblaciones respecto del tiempo. La proposición, verificación y generalización de hipótesis sobre un árbol filogenético inferido juegan un papel importante en el estudio y comprensión de las relaciones evolutivas. Actualmente, uno de los principales objetivos científicos es extraer o descubrir los mensajes biológicos implícitos y las propiedades estructurales subyacentes en la filogenia. Por ejemplo, la integración de información genética en una filogenia ayuda al descubrimiento de genes conservados en todo o parte del árbol, la identificación de posiciones covariantes en el ADN o la estimación de las fechas de divergencia entre especies. Consecuentemente, los árboles ayudan a comprender el mecanismo que gobierna la deriva evolutiva. Hoy en día, el amplio espectro de métodos y herramientas heterogéneas para el análisis de filogenias enturbia y dificulta su utilización, además del fuerte acoplamiento entre la especificación de propiedades y los algoritmos utilizados para su evaluación (principalmente scripts ad hoc). Este problema es el punto de arranque de esta tesis, donde se analiza como solución la posibilidad de introducir un entorno formal de verificación de hipótesis que, de manera automática y modular, estudie la veracidad de dichas propiedades definidas en un lenguaje genérico e independiente (en una lógica formal asociada) sobre uno de los múltiples softwares preparados para ello. La contribución principal de la tesis es la propuesta de un marco formal para la descripción, verificación y manipulación de relaciones causales entre especies de forma independiente del código utilizado para su valoración. Para ello, exploramos las características de las técnicas de model checking, un paradigma en el que una especificación expresada en lógica temporal se verifica con respecto a un modelo del sistema que representa una implementación a un cierto nivel de detalle. Se ha aplicado satisfactoriamente en la industria para el modelado de sistemas y su verificación, emergiendo del ámbito de las ciencias de la computación. Las contribuciones concretas de la tesis han sido: A) La identificación e interpretación de los árboles filogeneticos como modelos de la evolución, adaptados al entorno de las técnicas de model checking. B) La definición de una lógica temporal que captura las propiedades filogenéticas habituales junto con un método de construcción de propiedades. C) La clasificación de propiedades filogenéticas, identificando categorías de propiedades según estén centradas en la estructura del árbol, en las secuencias o sean híbridas. D) La extensión de las lógicas y modelos para contemplar propiedades cuantitativas de tiempo, probabilidad y de distancias. E) El desarrollo de un entorno para la verificación de propiedades booleanas, cuantitativas y paramétricas. F) El establecimiento de los principios para la manipulación simbolica de objetos filogenéticos, p. ej., clados. G) La explotación de las herramientas de model checking existentes, detectando sus problemas y carencias en el campo de filogenia y proponiendo mejoras. H) El desarrollo de técnicas "ad hoc" para obtener ganancia de complejidad alrededor de dos frentes: distribución de los cálculos y datos, y el uso de sistemas de información. Los puntos A-F se centran en las aportaciones conceptuales de nuestra aproximación, mientras que los puntos G-H enfatizan la parte de herramientas e implementación. Los contenidos de la tesis están contrastados por la comunidad científica mediante las siguientes publicaciones en conferencias y revistas internacionales. La introducción de model checking como entorno formal para analizar propiedades biológicas (puntos A-C) ha llevado a la publicación de nuestro primer artículo de congreso [1]. En [2], desarrollamos la verificación de hipótesis filogenéticas sobre un árbol de ejemplo construido a partir de las relaciones impuestas por un conjunto de proteínas codificadas por el ADN mitocondrial humano (ADNmt). En ese ejemplo, usamos una herramienta automática y genérica de model checking (punto G). El artículo de revista [7] resume lo básico de los artículos de congreso previos y extiende la aplicación de lógicas temporales a propiedades filogenéticas no consideradas hasta ahora. Los artículos citados aquí engloban los contenidos presentados en las Parte I--II de la tesis. El enorme tamaño de los árboles y la considerable cantidad de información asociada a los estados (p.ej., la cadena de ADN) obligan a la introducción de adaptaciones especiales en las herramientas de model checking para mantener un rendimiento razonable en la verificación de propiedades y aliviar también el problema de la explosión de estados (puntos G-H). El artículo de congreso [3] presenta las ventajas de rebanar el ADN asociado a los estados, la partición de la filogenia en pequeños subárboles y su distribución entre varias máquinas. Además, la idea original del model checking rebanado se complementa con la inclusión de una base de datos externa para el almacenamiento de secuencias. El artículo de revista [4] reúne las nociones introducidas en [3] junto con la implementación y resultados preliminares presentados [5]. Este tema se corresponde con lo presentado en la Parte III de la tesis. Para terminar, la tesis reaprovecha las extensiones de las lógicas temporales con tiempo explícito y probabilidades a fin de manipular e interrogar al árbol sobre información cuantitativa. El artículo de congreso [6] ejemplifica la necesidad de introducir probabilidades y tiempo discreto para el análisis filogenético de un fenotipo real, en este caso, el ratio de distribución de la intolerancia a la lactosa entre diversas poblaciones arraigadas en las hojas de la filogenia. Esto se corresponde con el Capítulo 13, que queda englobado dentro de las Partes IV--V. Las Partes IV--V completan los conceptos presentados en ese artículo de conferencia hacia otros dominios de aplicación, como la puntuación de árboles, y tiempo continuo (puntos E-F). La introducción de parámetros en las hipótesis filogenéticas se plantea como trabajo futuro. Referencias [1] Roberto Blanco, Gregorio de Miguel Casado, José Ignacio Requeno, and José Manuel Colom. Temporal logics for phylogenetic analysis via model checking. In Proceedings IEEE International Workshop on Mining and Management of Biological and Health Data, pages 152-157. IEEE, 2010. [2] José Ignacio Requeno, Roberto Blanco, Gregorio de Miguel Casado, and José Manuel Colom. Phylogenetic analysis using an SMV tool. In Miguel P. Rocha, Juan M. Corchado Rodríguez, Florentino Fdez-Riverola, and Alfonso Valencia, editors, Proceedings 5th International Conference on Practical Applications of Computational Biology and Bioinformatics, volume 93 of Advances in Intelligent and Soft Computing, pages 167-174. Springer, Berlin, 2011. [3] José Ignacio Requeno, Roberto Blanco, Gregorio de Miguel Casado, and José Manuel Colom. Sliced model checking for phylogenetic analysis. In Miguel P. Rocha, Nicholas Luscombe, Florentino Fdez-Riverola, and Juan M. Corchado Rodríguez, editors, Proocedings 6th International Conference on Practical Applications of Computational Biology and Bioinformatics, volume 154 of Advances in Intelligent and Soft Computing, pages 95-103. Springer, Berlin, 2012. [4] José Ignacio Requeno and José Manuel Colom. Model checking software for phylogenetic trees using distribution and database methods. Journal of Integrative Bioinformatics, 10(3):229-233, 2013. [5] José Ignacio Requeno and José Manuel Colom. Speeding up phylogenetic model checking. In Mohd Saberi Mohamad, Loris Nanni, Miguel P. Rocha, and Florentino Fdez-Riverola, editors, Proceedings 7th International Conference on Practical Applications of Computational Biology and Bioinformatics, volume 222 of Advances in Intelligent Systems and Computing, pages 119-126. Springer, Berlin, 2013. [6] José Ignacio Requeno and José Manuel Colom. Timed and probabilistic model checking over phylogenetic trees. In Miguel P. Rocha et al., editors, Proceedings 8th International Conference on Practical Applications of Computational Biology and Bioinformatics, Advances in Intelligent and Soft Computing. Springer, Berlin, 2014. [7] José Ignacio Requeno, Gregorio de Miguel Casado, Roberto Blanco, and José Manuel Colom. Temporal logics for phylogenetic analysis via model checking. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10(4):1058-1070, 2013

    Tree-Based Representation and Generation of Natural and Mathematical Language

    Full text link
    Mathematical language in scientific communications and educational scenarios is important yet relatively understudied compared to natural languages. Recent works on mathematical language focus either on representing stand-alone mathematical expressions, especially in their natural tree format, or mathematical reasoning in pre-trained natural language models. Existing works on jointly modeling and generating natural and mathematical languages simply treat mathematical expressions as text, without accounting for the rigid structural properties of mathematical expressions. In this paper, we propose a series of modifications to existing language models to jointly represent and generate text and math: representing mathematical expressions as sequences of node tokens in their operator tree format, using math symbol and tree position embeddings to preserve the semantic and structural properties of mathematical expressions, and using a constrained decoding method to generate mathematically valid expressions. We ground our modifications in GPT-2, resulting in a model MathGPT, and demonstrate that it outperforms baselines on mathematical expression generation tasks

    Symbolic and Visual Retrieval of Mathematical Notation using Formula Graph Symbol Pair Matching and Structural Alignment

    Get PDF
    Large data collections containing millions of math formulae in different formats are available on-line. Retrieving math expressions from these collections is challenging. We propose a framework for retrieval of mathematical notation using symbol pairs extracted from visual and semantic representations of mathematical expressions on the symbolic domain for retrieval of text documents. We further adapt our model for retrieval of mathematical notation on images and lecture videos. Graph-based representations are used on each modality to describe math formulas. For symbolic formula retrieval, where the structure is known, we use symbol layout trees and operator trees. For image-based formula retrieval, since the structure is unknown we use a more general Line of Sight graph representation. Paths of these graphs define symbol pairs tuples that are used as the entries for our inverted index of mathematical notation. Our retrieval framework uses a three-stage approach with a fast selection of candidates as the first layer, a more detailed matching algorithm with similarity metric computation in the second stage, and finally when relevance assessments are available, we use an optional third layer with linear regression for estimation of relevance using multiple similarity scores for final re-ranking. Our model has been evaluated using large collections of documents, and preliminary results are presented for videos and cross-modal search. The proposed framework can be adapted for other domains like chemistry or technical diagrams where two visually similar elements from a collection are usually related to each other

    A Tree Locality-Sensitive Hash for Secure Software Testing

    Get PDF
    Bugs in software that make it through testing can cost tens of millions of dollars each year, and in some cases can even result in the loss of human life. In order to eliminate bugs, developers may use symbolic execution to search through possible program states looking for anomalous states. Most of the computational effort to search through these states is spent solving path constraints in order to determine the feasibility of entering each state. State merging can make this search more efficient by combining program states, allowing multiple execution paths to be analyzed at the same time. However, a merge with dissimilar path constraints dramatically increases the time necessary to solve the path constraint. Currently, there are no distance measures for path constraints, and pairwise comparison of program states is not scalable. A hashing method is presented that clusters constraints in such a way that similar constraints are placed in the same cluster without requiring pairwise comparisons between queries. When combined with other state-of-the-art state merging techniques, the hashing method allows the symbolic executor to execute more instructions per second and find more terminal execution states than the other techniques alone, without decreasing the high path coverage achieved by merging many states together

    Efficient Frequent Subtree Mining Beyond Forests

    Get PDF
    A common paradigm in distance-based learning is to embed the instance space into some appropriately chosen feature space equipped with a metric and to define the dissimilarity between instances by the distance of their images in the feature space. If the instances are graphs, then frequent connected subgraphs are a well-suited pattern language to define such feature spaces. Identifying the set of frequent connected subgraphs and subsequently computing embeddings for graph instances, however, is computationally intractable. As a result, existing frequent subgraph mining algorithms either restrict the structural complexity of the instance graphs or require exponential delay between the output of subsequent patterns. Hence distance-based learners lack an efficient way to operate on arbitrary graph data. To resolve this problem, in this thesis we present a mining system that gives up the demand on the completeness of the pattern set to instead guarantee a polynomial delay between subsequent patterns. Complementing this, we devise efficient methods to compute the embedding of arbitrary graphs into the Hamming space spanned by our pattern set. As a result, we present a system that allows to efficiently apply distance-based learning methods to arbitrary graph databases. To overcome the computational intractability of the mining step, we consider only frequent subtrees for arbitrary graph databases. This restriction alone, however, does not suffice to make the problem tractable. We reduce the mining problem from arbitrary graphs to forests by replacing each graph by a polynomially sized forest obtained from a random sample of its spanning trees. This results in an incomplete mining algorithm. However, we prove that the probability of missing a frequent subtree pattern is low. We show empirically that this is true in practice even for very small sized forests. As a result, our algorithm is able to mine frequent subtrees in a range of graph databases where state-of-the-art exact frequent subgraph mining systems fail to produce patterns in reasonable time or even at all. Furthermore, the predictive performance of our patterns is comparable to that of exact frequent connected subgraphs, where available. The above method considers polynomially many spanning trees for the forest, while many graphs have exponentially many spanning trees. The number of patterns found by our mining algorithm can be negatively influenced by this exponential gap. We hence propose a method that can (implicitly) consider forests of exponential size, while remaining computationally tractable. This results in a higher recall for our incomplete mining algorithm. Furthermore, the methods extend the known positive results on the tractability of exact frequent subtree mining to a novel class of transaction graphs. We conjecture that the next natural extension of our results to a larger transaction graph class is at least as difficult as proving whether P = NP, or not. Regarding the graph embedding step, we apply a similar strategy as in the mining step. We represent a novel graph by a forest of its spanning trees and decide whether the frequent trees from the mining step are subgraph isomorphic to this forest. As a result, the embedding computation has one-sided error with respect to the exact subgraph isomorphism test but is computationally tractable. Furthermore, we show that we can leverage a partial order on the pattern set. This structure can be used to reduce the runtime of the embedding computation dramatically. For the special case of Jaccard-similarity between graph embeddings, a further substantial reduction of runtime can be achieved using min-hashing. The Jaccard-distance can be approximated using small sketch vectors that can be computed fast, again using the partial order on the tree patterns

    Adaptive Learning and Mining for Data Streams and Frequent Patterns

    Get PDF
    Aquesta tesi està dedicada al disseny d'algorismes de mineria de dades per fluxos de dades que evolucionen en el temps i per l'extracció d'arbres freqüents tancats. Primer ens ocupem de cadascuna d'aquestes tasques per separat i, a continuació, ens ocupem d'elles conjuntament, desenvolupant mètodes de classificació de fluxos de dades que contenen elements que són arbres. En el model de flux de dades, les dades arriben a gran velocitat, i els algorismes que els han de processar tenen limitacions estrictes de temps i espai. En la primera part d'aquesta tesi proposem i mostrem un marc per desenvolupar algorismes que aprenen de forma adaptativa dels fluxos de dades que canvien en el temps. Els nostres mètodes es basen en l'ús de mòduls detectors de canvi i estimadors en els llocs correctes. Proposem ADWIN, un algorisme de finestra lliscant adaptativa, per la detecció de canvi i manteniment d'estadístiques actualitzades, i proposem utilitzar-lo com a caixa negra substituint els comptadors en algorismes inicialment no dissenyats per a dades que varien en el temps. Com ADWIN té garanties teòriques de funcionament, això obre la possibilitat d'ampliar aquestes garanties als algorismes d'aprenentatge i de mineria de dades que l'usin. Provem la nostre metodologia amb diversos mètodes d'aprenentatge com el Naïve Bayes, partició, arbres de decisió i conjunt de classificadors. Construïm un marc experimental per fer mineria amb fluxos de dades que varien en el temps, basat en el programari MOA, similar al programari WEKA, de manera que sigui fàcil pels investigadors de realitzar-hi proves experimentals. Els arbres són grafs acíclics connectats i són estudiats com vincles en molts casos. En la segona part d'aquesta tesi, descrivim un estudi formal dels arbres des del punt de vista de mineria de dades basada en tancats. A més, presentem algorismes eficients per fer tests de subarbres i per fer mineria d'arbres freqüents tancats ordenats i no ordenats. S'inclou una anàlisi de l'extracció de regles d'associació de confiança plena dels conjunts d'arbres tancats, on hem trobat un fenomen interessant: les regles que la seva contrapart proposicional és no trivial, són sempre certes en els arbres a causa de la seva peculiar combinatòria. I finalment, usant aquests resultats en fluxos de dades evolutius i la mineria d'arbres tancats freqüents, hem presentat algorismes d'alt rendiment per fer mineria d'arbres freqüents tancats de manera adaptativa en fluxos de dades que evolucionen en el temps. Introduïm una metodologia general per identificar patrons tancats en un flux de dades, utilitzant la Teoria de Reticles de Galois. Usant aquesta metodologia, desenvolupem un algorisme incremental, un basat en finestra lliscant, i finalment un que troba arbres freqüents tancats de manera adaptativa en fluxos de dades. Finalment usem aquests mètodes per a desenvolupar mètodes de classificació per a fluxos de dades d'arbres.This thesis is devoted to the design of data mining algorithms for evolving data streams and for the extraction of closed frequent trees. First, we deal with each of these tasks separately, and then we deal with them together, developing classification methods for data streams containing items that are trees. In the data stream model, data arrive at high speed, and the algorithms that must process them have very strict constraints of space and time. In the first part of this thesis we propose and illustrate a framework for developing algorithms that can adaptively learn from data streams that change over time. Our methods are based on using change detectors and estimator modules at the right places. We propose an adaptive sliding window algorithm ADWIN for detecting change and keeping updated statistics from a data stream, and use it as a black-box in place or counters or accumulators in algorithms initially not designed for drifting data. Since ADWIN has rigorous performance guarantees, this opens the possibility of extending such guarantees to learning and mining algorithms. We test our methodology with several learning methods as Naïve Bayes, clustering, decision trees and ensemble methods. We build an experimental framework for data stream mining with concept drift, based on the MOA framework, similar to WEKA, so that it will be easy for researchers to run experimental data stream benchmarks. Trees are connected acyclic graphs and they are studied as link-based structures in many cases. In the second part of this thesis, we describe a rather formal study of trees from the point of view of closure-based mining. Moreover, we present efficient algorithms for subtree testing and for mining ordered and unordered frequent closed trees. We include an analysis of the extraction of association rules of full confidence out of the closed sets of trees, and we have found there an interesting phenomenon: rules whose propositional counterpart is nontrivial are, however, always implicitly true in trees due to the peculiar combinatorics of the structures. And finally, using these results on evolving data streams mining and closed frequent tree mining, we present high performance algorithms for mining closed unlabeled rooted trees adaptively from data streams that change over time. We introduce a general methodology to identify closed patterns in a data stream, using Galois Lattice Theory. Using this methodology, we then develop an incremental one, a sliding-window based one, and finally one that mines closed trees adaptively from data streams. We use these methods to develop classification methods for tree data streams.Postprint (published version
    corecore