9 research outputs found

    Which one is better: presentation-based or content-based math search?

    Full text link
    Mathematical content is a valuable information source, and retrieving it has become an important problem. This paper compares two search strategies for math expressions: presentation-based and content-based approaches. Presentation-based search uses a state-of-the-art math search system, while content-based search applies semantic enrichment to convert math expressions into their content forms and searches over those content-based expressions. By considering the meaning of math expressions, the quality of the search system is improved over presentation-based systems.
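
    A minimal sketch of the distinction the abstract draws, under assumed operator names (not the paper's system): presentationally different expressions can normalize to a single content form, which is why content-based matching can find matches that presentation-based matching misses.

```python
# A toy content-form normalizer: commutative operators get sorted
# arguments, so presentationally different inputs collide on purpose.

def content_form(op, args):
    """Build a content (operator-tree) form for an expression."""
    if op in {"plus", "times"}:          # commutative operators
        args = sorted(args)
    return (op, tuple(args))

# Presentation-level strings "a+b" and "b+a" differ token by token...
q1 = content_form("plus", ["a", "b"])
q2 = content_form("plus", ["b", "a"])
assert q1 == q2  # ...but share one content form, so both match a content query
print(q1)        # ('plus', ('a', 'b'))
```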

    Performance Evaluation and Optimization of Math-Similarity Search

    Full text link
    Similarity search in math is the task of finding mathematical expressions that are similar to a user's query. We conceptualized the similarity factors between mathematical expressions and proposed an approach to math similarity search (MSS) by defining metrics based on those factors [11]. Our preliminary implementation indicated the advantage of MSS compared to non-similarity-based search. To search similar math expressions more effectively and efficiently, MSS is further optimized. This paper focuses on the performance evaluation and optimization of MSS. Our results show that the proposed optimization process significantly improved the performance of MSS with respect to both relevance ranking and recall. Comment: 15 pages, 8 figures.
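
    The paper's actual similarity factors and metrics are defined in [11]; the sketch below only illustrates the general shape of such a metric, as an assumed weighted combination of two factors (symbol overlap and structural overlap).

```python
# A toy math-similarity score: Jaccard overlap of symbols plus Jaccard
# overlap of parent-child structure edges, linearly weighted. The factor
# choice and the weights are illustrative, not the paper's definitions.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def math_similarity(expr1, expr2, w_sym=0.5, w_struct=0.5):
    """expr is a (symbols, edges) pair for one expression."""
    sym1, edges1 = expr1
    sym2, edges2 = expr2
    return w_sym * jaccard(sym1, sym2) + w_struct * jaccard(edges1, edges2)

# x^2 + 1  vs  x^2 + y : same structure, partly overlapping symbols
e1 = ({"x", "2", "+", "1", "^"}, {("+", "^"), ("^", "x"), ("^", "2"), ("+", "1")})
e2 = ({"x", "2", "+", "y", "^"}, {("+", "^"), ("^", "x"), ("^", "2"), ("+", "y")})
print(math_similarity(e1, e2))   # high, but below 1.0
```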

    Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval

    Full text link
    We summarize math search engines and search interfaces produced by the Document and Pattern Recognition Lab in recent years, in particular the min math search interface and the Tangent search engine. Source code for both systems is publicly available. "The Masses" refers to our emphasis on creating systems for mathematical non-experts, who may be looking to define unfamiliar notation or to browse documents based on the visual appearance of formulae rather than their mathematical semantics. Comment: Paper for an invited talk at the 2015 Conference on Intelligent Computer Mathematics (July, Washington DC).
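
    Tangent-style appearance-based retrieval indexes pairs of symbols together with their relative layout. The sketch below illustrates that idea on a toy symbol layout tree; the tree encoding and relation labels are assumptions for illustration, not Tangent's actual index format.

```python
# Extract (ancestor symbol, descendant symbol, layout path) tuples from a
# symbol layout tree given as (symbol, {relation: child}).

def descendants(node, rpath):
    """Yield (symbol, relative-path) for node and everything below it."""
    sym, children = node
    yield (sym, rpath)
    for rel, child in children.items():
        yield from descendants(child, rpath + (rel,))

def symbol_pairs(node):
    """Yield one tuple per ancestor-descendant symbol pair."""
    sym, children = node
    for rel, child in children.items():
        for csym, rpath in descendants(child, (rel,)):
            yield (sym, csym, rpath)
        yield from symbol_pairs(child)   # pairs rooted deeper in the tree

# x^2 + y: '+' with x (carrying superscript 2) on the left and y on the right
tree = ("+", {"left": ("x", {"sup": ("2", {})}), "right": ("y", {})})
print(list(symbol_pairs(tree)))
# [('+', 'x', ('left',)), ('+', '2', ('left', 'sup')),
#  ('x', '2', ('sup',)), ('+', 'y', ('right',))]
```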

    Navegador ontológico matemático - NOMAT (Mathematical ontological browser - NOMAT)

    Get PDF
    The query algorithms in search engines use indexing, contextual analysis, and ontologies, among other techniques, for text search. However, they do not handle equations, owing to the complexity of writing them. NOMAT is a prototype mathematical-expression search engine that looks for information both in a thesaurus and on the internet, using an ontological tool to filter and contextualize the information and a LaTeX editor for the symbols in these expressions. The search engine was created to support mathematical research. Unlike other internet search engines, NOMAT requires no prior knowledge of LaTeX, because it provides an editing tool that lets the user write directly the symbols that make up the mathematical expression of interest. The results obtained were accurate and contextualized compared to those of other commercial and non-commercial search engines.
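
    A hypothetical sketch of the editing-tool idea the abstract describes: the user clicks rendered symbols and the editor assembles the LaTeX query behind the scenes, so no LaTeX knowledge is needed. The palette contents and query format below are my assumptions, not NOMAT's implementation.

```python
# Map on-screen palette symbols to the LaTeX the search engine consumes.

PALETTE = {
    "∫": r"\int", "∑": r"\sum", "√": r"\sqrt", "α": r"\alpha", "≤": r"\leq",
}

def build_latex_query(clicked_symbols):
    """Translate a sequence of clicked palette symbols into a LaTeX string;
    characters outside the palette pass through unchanged."""
    return " ".join(PALETTE.get(s, s) for s in clicked_symbols)

print(build_latex_query(["∫", "α", "≤", "1"]))   # \int \alpha \leq 1
```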

    Semantic Tagging of Mathematical Expressions

    Full text link
    Semantic tagging of mathematical expressions (STME) assigns semantic meanings to tokens in mathematical expressions. In this work, we propose a novel STME approach that relies on neither text accompanying the expressions nor labelled training data. Instead, our method only requires a mathematical grammar set. We point out that, besides the grammar of mathematics, the special properties of variables and users' habits of writing expressions help us understand the implicit intents of the user. We build a system that considers both restrictions from the grammar and variable properties, and then apply an unsupervised method to our probabilistic model to learn the user habits. To evaluate our system, we automatically build large-scale training and test datasets from a public math forum. The results demonstrate the significant improvement of our method over a maximum-frequency baseline. We also compile statistics that reveal properties of the language of mathematics.
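
    The abstract compares against a maximum-frequency baseline; a minimal sketch of that baseline follows (the data layout is my assumption): each token is tagged with whichever semantic label it received most often in training.

```python
from collections import Counter, defaultdict

def train_max_freq(tagged_expressions):
    """tagged_expressions: iterable of [(token, semantic_label), ...]."""
    counts = defaultdict(Counter)
    for expr in tagged_expressions:
        for token, label in expr:
            counts[token][label] += 1
    # keep only the single most frequent label per token
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def tag(expr_tokens, model, default="variable"):
    """Tag every token; unseen tokens fall back to a default label."""
    return [(t, model.get(t, default)) for t in expr_tokens]

model = train_max_freq([
    [("f", "function"), ("x", "variable")],
    [("f", "function"), ("y", "variable")],
    [("f", "variable")],             # occasionally 'f' is just a variable
])
print(tag(["f", "x"], model))        # [('f', 'function'), ('x', 'variable')]
```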

    Methods of Relevance Ranking and Hit-content Generation in Math Search

    No full text
    To be effective and useful, math search systems must not only maximize precision and recall but also present the query hits in a form that makes it easy for the user to quickly identify the truly relevant ones. To meet that requirement, the search system must sort the hits according to domain-appropriate relevance criteria and provide with each hit a query-relevant summary of the hit target.
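
    A generic sketch of hit-content generation (not the paper's method): pick the fixed-width token window of the hit target that covers the most distinct query terms and return it as the summary snippet.

```python
def snippet(doc_tokens, query_terms, width=8):
    """Return the width-token window covering the most distinct query terms."""
    query = set(query_terms)
    best_window, best_score = doc_tokens[:width], -1
    for i in range(max(1, len(doc_tokens) - width + 1)):
        window = doc_tokens[i:i + width]
        score = len(query & set(window))   # distinct query terms covered
        if score > best_score:
            best_window, best_score = window, score
    return " ".join(best_window)

doc = "the quadratic formula solves a x ^ 2 + b x + c = 0 for x".split()
print(snippet(doc, {"quadratic", "formula", "x"}))
# the quadratic formula solves a x ^ 2
```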

    数学情報アクセスのための数式表現の検索と曖昧性解消 (Retrieval and Disambiguation of Mathematical Expressions for Mathematical Information Access)

    Get PDF
    Degree type: Doctorate (course-based). Examination committee: (chair) Associate Professor Tetsuo Shibuya, The University of Tokyo; Professor Masami Hagiya, The University of Tokyo; Associate Professor Ichiro Hasuo, The University of Tokyo; Associate Professor Yoshimasa Tsuruoka, The University of Tokyo; Associate Professor Atsushi Fujii, Tokyo Institute of Technology. University of Tokyo (東京大学).

    Querying Large Collections of Semistructured Data

    Get PDF
    An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data.

    We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches or ignore the structure completely; they sacrifice either recall or precision for efficiency. We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords.

    As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure to make it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective in inferring grammars for web documents with presentational markup. There is currently no general-purpose grammar inference framework that can handle irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further.
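
    A simplified sketch of structural similarity ranking (not the thesis's algorithm): index every rooted subtree of each expression under a canonical string, then score a document by its weighted subtree overlap with the query, so larger shared substructure counts more.

```python
def canonical(node):
    """Canonical string for a tree given as (label, [children])."""
    label, children = node
    return "(" + label + "".join(canonical(c) for c in children) + ")"

def all_subtrees(node):
    """Yield the canonical string of every rooted subtree."""
    yield canonical(node)
    _, children = node
    for c in children:
        yield from all_subtrees(c)

def structural_score(query_tree, doc_tree):
    q, d = set(all_subtrees(query_tree)), set(all_subtrees(doc_tree))
    # weight each shared subtree by its node count (one '(' per node)
    return sum(s.count("(") for s in q & d)

x2_plus_y = ("+", [("^", [("x", []), ("2", [])]), ("y", [])])
x2_plus_1 = ("+", [("^", [("x", []), ("2", [])]), ("1", [])])
print(structural_score(x2_plus_y, x2_plus_1))   # 5: shares the x^2 substructure
```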

    MECA: Mathematical Expression Based Post Publication Content Analysis

    Get PDF
    Mathematical expressions (MEs) are critical abstractions in technical publications. While the volume of technical publications keeps growing, few ME-centric applications have been developed, owing to the steep gap between the typesetting data in post-publication digital documents and the high-level technical semantics. As technical publication accelerates every year, word-based information analysis technologies are inadequate for helping users discover, organize, and interrelate technical work efficiently and effectively. This dissertation presents a modeling framework and associated algorithms, called the mathematics-centered post-publication content analysis (MECA) system, that address several critical issues in building a layered solution architecture for the recovery of high-level technical information.

    Overall, MECA consists of four layers of modeling work, starting from the extraction of MEs from Portable Document Format (PDF) files. Specifically, a weakly supervised sequential typesetting Bayesian model is developed, using a concise font-value-based feature space for Bayesian inference of ME vs. words for the rendering units separated by spaces. A Markov Random Field (MRF) model is designed to merge and correct the MEs identified from the rendering units, which are otherwise prone to fragmentation of large MEs.

    At the next layer, MECA aims at recovering ME semantics. The first step is ME layout analysis, which disambiguates layout structures with a Content-Constrained Spatial (CCS) global inference model to overcome local errors. It achieves high accuracy at low computing cost through a parametric lognormal model for the feature distribution of typographic systems. The ME layout is then parsed into ME semantics with a three-phase processing workflow that overcomes a variety of semantic ambiguities. In the first phase, the ME layout is linearized into a token sequence; in the second, an abstract syntax tree (AST) is constructed using a probabilistic context-free grammar; in the third, tree rewriting transforms the AST into ME objects.

    Built upon these two layers of ME extraction and semantics modeling, we next explore one of the bonding relationships between words and MEs: ME declarations, where the words and MEs are, respectively, the qualitative and quantitative (QuQn) descriptors of technical concepts. Conventional low-level PoS tagging and parsing tools perform poorly on this type of mixed word-ME (MWM) sentence, so we develop an MWM processing toolkit. A semi-automated, weakly supervised framework is employed to mine declaration templates from a large amount of unlabeled data, so that the templates can be used to detect ME declarations.

    On the basis of these three low-level content extraction and prediction solutions, the MECA system can extract MEs, interpret their mathematical semantics, and identify their bonding declaration words. By analyzing the dependency among these elements in a paper, we can construct a QuQn map, which essentially represents the reasoning flow of the paper. Three case studies are conducted on QuQn map applications: differential content comparison of papers, publication trend generation, and interactive mathematical learning. Outcomes from these studies suggest that MECA is a highly practical content analysis technology built on a theoretically sound framework. Much more can be expanded and improved upon for the next generation of deep content analysis solutions.
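
    A toy sketch of the idea behind MECA's first layer: classify each space-separated rendering unit as math or text from font-derived features with a naive Bayes rule. The features and probabilities below are invented for illustration; the dissertation's actual model is a weakly supervised sequential typesetting Bayesian model.

```python
from math import log

# P(feature | class), assumed to be estimated from weakly labelled data
LIKELIHOOD = {
    "math": {"italic_font": 0.8, "math_font": 0.6, "has_digit_mix": 0.5},
    "text": {"italic_font": 0.1, "math_font": 0.01, "has_digit_mix": 0.05},
}
PRIOR = {"math": 0.2, "text": 0.8}

def classify(features):
    """features: the set of binary features firing for one rendering unit."""
    def score(cls):
        s = log(PRIOR[cls])
        for f, p in LIKELIHOOD[cls].items():
            s += log(p) if f in features else log(1.0 - p)
        return s
    return max(("math", "text"), key=score)

print(classify({"italic_font", "math_font"}))   # 'math'
print(classify(set()))                          # 'text'
```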