Layout-based substitution tree indexing and retrieval for mathematical expressions
We introduce a new system for layout-based indexing and retrieval of mathematical expressions using substitution trees. Substitution trees can efficiently store and find hierarchically-structured data based on similarity. Previously, Kohlhase and Sucan applied substitution trees to indexing mathematical expressions in operator tree representation (Content MathML) and to query-by-expression retrieval. In this investigation, we use substitution trees to index mathematical expressions in symbol layout tree representation (LaTeX) to group expressions based on the similarity of their symbols, symbol layout, sub-expressions and size. We describe our novel substitution tree indexing and retrieval algorithms and our contributions to the behavior of these algorithms, including: allowing substitution trees to index and retrieve layout-based mathematical expressions instead of predicates; introducing a bias in the insertion function that helps group expressions in the index based on similarity in baseline size; modifying the search function to find expressions that are not identical yet still structurally similar to a search query; and ranking search results based on their similarity in symbols and symbol layout to the search query. We provide an experiment testing our system against the term frequency-inverse document frequency (TF-IDF) keyword-based system of Zanibbi and Yuan and demonstrate that: in many cases, the two systems are comparable; our system excels at finding expressions identical to the search query and expressions containing relevant sub-expressions; and our system experiences some limitations due to the insertion bias and the presence of LaTeX formatting in expressions.
Future work includes: designing a different insertion bias that improves the quality of search results; modifying the behavior of the search and ranking functions; and extending the scope of the system so that it can index websites or non-LaTeX expressions (such as MathML or images). Overall, we present a promising first attempt at layout-based substitution tree indexing and retrieval for mathematical expressions.
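To illustrate the kind of keyword-based TF-IDF baseline mentioned in the abstract above, here is a minimal sketch that ranks LaTeX expressions against a query by cosine similarity of TF-IDF weighted token vectors. It is a generic illustration and not the system of Zanibbi and Yuan: the tokenizer, the smoothed IDF formula and the function names (tokenize, tfidf_vectors, cosine, rank) are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(latex):
    # Hypothetical tokenizer: LaTeX commands (\frac), single letters,
    # digit runs, and any other single non-space symbol.
    return re.findall(r"\\[A-Za-z]+|[A-Za-z]|\d+|\S", latex)


def tfidf_vectors(docs):
    # Term counts per expression, then a smoothed inverse document frequency.
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(tok for c in counts for tok in c)
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in df}
    vectors = [{t: f * idf[t] for t, f in c.items()} for c in counts]
    return vectors, idf


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs):
    # Score every indexed expression against the query, best first.
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(tokenize(query))
    q_vec = {t: f * idf[t] for t, f in q_counts.items() if t in idf}
    scored = [(cosine(q_vec, v), d) for v, d in zip(vectors, docs)]
    return sorted(scored, key=lambda s: s[0], reverse=True)


if __name__ == "__main__":
    expressions = [r"x^2 + y^2", r"\frac{a}{b}", r"\sqrt{x}"]
    for score, expr in rank(r"x^2 + y^2", expressions):
        print(f"{score:.3f}  {expr}")
```

Because this sketch treats an expression as a bag of tokens, it ignores symbol layout entirely, which is the information the substitution tree index described above is designed to exploit.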
Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval
We summarize math search engines and search interfaces produced by the
Document and Pattern Recognition Lab in recent years, and in particular the min
math search interface and the Tangent search engine. Source code for both
systems is publicly available. "The Masses" refers to our emphasis on creating
systems for mathematical non-experts, who may be looking to define unfamiliar
notation, or browse documents based on the visual appearance of formulae rather
than their mathematical semantics.
Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer
Mathematics (July, Washington DC)
Content-based indexing of low resolution documents
In any multimedia presentation, it is increasingly common for attendees to photograph
slides that interest them during the presentation using capture devices.
To make these images more useful, they could be linked to an image or
video database. Such a database can support file archiving, teaching
and learning, research and knowledge management, all of which involve image search.
However, the above-mentioned devices, which include cameras and mobile phones, produce
low-resolution images affected by poor lighting and noise. Content-Based Image Retrieval
(CBIR) is considered among the most interesting and promising fields as far as
image search is concerned. Image search means finding images in a given image
database that are similar to a known query image. This thesis
is concerned with methods for identifying documents that are
captured using image capture devices. The thesis is also concerned with a
technique for retrieving images from an indexed image database. Both
concerns apply digital image processing techniques. To build an indexed
structure for fast and high quality content-based retrieval of an image, some existing
representative signatures and the key indexes used have been revised. The retrieval
performance depends heavily on how the indexing is done. Existing retrieval
approaches make use of shape, colour and
texture features. When these features are considered relative to individual
databases, the majority of retrieval approaches give poor results on low resolution
documents, consume a lot of time and, in some cases, return irrelevant images
for the given query image. The proposed identification and indexing method in
the thesis uses a Visual Signature (VS). The VS consists of the graphical information
of the captured slide's textual layout, shape moments and the spatial distribution of colour.
This signature-based approach is considered for fast and efficient
matching to fulfil the needs of real-time applications. The approach can also
overcome the problems of low resolution documents, such as noisy images,
varying lighting conditions in the environment and complex backgrounds. We present
hierarchical indexing techniques founded on trees and clustering. K-means
clustering is used for visual features such as colour, since their spatial
distribution gives good global information about an image. Tree indexing for the
extracted layout and shape features is structured hierarchically, and Euclidean
distance is used to find similar images for CBIR. The proposed indexing scheme is
assessed based on recall and precision, a standard CBIR retrieval performance evaluation. We
develop a CBIR system and conduct various retrieval experiments with the
fundamental aim of comparing accuracy during image retrieval. A new algorithm
that can be used with integrated visual signatures, especially in late fusion queries, is
introduced. The algorithm can reduce the shortcomings associated
with normalisation in the initial fusion technique. Slides from conference, lecture and
meeting presentations are used as real data to compare the performance of the
proposed technique with that of existing approaches. The findings of the
thesis present exciting possibilities, as the CBIR system is able to produce high
quality results even for queries that use low resolution documents. In the future,
multimodal signatures, relevance feedback and artificial intelligence
techniques are recommended for use in CBIR systems to further enhance
performance.
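The retrieval and evaluation steps named in the abstract above (Euclidean distance for similarity, recall and precision for assessment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the feature vectors, slide names and function names (euclidean, retrieve, precision_recall) are hypothetical, and the Visual Signature extraction and the tree and K-means indexing are not modelled.

```python
import math


def euclidean(a, b):
    # Euclidean distance between two equal-length feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, index, k):
    # Return the names of the k indexed images closest to the query vector.
    ranked = sorted(index, key=lambda name: euclidean(query, index[name]))
    return ranked[:k]


def precision_recall(retrieved, relevant):
    # Precision: fraction of retrieved images that are relevant.
    # Recall: fraction of relevant images that were retrieved.
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical three-dimensional signatures for three indexed slides.
    index = {
        "slide_a": [0.10, 0.20, 0.90],
        "slide_b": [0.10, 0.25, 0.85],
        "slide_c": [0.90, 0.80, 0.10],
    }
    query = [0.10, 0.20, 0.88]
    top = retrieve(query, index, k=2)
    print(top, precision_recall(top, {"slide_a"}))
```

A linear scan over the index is used here only to keep the sketch short; a hierarchical index of the kind the abstract describes would aim to avoid comparing the query against every stored signature.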
Retrieval and Disambiguation of Mathematical Expressions for Mathematical Information Access (数学情報アクセスのための数式表現の検索と曖昧性解消)
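Degree type: Doctorate by coursework. Examination committee: (Chair) Tetsuo Shibuya, Associate Professor, University of Tokyo; Masami Hagiya, Professor, University of Tokyo; Ichiro Hasuo, Associate Professor, University of Tokyo; Yoshimasa Tsuruoka, Associate Professor, University of Tokyo; Atsushi Fujii, Associate Professor, Tokyo Institute of Technology. University of Tokyo (東京大学)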
Querying Large Collections of Semistructured Data
An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data.
We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches, or they ignore the structure completely; they sacrifice either recall or precision for efficiency.
We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords.
As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure to make it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective in inferring grammars for web documents with presentational markup.
There is currently no general-purpose grammar inference framework that can handle irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further.