6,091 research outputs found

    Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

    Full text link
    Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.Comment: 10 pages, 4 figure

    Compressive Mining: Fast and Optimal Data Mining in the Compressed Domain

    Full text link
    Real-world data typically contain repeated and periodic patterns. This suggests that they can be effectively represented and compressed using only a few coefficients of an appropriate basis (e.g., Fourier, Wavelets, etc.). However, distance estimation when the data are represented using different sets of coefficients is still a largely unexplored area. This work studies the optimization problems related to obtaining the \emph{tightest} lower/upper bound on Euclidean distances when each data object is potentially compressed using a different set of orthonormal coefficients. Our technique leads to tighter distance estimates, which translates into more accurate search, learning and mining operations \textit{directly} in the compressed domain. We formulate the problem of estimating lower/upper distance bounds as an optimization problem. We establish the properties of optimal solutions, and leverage the theoretical analysis to develop a fast algorithm to obtain an \emph{exact} solution to the problem. The suggested solution provides the tightest estimation of the L2L_2-norm or the correlation. We show that typical data-analysis operations, such as k-NN search or k-Means clustering, can operate more accurately using the proposed compression and distance reconstruction technique. We compare it with many other prevalent compression and reconstruction techniques, including random projections and PCA-based techniques. We highlight a surprising result, namely that when the data are highly sparse in some basis, our technique may even outperform PCA-based compression. The contributions of this work are generic as our methodology is applicable to any sequential or high-dimensional data as well as to any orthogonal data transformation used for the underlying data compression scheme.Comment: 25 pages, 20 figures, accepted in VLD

    Automated user modeling for personalized digital libraries

    Get PDF
    Digital libraries (DL) have become one of the most typical ways of accessing any kind of digitalized information. Due to this key role, users welcome any improvements on the services they receive from digital libraries. One trend used to improve digital services is through personalization. Up to now, the most common approach for personalization in digital libraries has been user-driven. Nevertheless, the design of efficient personalized services has to be done, at least in part, in an automatic way. In this context, machine learning techniques automate the process of constructing user models. This paper proposes a new approach to construct digital libraries that satisfy user’s necessity for information: Adaptive Digital Libraries, libraries that automatically learn user preferences and goals and personalize their interaction using this information

    Parsing and reflective printing, bidirectionally

    Get PDF
    Language designers usually need to implement parsers and printers. Despite being two intimately related programs, in practice they are often designed separately, and then need to be revised and kept consistent as the language evolves. It will be more convenient if the parser and printer can be unified and developed in one single program, with their consistency guaranteed automatically.Furthermore, in certain scenarios (like showing compiler optimisation results to the programmer), it is desirable to have a more powerful reflective printer that, when an abstract syntax tree corresponding to a piece of program text is modified, can reflect the modification to the program text while preserving layouts, comments, and syntactic sugar.To address these needs, we propose a domain-specific language BIYACC, whose programs denote both a parser and a reflective printer for an unambiguous context-free grammar. BIYACC is based on the theory of bidirectional transformations, which helps to guarantee by construction that the pairs of parsers and reflective printers generated by BIYACC are consistent. We show that BIYACC is capable of facilitating many tasks such as Pombrio and Krishnamurthi's "resugaring", simple refactoring, and language evolution.We would like to thank reviewers for their valuable comments. This work was partially supported by the Japan Society for the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (A) No. 25240009

    The predictor-adaptor paradigm : automation of custom layout by flexible design

    Get PDF

    Text and Spatial-Temporal Data Visualization

    Get PDF
    In this dissertation, we discuss a text visualization system, a tree drawing algorithm, a spatial-temporal data visualization paradigm and a tennis match visualization system. Corpus and corpus tools have become an important part of language teaching and learning. And yet text visualization is rarely used in this area. We present Text X-Ray, a Web tool for corpus-based language teaching and learning and the interactive text visualizations in Text X-Ray allow users to quickly examine a corpus or corpora at different levels of details: articles, paragraphs, sentences, and words. Level-based tree drawing is a common algorithm that produces intuitive and clear presentations of hierarchically structured information. However, new applications often introduces new aesthetic requirements that call for new tree drawing methods. We present an indented level-based tree drawing algorithm for visualizing parse trees of English language. This algorithm displays a tree with an aspect ratio that fits the aspect ratio of the newer computer displays, while presenting the words in a way that is easy to read. We discuss the design of the algorithm and its application in text visualization for linguistic analysis and language learning. A story is a chain of events. Each event has multiple dimensions, including time, location, characters, actions, and context. Storyline visualizations attempt to visually present the many dimensions of a story’s events and their relationships. Integrating the temporal and spatial dimension in a single visualization view is often desirable but highly challenging. One of the main reasons is that spatial data is inherently 2D while temporal data is inherently 1D. We present a storyline visualization technique that integrate both time and location information in a single view. Sports data visualization can be a useful tool for analyzing or presenting sports data. We present a new technique for visualizing tennis match data. It is designed as a supplement to online live streaming or live blogging of tennis matches and can retrieve data directly from a tennis match live blogging web site and display 2D interactive view of match statistics. Therefore, it can be easily integrated with the current live blogging platforms used by many news organizations. The visualization addresses the limitations of the current live coverage of tennis matches by providing a quick overview and also a great amount of details on demand

    Efficient abstractions for visualization and interaction

    Get PDF
    Abstractions, such as functions and methods, are an essential tool for any programmer. Abstractions encapsulate the details of a computation: the programmer only needs to know what the abstraction achieves, not how it achieves it. However, using abstractions can come at a cost: the resulting program may be inefficient. This can lead to programmers not using some abstractions, instead writing the entire functionality from the ground up. In this thesis, we present several results that make this situation less likely when programming interactive visualizations. We present results that make abstractions more efficient in the areas of graphics, layout and events
    corecore