132 research outputs found
Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context
Mathematical formulae represent complex semantic information in a concise
form. Especially in Science, Technology, Engineering, and Mathematics,
mathematical formulae are crucial to communicate information, e.g., in
scientific papers, and to perform computations using computer algebra systems.
Enabling computers to access the information encoded in mathematical formulae
requires machine-readable formats that can represent both the presentation and
content, i.e., the semantics, of formulae. Exchanging such information between
systems additionally requires conversion methods for mathematical
representation formats. We analyze how the semantic enrichment of formulae
improves the format conversion process and show that considering the textual
context of formulae reduces the error rate of such conversions. Our main
contributions are: (1) providing an openly available benchmark dataset for the
mathematical format conversion task consisting of a newly created test
collection, an extensive, manually curated gold standard and task-specific
evaluation metrics; (2) performing a quantitative evaluation of
state-of-the-art tools for mathematical format conversions; (3) presenting a
new approach that considers the textual context of formulae to reduce the error
rate for mathematical format conversions. Our benchmark dataset facilitates
future research on mathematical format conversions as well as research on many
problems in mathematical information retrieval. Because we annotated and linked
all components of formulae, e.g., identifiers, operators and other entities, to
Wikidata entries, the gold standard can, for instance, be used to train methods
for formula concept discovery and recognition. Such methods can then be applied
to improve mathematical information retrieval systems, e.g., for semantic
formula search, recommendation of mathematical content, or detection of
mathematical plagiarism.Comment: 10 pages, 4 figure
Dynamic Mathematical Layout in E-Books
The development of e-books technology, with the emphasis on content readability and accessibility, raises the issue of mathematical layout optimisation. Although typographic formatting is successfully conducted by incorporating complex mathematical content in the form of bitmap images, such content is not accessible to text-to-speech technologies nor does it enable readability on small screens. For the purpose of creating dynamic and accessible mathematical layout, various e-book formats are analysed and EPUB3 format is presented as the only format meeting the requirements of openness, fluidity, support, accessibility and semantically correct mathematical type. The paper defines the realisation of EPUB e-books with emphasis on the optimal mathematical type using the Mathematical Markup Language. The problems of display, typographic formatting and dynamic layout are resolved through JavaScript functions depending on the rendering technologies of various e-readers on different platforms
The Evolution, current status, and future direction of XML
The Extensible Markup Language (XML) is now established as a multifaceted open-ended markup language and continues to increase in popularity. The major players that have shaped its development include the United States government, several key corporate entities, and the World Wide Web Consortium (W3C). This paper will examine these influences on XML and will address the emergence, the current status, and the future direction of this language. In addition, it will review best practices and research that have contributed to the continued development and advancement of XML
Discovering Mathematical Objects of Interest -- A Study of Mathematical Notations
Mathematical notation, i.e., the writing system used to communicate concepts
in mathematics, encodes valuable information for a variety of information
search and retrieval systems. Yet, mathematical notations remain mostly
unutilized by today's systems. In this paper, we present the first in-depth
study on the distributions of mathematical notation in two large scientific
corpora: the open access arXiv (2.5B mathematical objects) and the mathematical
reviewing service for pure and applied mathematics zbMATH (61M mathematical
objects). Our study lays a foundation for future research projects on
mathematical information retrieval for large scientific corpora. Further, we
demonstrate the relevance of our results to a variety of use-cases. For
example, to assist semantic extraction systems, to improve scientific search
engines, and to facilitate specialized math recommendation systems. The
contributions of our presented research are as follows: (1) we present the
first distributional analysis of mathematical formulae on arXiv and zbMATH; (2)
we retrieve relevant mathematical objects for given textual search queries
(e.g., linking with `Jacobi
polynomial'); (3) we extend zbMATH's search engine by providing relevant
mathematical formulae; and (4) we exemplify the applicability of the results by
presenting auto-completion for math inputs as the first contribution to math
recommendation systems. To expedite future research projects, we have made
available our source code and data.Comment: Proceedings of The Web Conference 2020 (WWW'20), April 20--24, 2020,
Taipei, Taiwa
CEDRICS: When CEDRAM Meets Tralics
summary:We describe CEDRICS, a general purpose system for automated journal production entirely based on a LaTeX input format. We show how the very basic ideas that initiated the whole effort turned into an efficient system because of the ability of LaTeX markup to parametrise simultaneously and without compromise high typographical quality for the PDF output as well as accurate XML metadata with (presentation) MathML formulas. This was made possible by the availability of two entirely independent LaTeX source processors with specific focus but full TeX-macro language support: PdfLaTeX by Hàn Thế Thành, and Tralics by José Grimm
Encoding models for scholarly literature
We examine the issue of digital formats for document encoding, archiving and
publishing, through the specific example of "born-digital" scholarly journal
articles. We will begin by looking at the traditional workflow of journal
editing and publication, and how these practices have made the transition into
the online domain. We will examine the range of different file formats in which
electronic articles are currently stored and published. We will argue strongly
that, despite the prevalence of binary and proprietary formats such as PDF and
MS Word, XML is a far superior encoding choice for journal articles. Next, we
look at the range of XML document structures (DTDs, Schemas) which are in
common use for encoding journal articles, and consider some of their strengths
and weaknesses. We will suggest that, despite the existence of specialized
schemas intended specifically for journal articles (such as NLM), and more
broadly-used publication-oriented schemas such as DocBook, there are strong
arguments in favour of developing a subset or customization of the Text
Encoding Initiative (TEI) schema for the purpose of journal-article encoding;
TEI is already in use in a number of journal publication projects, and the
scale and precision of the TEI tagset makes it particularly appropriate for
encoding scholarly articles. We will outline the document structure of a
TEI-encoded journal article, and look in detail at suggested markup patterns
for specific features of journal articles
SBML models and MathSBML
MathSBML is an open-source, freely-downloadable Mathematica package that facilitates working with Systems Biology Markup Language (SBML) models. SBML is a toolneutral,computer-readable format for representing models of biochemical reaction networks, applicable to metabolic networks, cell-signaling pathways, genomic regulatory networks, and other modeling problems in systems biology that is widely supported by the systems biology community. SBML is based on XML, a standard medium for representing and transporting data that is widely supported on the internet as well as in computational biology and bioinformatics. Because SBML is tool-independent, it enables model transportability, reuse, publication and survival. In addition to MathSBML, a number of other tools that support SBML model examination and manipulation are provided on the sbml.org website, including libSBML, a C/C++ library for reading SBML models; an SBML Toolbox for MatLab; file conversion programs; an SBML model validator and visualizer; and SBML specifications and schemas. MathSBML enables SBML file import to and export from Mathematica as well as providing an API for model manipulation and simulation
Daisy 3: A Standard for Accessible Multimedia Books
The Daisy standard for multimedia representation of books and other material is designed to facilitate technologies that foster easy navigation and synchronized multimodal presentation for people with print-reading-related disabilities
Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles
Purpose. This paper introduces the Research Articles in Simplified HTML (or RASH), which is a Web-first format for writing HTML-based scholarly papers; it is accompanied by the RASH Framework, a set of tools for interacting with RASH-based articles. The paper also presents an evaluation that involved authors and reviewers of RASH articles submitted to the SAVE-SD 2015 and SAVE-SD 2016 workshops.
Design. RASH has been developed aiming to: be easy to learn and use; share scholarly documents (and embedded semantic annotations) through the Web; support its adoption within the existing publishing workflow.
Findings. The evaluation study confirmed that RASH is ready to be adopted in workshops, conferences, and journals and can be quickly learnt by researchers who are familiar with HTML.
Research Limitations. The evaluation study also highlighted some issues in the adoption of RASH, and in general of HTML formats, especially by less technically savvy users. Moreover, additional tools are needed, e.g., for enabling additional conversions from/to existing formats such as OpenXML.
Practical Implications. RASH (and its Framework) is another step towards enabling the definition of formal representations of the meaning of the content of an article, facilitating its automatic discovery, enabling its linking to semantically related articles, providing access to data within the article in actionable form, and allowing integration of data between papers.
Social Implications. RASH addresses the intrinsic needs related to the various users of a scholarly article: researchers (focussing on its content), readers (experiencing new ways for browsing it), citizen scientists (reusing available data formally defined within it through semantic annotations), publishers (using the advantages of new technologies as envisioned by the Semantic Publishing movement).
Value. RASH helps authors to focus on the organisation of their texts, supports them in the task of semantically enriching the content of articles, and leaves all the issues about validation, visualisation, conversion, and semantic data extraction to the various tools developed within its Framework
- …