141 research outputs found

    Extensible Markup Language (XML) 1.1

    Get PDF
    The Extensible Markup Language (XML) is a subset of SGML and is completely defined in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML. Second edition.

    Encryption by using base-n systems with many characters

    Full text link
    It is possible to interpret text as numbers (and vice versa) if one interprets letters and other characters as digits and assumes that they have an inherent, immutable ordering. This is already demonstrated by the conventional digit set of the hexadecimal number system, where the letters ABCDEF, in this exact alphabetic order, each stand for a digit and thus a numerical value. In this article, we elaborate this idea consistently and include all symbols of the Unicode standard for digital character coding, in their standard ordering. We show how this can be used to form digit sets of different sizes and how simple subsequent conversion between bases can yield encryption whose output mimics wrong encoding and accidental noise. Unfortunately, because of encoding peculiarities, switching to a higher base does not automatically result in efficient disk-space compression. Comment: 12 pages, 6 figures.
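
    The core idea can be sketched in a few lines of Python. The snippet below is a minimal illustration under assumed digit sets (printable ASCII as the source set, a 256-character Unicode block as the target set), not the article's implementation: a string is read as one large integer in the source base and re-expressed in the target base, which produces output that looks like mis-encoded noise and can be reversed the same way.

        # Sketch of base-n text conversion with characters as digits.
        # The digit sets and function names here are illustrative assumptions.

        def text_to_int(text, digits):
            """Interpret `text` as a number in base len(digits)."""
            base = len(digits)
            value = 0
            for ch in text:
                value = value * base + digits.index(ch)
            return value

        def int_to_text(value, digits):
            """Express `value` in base len(digits) using `digits` as symbols."""
            base = len(digits)
            if value == 0:
                return digits[0]
            out = []
            while value > 0:
                value, rem = divmod(value, base)
                out.append(digits[rem])
            return "".join(reversed(out))

        # Source digit set: printable ASCII in code-point order (a small
        # stand-in for the full Unicode ordering used in the article).
        src_digits = "".join(chr(c) for c in range(32, 127))
        # Target digit set: a block of 256 Unicode code points, so the
        # converted text looks like accidental noise or a wrong encoding.
        dst_digits = "".join(chr(c) for c in range(0x0400, 0x0400 + 256))

        # Note: round-tripping assumes the text does not start with digits[0]
        # (the usual leading-zero caveat of positional notation).
        encoded = int_to_text(text_to_int("hello world", src_digits), dst_digits)
        decoded = int_to_text(text_to_int(encoded, dst_digits), src_digits)
        print(encoded)   # looks like garbled text
        print(decoded)   # 'hello world'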

    Text Augmentation: Inserting markup into natural language text with PPM Models

    Get PDF
    This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n and a variety of escape methods. Four corpora are discussed, including a bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. The other corpora are the ROCLING Chinese text segmentation corpus, the Computists' Communique corpus and the Reuters corpus. A detailed examination is presented of methods for evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory. A new taxonomy of markup complexity is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
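
    As a rough illustration of the PPM-style character modelling the thesis builds on, the sketch below trains character counts over contexts up to a fixed order and estimates P(character | context) by escaping to shorter contexts when the character is unseen. It is an assumed toy model in the spirit of PPM escape method C, with no exclusion mechanism, and is not the CEM system.

        from collections import Counter, defaultdict

        MAX_ORDER = 3  # longest context length; real systems use larger n

        def train(text, max_order=MAX_ORDER):
            """Count how often each character follows each context of length 0..max_order."""
            counts = defaultdict(Counter)
            for i, ch in enumerate(text):
                for k in range(max_order + 1):
                    if i - k >= 0:
                        counts[text[i - k:i]][ch] += 1
            return counts

        def prob(ch, context, counts, alphabet, max_order=MAX_ORDER):
            """Estimate P(ch | context), escaping to shorter contexts when ch is unseen."""
            p = 1.0
            for k in range(min(max_order, len(context)), -1, -1):
                ctx = context[len(context) - k:]
                seen = counts.get(ctx)
                if not seen:
                    continue
                total = sum(seen.values())
                distinct = len(seen)
                if ch in seen:
                    return p * seen[ch] / (total + distinct)
                p *= distinct / (total + distinct)   # escape probability
            return p / len(alphabet)                 # order -1: uniform fallback

        text = "the theory of the thing"
        counts = train(text)
        alphabet = set(text)
        print(prob("h", "t", counts, alphabet))  # seen after 't': high probability
        print(prob("z", "t", counts, alphabet))  # never seen: small escape-weighted value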

    Managing writing systems using orthography profiles

    Get PDF
    This text is a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together at the intersection between the Unicode Standard and the International Phonetic Alphabet. Although these standards are often met with frustration by users, they nevertheless provide language researchers and programmers with the consistent computational architecture needed to process, publish and analyze lexical data from the world's languages. We therefore bring to light common, but not always obvious, pitfalls that researchers face when working with Unicode and IPA. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R tools that work with languages using orthography profiles, which describe author- or document-specific orthographic conventions. In this cookbook we give a formal specification of orthography profiles and provide recipes, using open-source tools, that show how users can segment text, analyze it, identify errors, and transform it into different written forms for comparative linguistics research.
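
    The orthography-profile approach lends itself to a short sketch. The code below is an assumed toy example, not the authors' Python/R tool suite: a profile lists the (possibly multi-character) graphemes of one orthography together with an IPA target, segmentation proceeds by greedy longest match over the profile, and characters with no profile entry are flagged so they can be reported as errors.

        import unicodedata

        # Hypothetical profile for a small German-like orthography: grapheme -> IPA.
        PROFILE = {
            "sch": "ʃ", "ch": "x", "ei": "aɪ", "a": "a", "e": "ə",
            "i": "ɪ", "n": "n", "r": "ʁ", "s": "s", "t": "t",
        }

        def segment(word, profile=PROFILE):
            """Split `word` into profile graphemes by greedy longest match."""
            word = unicodedata.normalize("NFC", word)
            longest = max(len(g) for g in profile)
            out, i = [], 0
            while i < len(word):
                for size in range(min(longest, len(word) - i), 0, -1):
                    chunk = word[i:i + size]
                    if chunk in profile:
                        out.append(chunk)
                        i += size
                        break
                else:
                    out.append("�" + word[i])  # unmatched character: flag as an error
                    i += 1
            return out

        def transliterate(word, profile=PROFILE):
            """Map each segmented grapheme to its profile target (here, IPA)."""
            return "".join(profile.get(g, g) for g in segment(word, profile))

        print(segment("scheinst"))        # ['sch', 'ei', 'n', 's', 't']
        print(transliterate("scheinst"))  # 'ʃaɪnst'

    Normalizing to NFC before matching reflects one of the pitfalls the guide discusses: the same visual string can be encoded with precomposed or decomposed characters, and a profile must be matched against a consistent normalization form.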

    The Unicode cookbook for linguists: Managing writing systems using orthography profiles

    Get PDF
    This text is a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together at the intersection between the Unicode Standard and the International Phonetic Alphabet. Although these standards are often met with frustration by users, they nevertheless provide language researchers and programmers with the consistent computational architecture needed to process, publish and analyze lexical data from the world's languages. We therefore bring to light common, but not always obvious, pitfalls that researchers face when working with Unicode and IPA. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R tools that work with languages using orthography profiles, which describe author- or document-specific orthographic conventions. In this cookbook we give a formal specification of orthography profiles and provide recipes, using open-source tools, that show how users can segment text, analyze it, identify errors, and transform it into different written forms for comparative linguistics research. This book is a prime example of open publishing as envisioned by Language Science Press: it is open access, has accompanying open-source software, open peer review, versioning, and so on. Read more in this blog post. The book is continuously being improved; you can follow its development at https://github.com/unicode-cookbook/cookbook/releases/latest
