211 research outputs found


    We review the coding of multi-language text in digital form using the Unicode standard, with special attention to the UTF-8 variant, which is the most convenient variant for coding Latin text. We also give a short tutorial on using UTF-8 in Microsoft Word, Netscape Composer, and the text editor Kate. Standard Unicode fonts are recommended so that texts can be easily transferred from one computer to another or published on the Internet.
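
The convenience of UTF-8 for Latin text can be seen by counting bytes. The following is a minimal sketch, not taken from the reviewed paper; the helper name `encoded_sizes` and the sample strings are assumptions chosen for illustration.

```python
# Illustrative sketch: byte counts per encoding form for short sample strings.
# The function name and samples are hypothetical, not from the abstract above.
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes for three Unicode encoding forms."""
    # The explicit little-endian codecs avoid counting a byte order mark.
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    for sample in ("Latin", "café", "текст"):
        print(sample, encoded_sizes(sample))
```

Unaccented Latin letters take one byte each in UTF-8, accented Latin letters such as "é" take two, and the fixed-width forms spend two or four bytes on every character in these samples.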

    Duncode Characters Shorter

    This paper investigates the use of various encoders in text transformation, that is, in converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode a specific set of characters into fewer bytes, and universal encoders such as UTF-8 and UTF-16, which can encode the complete Unicode set at the cost of greater space requirements and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. Duncode is introduced as an encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It has the potential to compress multiple characters of a string into a single Duncode unit using fewer bytes. Although it offers less self-synchronizing identification information, Duncode surpasses UTF-8 in space efficiency. The application is available at https://github.com/laohur/duncode. Additionally, we have developed a benchmark for evaluating character encoders across different languages. It encompasses 179 languages and can be accessed at https://github.com/laohur/wiki2txt.



    Requirements analysis for multilingualism in a web content management system (Anforderungsanalyse zur Mehrsprachigkeit eines Web-Content-Management-Systems)

    'Think global – act local!' is a well-known saying that has not lost its validity on the World Wide Web. As globalization advances, so does the need for an international, multilingual web presence that is localized for each target group. The provider of a global web site faces a range of problems and tasks. Building a global web site means, among other things, recognizing cultural differences and accounting for them in the e-business strategy. The aim of this working paper is to derive basic requirements for the multilingual capability of a web site and, from these, the requirements for a web content management system (WCMS). The second chapter presents the implications of globalization for a web site in order to derive requirements and procedures for web site design. Building on this, it describes the basic structure of a WCMS and the ways in which a WCMS can support the design of a multilingual web site. The third chapter develops the basic requirements for a multilingual WCMS. To this end, it describes the task-specific requirements for a multilingual web site and, derived from these, for a WCMS. Finally, the technology-specific requirements are examined in more detail.

    Chinese localisation of Evergreen: an open source integrated library system

    Purpose - The purpose of this paper is to investigate various issues related to Chinese language localisation in Evergreen, an open source integrated library system (ILS). Design/methodology/approach - A Simplified Chinese version of Evergreen was implemented and tested and various issues such as encoding, indexing, searching, and sorting specifically associated with Simplified Chinese language were investigated. Findings - The paper finds that Unicode eases a lot of ILS development problems. However, having another language version of an ILS does not simply require the translation from one language to another. Indexing, searching, sorting and other locale related issues should be tackled not only language by language, but locale by locale. Practical implications - Most of the issues that have arisen during this project will be found with other ILS-like systems. Originality/value - This paper provides insights into issues of, and various solutions to, indexing, searching, and sorting in the Chinese language in an ILS. These issues and the solutions may be applicable to other digital library systems such as institutional repositories

    Transcoding Unicode Characters with AVX-512 Instructions

    Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open source library. Our library is part of the popular Node.js JavaScript runtime
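
For readers unfamiliar with the task, transcoding between UTF-8 and UTF-16 can be sketched with Python's built-in codecs. This is a plain scalar illustration under stated assumptions, not the AVX-512 algorithm described above; the function name `utf8_to_utf16le` is hypothetical.

```python
# Illustrative scalar sketch of UTF-8 to UTF-16 transcoding.
# It relies on Python's standard codecs and is unrelated to the paper's
# vectorized implementation.

def utf8_to_utf16le(data: bytes) -> bytes:
    """Validate UTF-8 input and re-encode it as little-endian UTF-16."""
    # decode() raises UnicodeDecodeError on malformed UTF-8 input.
    return data.decode("utf-8").encode("utf-16-le")


if __name__ == "__main__":
    for ch in ("A", "中", "😀"):
        src = ch.encode("utf-8")
        dst = utf8_to_utf16le(src)
        # Characters outside the Basic Multilingual Plane become a surrogate
        # pair, that is, two 16-bit code units (4 bytes).
        print(ch, len(src), "UTF-8 bytes ->", len(dst) // 2, "UTF-16 code units")
```

A Chinese character such as "中" shrinks from three UTF-8 bytes to one 16-bit code unit, which is why the byte counts of input and output differ and why a transcoder cannot simply copy bytes across.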