211 research outputs found
UNICODE
We review the coding of multi-language text in digital form using the Unicode standard, with special attention to
the UTF-8 variant, which is the most convenient variant for coding Latin text. We also give a short tutorial on
using UTF-8 in Microsoft Word, Netscape Composer and the text editor Kate. Standard Unicode fonts are
recommended so that texts can be easily transferred from one computer to another or published
on the Internet.
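The claim that UTF-8 is convenient for Latin text can be checked directly. The following is a minimal sketch in Python; the sample strings are illustrative assumptions and are not taken from the reviewed paper.

```python
# Sample strings are illustrative assumptions, not taken from the paper.
plain = "Unicode"        # ASCII letters only
accented = "caf\u00e9"   # "café": three ASCII letters plus U+00E9

# ASCII-only text is byte-for-byte identical in ASCII and UTF-8.
assert plain.encode("utf-8") == plain.encode("ascii")

utf8_len = len(accented.encode("utf-8"))       # U+00E9 takes two bytes in UTF-8
utf16_len = len(accented.encode("utf-16-le"))  # each of these characters takes two bytes
print(utf8_len, utf16_len)  # prints: 5 8
```

For text that is mostly unaccented Latin letters, UTF-8 therefore stays close to one byte per character, while UTF-16 needs two.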
Duncode Characters Shorter
This paper investigates the employment of various encoders in text
transformation, converting characters into bytes. It discusses local encoders
such as ASCII and GB-2312, which encode a limited set of characters in fewer bytes,
and universal encoders like UTF-8 and UTF-16, which can encode the complete
Unicode set at the cost of more space and are gaining widespread
acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders,
lack self-synchronizing capabilities. Duncode is introduced as an
innovative encoding method that aims to encode the entire Unicode character set
with high space efficiency, akin to local encoders. It has the potential to
compress multiple characters of a string into a Duncode unit using fewer bytes.
Despite offering less self-synchronizing identification information, Duncode
surpasses UTF-8 in terms of space efficiency. The application is available at
https://github.com/laohur/duncode. Additionally, we have developed a
benchmark for evaluating character encoders across different languages. It
encompasses 179 languages and can be accessed at
https://github.com/laohur/wiki2txt.
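The trade-off between local and universal encoders, and what self-synchronization means for UTF-8, can be illustrated with Python's built-in codecs. This is only a sketch: the sample text and the helper `char_starts` are hypothetical, and nothing here reproduces Duncode itself.

```python
# Illustrative sample (an assumption, not from the paper): two Chinese
# characters, U+4E2D and U+6587, both of which are covered by GB 2312.
text = "\u4e2d\u6587"

sizes = {
    "gb2312": len(text.encode("gb2312")),        # local encoder
    "utf-8": len(text.encode("utf-8")),          # universal encoder
    "utf-16-le": len(text.encode("utf-16-le")),  # universal encoder
}
print(sizes)  # {'gb2312': 4, 'utf-8': 6, 'utf-16-le': 4}


def char_starts(data: bytes) -> list:
    """Return the offsets of bytes that begin a character in valid UTF-8.

    Continuation bytes always have the bit pattern 10xxxxxx, so character
    boundaries can be found from any position in the stream. This property
    is what "self-synchronizing" refers to.
    """
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


starts = char_starts(text.encode("utf-8"))
print(starts)  # [0, 3]
```

For this sample, UTF-8 spends three bytes per character where GB 2312 spends two; part of that overhead is the marker bits that make boundary detection possible.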
Anforderungsanalyse zur Mehrsprachigkeit eines Web-Content-Management-Systems (Requirements analysis for the multilingual capability of a web content management system)
'Think global act local!' A well-known saying that has not lost its validity on the World Wide Web. As globalization advances, the need grows for an international, multilingual web presence that is localized and tailored to each target group. The provider of a global web site faces various problems and tasks. Creating a global web site means, among other things, recognizing cultural differences and taking them into account in the e-business strategy. The aim of this working paper is to derive basic requirements for the multilingual capability of a web site and, following from these, for a WCMS. The second chapter presents the implications of globalization for a web site in order to derive requirements and procedures for designing a web site. Building on this, it presents the basic structure of a WCMS and the ways in which a WCMS can support the design of a multilingual web site. The third chapter develops the basic requirements for a multilingual WCMS. To this end, it describes the task-specific requirements for a multilingual web site and, derived from them, for a WCMS. Finally, the technology-specific requirements are examined in more detail.
Chinese localisation of Evergreen: an open source integrated library system
Purpose - The purpose of this paper is to investigate various issues related to Chinese language localisation in Evergreen, an open source integrated library system (ILS).
Design/methodology/approach - A Simplified Chinese version of Evergreen was implemented and tested and various issues such as encoding, indexing, searching, and sorting specifically associated with Simplified Chinese language were investigated.
Findings - The paper finds that Unicode eases a lot of ILS development problems. However, having another language version of an ILS does not simply require the translation from one language to another. Indexing, searching, sorting and other locale related issues should be tackled not only language by language, but locale by locale.
Practical implications - Most of the issues that have arisen during this project will be found with other ILS-like systems.
Originality/value - This paper provides insights into issues of, and various solutions to, indexing, searching, and sorting in the Chinese language in an ILS. These issues and the solutions may be applicable to other digital library systems such as institutional repositories.
Transcoding Unicode Characters with AVX-512 Instructions
Intel includes in its recent processors a powerful set of instructions
capable of processing 512-bit registers with a single instruction (AVX-512).
Some of these instructions have no equivalent in earlier instruction sets. We
leverage these instructions to efficiently transcode strings between the most
common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice
as fast as the previous best solutions. For example, we transcode Chinese text
from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions
per character. To ensure reproducibility, we make our software freely available
as an open source library. Our library is part of the popular Node.js
JavaScript runtime.
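To make concrete what transcoding from UTF-8 to UTF-16 involves, here is a scalar reference sketch in Python. It is not the AVX-512 algorithm described in the abstract, which processes many bytes per instruction. The function name `utf8_to_utf16le` and the sample string are hypothetical, and the function assumes its input is already valid UTF-8.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Scalar reference transcoder; assumes `data` is valid UTF-8 (no validation)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:        # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b < 0xE0:      # 2-byte sequence
            cp, n = b & 0x1F, 2
        elif b < 0xF0:      # 3-byte sequence
            cp, n = b & 0x0F, 3
        else:               # 4-byte sequence
            cp, n = b & 0x07, 4
        # Each continuation byte contributes six payload bits.
        for c in data[i + 1:i + n]:
            cp = (cp << 6) | (c & 0x3F)
        i += n
        if cp < 0x10000:
            # Basic Multilingual Plane: one 16-bit code unit.
            out += cp.to_bytes(2, "little")
        else:
            # Supplementary planes: split into a surrogate pair.
            cp -= 0x10000
            out += (0xD800 | (cp >> 10)).to_bytes(2, "little")
            out += (0xDC00 | (cp & 0x3FF)).to_bytes(2, "little")
    return bytes(out)


# Illustrative sample: Chinese text, an accented Latin letter, and an emoji.
sample = "\u4e2d\u6587 \u00e9 \U0001F600"
assert utf8_to_utf16le(sample.encode("utf-8")) == sample.encode("utf-16-le")
```

The per-character branching in this loop is the kind of work a vectorized transcoder tries to avoid; the sketch is useful mainly as a correctness baseline to compare against Python's built-in codecs.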
- …