Character encoding in corpus construction.

By A. M. McEnery and R. Z. Xiao


This chapter first briefly reviews the history of character encoding. It then discusses standard and non-standard native encoding systems and evaluates the efforts to unify these character codes, before turning to Unicode and the various Unicode Transformation Formats (UTFs). In conclusion, we recommend that Unicode (UTF-8, to be precise) be used in corpus construction.

Publisher: AHDS
Year: 2005
OAI identifier:
Provided by: Lancaster E-Prints
