Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus
Short Message Service (SMS) messages are largely sent directly from one
person to another from their mobile phones. They represent a means of personal
communication that is an important communicative artifact in our current
digital era. As most existing studies have used private access to SMS corpora,
comparative studies using the same raw SMS data have not been possible up to
now. We describe our efforts to collect a public SMS corpus to address this
problem. We use a battery of methodologies to collect the corpus, paying
particular attention to privacy issues to address contributors' concerns. Our
live project collects new SMS message submissions, checks their quality and
adds the valid messages, releasing the resultant corpus as XML and as SQL
dumps, along with corpus statistics, every month. We opportunistically collect
as much metadata about the messages and their sender as possible, so as to
enable different types of analyses. To date, we have collected about 60,000
messages, focusing on English and Mandarin Chinese.
Comment: It contains 31 pages, 6 figures, and 10 tables. It has been submitted
to Language Resource and Evaluation Journal.
Verbal morphosyntactic disambiguation through topological field recognition in German-language law texts
The morphosyntactic disambiguation of verbs is a crucial pre-processing step for the syntactic analysis of morphologically rich languages like German and domains with complex clause structures like law texts. This paper explores how much linguistically motivated rules can contribute to the task. It introduces an incremental system of verbal morphosyntactic disambiguation that exploits the concept of topological fields. The system presented is capable of reducing the rate of POS-tagging mistakes from 10.2% to 1.6%. The evaluation shows that this reduction is mostly gained through checking the compatibility of morphosyntactic features within the long-distance syntactic relationships of discontinuous verbal elements. Furthermore, the present study shows that in law texts, the average distance between the left and right bracket of clauses is relatively large (9.5 tokens), and that in this domain, a wide context window is therefore necessary for the morphosyntactic disambiguation of verbs.
The lexicography of German
This chapter discusses the main dictionaries of the German language as it is spoken and written in Germany, and also German as it is spoken and written in Austria, Switzerland, the eastern fringes of Belgium, and South Tyrol. It also briefly describes Pennsylvania German. Corpora and other language resources used in German dictionary-making are also presented. Finally, there is a discussion of some current issues in German lexicography, as well as future prospects.