26 research outputs found

    Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)

    This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech (tagging). Section 2 is an alphabetical list of the parts of speech encoded in the annotation systems of the Penn Treebank Project, along with their corresponding abbreviations (tags) and some information concerning their definition. This section allows you to find an unfamiliar tag by looking up a familiar part of speech. Section 3 recapitulates the information in Section 2, but this time the information is alphabetically ordered by tags. This is the section to consult in order to find out what an unfamiliar tag means. Since the parts of speech are probably familiar to you from high school English, you should have little difficulty in assimilating the tags themselves. However, it is often quite difficult to decide which tag is appropriate in a particular context. The two sections 4 and 5 therefore include examples and guidelines on how to tag problematic cases. If you are uncertain about whether a given tag is correct or not, refer to these sections in order to ensure a consistently annotated text. Section 4 discusses parts of speech that are easily confused and gives guidelines on how to tag such cases, while Section 5 contains an alphabetical list of specific problematic words and collocations. Finally, Section 6 discusses some general tagging conventions. One general rule, however, is so important that we state it here. Many texts are not models of good prose, and some contain outright errors and slips of the pen. Do not be tempted to correct a tag to what it would be if the text were correct; rather, it is the incorrect word that should be tagged correctly.
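
    The two orderings described in this abstract (by part of speech in Section 2, by tag in Section 3) can be illustrated with a small sketch. The six tags below are a tiny subset of the Penn Treebank tagset chosen for illustration; the names `TAG_TO_POS`, `ordered_by_pos`, and `ordered_by_tag` are hypothetical and do not come from the manual.

```python
# A handful of Penn Treebank tags with their glosses (illustrative subset only).
TAG_TO_POS = {
    "NN": "noun, singular or mass",
    "DT": "determiner",
    "VB": "verb, base form",
    "JJ": "adjective",
    "NNS": "noun, plural",
    "RB": "adverb",
}


def ordered_by_pos():
    """List (part of speech, tag) pairs alphabetically by part of speech.

    Mirrors the lookup direction of Section 2: familiar name -> tag.
    """
    return sorted((pos, tag) for tag, pos in TAG_TO_POS.items())


def ordered_by_tag():
    """List (tag, part of speech) pairs alphabetically by tag.

    Mirrors the lookup direction of Section 3: unfamiliar tag -> meaning.
    """
    return sorted(TAG_TO_POS.items())


if __name__ == "__main__":
    for pos, tag in ordered_by_pos():
        print(f"{pos:<25} {tag}")
    print()
    for tag, pos in ordered_by_tag():
        print(f"{tag:<5} {pos}")
```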

    Remarks on Causatives and Passive

    The investigation of causative constructions has been a topic of enduring interest among linguists, generative and non-generative alike. For one thing, the variability and sheer complexity of the relevant empirical domain, even within a group of closely related languages such as Romance, poses considerable and often daunting descriptive challenges. On the other hand, comparative work by linguists of various theoretical persuasions (Aissen 1974, Aissen 1979, Baker 1985, Comrie 1976, Marantz 1984, Zubizarreta 1982, Zubizarreta 1985, among many others) has shown that certain properties of causatives recur with striking regularity among unrelated and typologically otherwise diverse languages, in the absence of areal contact. This holds out the hope that the bewildering variety of data that we are faced with when we consider causative constructions can be understood with reference to a relatively small number of causative types. At first glance, the most salient distinction is that between syntactic and morphological causative formation. As is well known, in some languages the causative is expressed by means of syntactic complementation, as in the English example in (1), whereas in others it involves morphological affixation, as in the Japanese equivalent of (1) given in (2).

    First Steps Towards an Annotated Database of American English

    This paper reports on one of the first steps in building a very large annotated database of American English. We present and discuss the results of an experiment comparing manual part-of-speech tagging with manual verification and correction of automatic stochastic tagging. The experiment shows that correcting is superior to tagging with respect to speed, consistency and accuracy.

    Deducing linguistic structure from the statistics of large corpora

    Within the last two years, approaches using both stochastic and symbolic techniques have proved adequate to deduce lexical ambiguity resolution rules with less than 3-4% error rate, when trained on moderat

    A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus

    We describe the construction and evaluation of a part-of-speech tagger for Yiddish (the first one, to the best of our knowledge). This is the first step in a larger project of automatically assigning part-of-speech tags and syntactic structure to Yiddish text for purposes of linguistic research. We combine two resources for the current work: an 80K word subset of the Penn Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word embeddings on the YBC corpus, and these embeddings are used with a tagger model trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has many spelling inconsistencies, and we present some evidence that even simple non-contextualized embeddings are able to capture the relationships among spelling variants without the need to first "standardize" the corpus. We evaluate the tagger performance on a 10-fold cross-validation split, with and without the embeddings, showing that the embeddings improve tagger performance. However, a great deal of work remains to be done, and we conclude by discussing some next steps, including the need for additional annotated training and test data.
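
    The 10-fold cross-validation mentioned in this abstract can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `k_fold_splits` and `cross_validate`, the shuffling seed, and the `train_and_score` callback are all hypothetical, and the abstract does not say how the folds were actually drawn.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs so that each item is in exactly one test fold.

    Assumption: folds are formed by shuffling item indices once and dealing
    them round-robin into k folds.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = [items[i] for i in folds[held_out]]
        train = [
            items[i]
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for i in fold
        ]
        yield train, test


def cross_validate(items, train_and_score, k=10, seed=0):
    """Average the score returned by train_and_score(train, test) over k folds.

    train_and_score is a hypothetical callback that trains a tagger on the
    training sentences and returns its accuracy on the test sentences.
    """
    scores = [
        train_and_score(train, test)
        for train, test in k_fold_splits(items, k, seed)
    ]
    return sum(scores) / len(scores)
```

    Comparing a tagger with and without embeddings would then amount to calling `cross_validate` twice on the same sentences with the same `seed`, once per configuration, so that both are scored on identical folds.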