26 research outputs found
Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
This manual addresses the linguistic issues that arise in connection with annotating texts by part of speech ( tagging ). Section 2 is an alphabetical list of the parts of speech encoded in the annotation systems of the Penn Treebank Project, along with their corresponding abbreviations ( tags ) and some information concerning their definition. This section allows you to find an unfamiliar tag by looking up a familiar part of speech. Section 3 recapitulates the information in Section 2, but this time the information is alphabetically ordered by tags. This is the section to consult in order to find out what an unfamiliar tag means. Since the parts of speech are probably familiar to you from high school English, you should have little difficulty in assimilating the tags themselves. However, it is often quite difficult to decide which tag is appropriate in a particular context. The two sections 4 and 5 therefore include examples and guidelines on how to tag problematic cases. If you are uncertain about whether a given tag is correct or not, refer to these sections in order to ensure a consistently annotated text. Section 4 discusses parts of speech that are easily confused and gives guidelines on how to tag such cases, while Section 5 contains an alphabetical list of specific problematic words and collocations. Finally, Section 6 discusses some general tagging conventions. One general rule, however, is so important that we state it here. Many texts are not models of good prose, and some contain outright errors and slips of the pen. Do not be tempted to correct a tag to what it would be if the text were correct; rather, it is the incorrect word that should be tagged correctly
Recommended from our members
Incremental Phrase Structure Generation and a Universal Theory of V2
Remarks on Causatives and Passive
The investigation of causative constructions has been a topic of enduring interest among linguists, generative and non-generative alike. For one thing, the variability and sheer complexity of the relevant empirical domain, even within a group of closely related languages such as Romance, poses considerable and often daunting descriptive challenges. On the other hand, comparative work by linguists of various theoretical persuasions (Aissen 1974, Aissen 1979, Baker 1985, Comrie 1976, Marantz 1984, Zubizarreta 1982, Zubizarreta 1985, among many others) has shown that certain properties of causatives recur with striking regularity among unrelated and typologically otherwise diverse languages, in the absence of areal contact. This holds out the hope that the bewildering variety of data that we are faced with when we consider causative constructions can be understood with reference to a relatively small number of causative types. At first glance, the most salient distinction is that between syntactic and morphological causative formation. As is well known, in some languages the causative is expressed by means of syntactic complementation, as in the English example in (I), whereas in others it involves morphological affixation, as in the Japanese equivalent of (1) given in (2)
Recommended from our members
Parsing Early Modern English for Linguistic Search
This work addresses the question of whether the output of a state-of-the-art parser is accurate enough to support research in theoretical linguistics. In order to build reliable models of syntactic change, we aim to eventually parse the 1.5-billion-word Early English Books Online (EEBO) corpus. But since EEBO is not yet parsed, we begin by constructing and testing a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). In order to obtain robust results, we define an 8-fold split on PPCEME. We then evaluate the parser with evalb and, more relevantly for us, with a task-specific metric - namely, its accuracy in parsing 6 sentence types necessary to track the rise of auxiliary do (as in They did not come vs. its historical precursor They came not). Retrieving the relevant sentences from the gold and test versions with CorpusSearch queries, we find that the parser\u27s accuracy promises to be sufficient for our purposes. A remaining concern is the variability of the output, which we plan to address with three pieces of future work sketched in the conclusion
Recommended from our members
Parsing Early English Books Online for Linguistic Search
This work addresses the question of how to evaluate a state-of-the-art parser on Early English Books Online (EEBO), a 1.5-billion-word collection of unannotated text, for utility in linguistic research. Earlier work has trained and evaluated a parser on the 1.7-million-word Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME) and defined a query-based evaluation to score the retrieval of 6 specific sentence types of interest. However, significant differences between EEBO and the manually-annotated PPCEME make it inappropriate to assume that these results will generalize to EEBO. Fortunately, an overlap of source material in PPCEME and EEBO allows us to establish a token alignment between them and to score the POS-tagging on EEBO. We use this alignment together with a more principled version of the query-based evaluation to score the recovery of sentence types on this subset of EEBO, thus allowing us to estimate the increase in error rate on EEBO compared to PPCEME. The increase is largely due to differences in sentence segmentation between the two corpora, pointing the way to further improvements
First Steps Towards an Annotated Database of American English
This paper reports on one of the first steps in building a very large annotated database of American English. We present and discuss the results of an experiment comparing manual part-of-speech tagging with manual verification and correction of automatic stochastic tagging. The experiment shows that correcting is superior to tagging with respect to speed, consistency and accuracy
Deducing linguistic structure from the statistics of large corpora
Within the last two years, approaches using both stochastic and symbolic techniques have proved adequate to deduce lexical ambiguity resolution rules with less than 3-4 % error rate, when trained on moderat
A Part-of-Speech Tagger for Yiddish: First Steps in Tagging the Yiddish Book Center Corpus
We describe the construction and evaluation of a part-of-speech tagger for
Yiddish (the first one, to the best of our knowledge). This is the first step
in a larger project of automatically assigning part-of-speech tags and
syntactic structure to Yiddish text for purposes of linguistic research. We
combine two resources for the current work - an 80K word subset of the Penn
Parsed Corpus of Historical Yiddish (PPCHY) (Santorini, 2021) and 650 million
words of OCR'd Yiddish text from the Yiddish Book Center (YBC). We compute word
embeddings on the YBC corpus, and these embeddings are used with a tagger model
trained and evaluated on the PPCHY. Yiddish orthography in the YBC corpus has
many spelling inconsistencies, and we present some evidence that even simple
non-contextualized embeddings are able to capture the relationships among
spelling variants without the need to first "standardize" the corpus. We
evaluate the tagger performance on a 10-fold cross-validation split, with and
without the embeddings, showing that the embeddings improve tagger performance.
However, a great deal of work remains to be done, and we conclude by discussing
some next steps, including the need for additional annotated training and test
data