
    Quantity superlatives in Germanic, or, ‘Life on the fault line between adjective and determiner'

    This paper concerns the superlative forms of the words many, much, few, and little, and their equivalents in other Germanic languages (German, Dutch, Swedish, Norwegian, Danish, Dalecarlian, Icelandic, and Faroese). It demonstrates that every possible relationship between definiteness and interpretation is attested. It also demonstrates that agreement mismatches are found with both relative and proportional readings, but of different kinds in each case. One consistent pattern is that a quantity superlative with adverbial morphology and neuter singular agreement features is used with relative superlatives; quantity superlatives with proportional readings, on the other hand, always agree in number. I conclude that quantity superlatives are not structurally analogous to quality superlatives on either relative or proportional readings, but that they depart from a plain attributive structure in different ways. On relative readings they can be akin to pseudopartitives (as in a cup of tea), while proportional readings are more closely related to partitives (as in a piece of the cake). More specifically, I suggest that the agreement features a superlative exhibits depend on the domain from which the target is drawn (the target-domain hypothesis). When the target is a degree, as it is with adverbial superlatives and certain relative superlatives, default neuter singular emerges; definiteness there is driven by the same process that drives definiteness with adverbial superlatives. With proportional readings, the target argument of the superlative is a subpart or subset of the domain indicated by the substance noun, hence number agreement. Subtle aspects of how the comparison class and the superlative marker are construed determine definiteness for proportional readings.
    http://eecoppock.info/germanic.pdf
    Accepted manuscript

    A Large-Scale Comparison of Historical Text Normalization Systems

    There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder--decoder models, but studies have used different datasets and different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experiments on eight languages, comparing systems spanning all categories of proposed normalization techniques, analysing the effect of training data quantity, and using different evaluation methods. The datasets and scripts are made publicly available.
    Comment: Accepted at NAACL 2019
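    One of the technique families the abstract lists, distance metrics, can be sketched minimally: map each historical spelling to the nearest entry in a modern lexicon by Levenshtein edit distance. The tiny lexicon and the function names below are illustrative assumptions, not taken from the paper.

    ```python
    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                  # deletion
                                curr[j - 1] + 1,              # insertion
                                prev[j - 1] + (ca != cb)))    # substitution
            prev = curr
        return prev[-1]

    def normalize(historical: str, lexicon: list[str]) -> str:
        """Return the modern lexicon entry nearest to the historical form."""
        return min(lexicon, key=lambda w: levenshtein(historical, w))

    lexicon = ["very", "many", "move", "over"]
    print(normalize("verie", lexicon))  # -> very
    ```

    Real systems in this family add weighted edit costs learned from aligned pairs; the unweighted distance above is only the baseline idea.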

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create a simple, language-independent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conflation step and useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages: for English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length.
    The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms across languages.
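    The two sub-word indexing units the abstract highlights, overlapping character 3-grams and fixed-length word prefixes, can be sketched in a few lines. The function names and parameter defaults are illustrative assumptions, not the paper's implementation.

    ```python
    def char_ngrams(word: str, n: int = 3) -> list[str]:
        """Overlapping character n-grams; words shorter than n index as themselves."""
        if len(word) <= n:
            return [word]
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    def prefix(word: str, k: int = 4) -> str:
        """k-prefix indexing unit (e.g. the 4-prefixes used for Hindi)."""
        return word[:k]

    print(char_ngrams("retrieval"))  # -> ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']
    print(prefix("retrieval"))       # -> retr
    ```

    Note how one word form yields several short index terms, matching the abstract's observation that sub-word indexing increases the number of index terms while decreasing their average length.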

    AAA: Fair Evaluation for Abuse Detection Systems Wanted


    Apportioning Development Effort in a Probabilistic LR Parsing System through Evaluation

    We describe an implemented system for robust domain-independent syntactic parsing of English, using a unification-based grammar of part-of-speech and punctuation labels coupled with a probabilistic LR parser. We present evaluations of the system's performance along several different dimensions; these enable us to assess the contribution that each individual part makes to the success of the system as a whole, and thus to prioritise the effort devoted to its further enhancement. Currently, the system parses around 80% of sentences in a substantial corpus of general text containing a number of distinct genres. On a random sample of 250 such sentences the system has a mean crossing bracket rate of 0.71, and recall and precision of 83% and 84% respectively, when evaluated against manually disambiguated analyses.
    Comment: 10 pages, 1 Postscript figure. To appear in Proceedings of the Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, May 1996
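    The bracketing metrics cited in this abstract (crossing brackets, bracket precision and recall) can be computed on constituent spans represented as (start, end) pairs. The spans below are invented for illustration; only the metric definitions follow the standard PARSEVAL-style scheme.

    ```python
    def crossing(span, gold_spans):
        """A test span crosses a gold span if they overlap without nesting."""
        s, e = span
        return any(gs < s < ge < e or s < gs < e < ge for gs, ge in gold_spans)

    def bracket_scores(test_spans, gold_spans):
        """Return (precision, recall, crossing-bracket count) over span sets."""
        test, gold = set(test_spans), set(gold_spans)
        match = len(test & gold)
        precision = match / len(test)
        recall = match / len(gold)
        crossings = sum(crossing(sp, gold) for sp in test)
        return precision, recall, crossings

    gold = [(0, 7), (0, 3), (4, 7)]
    test = [(0, 7), (0, 3), (2, 5)]
    p, r, x = bracket_scores(test, gold)
    print(round(p, 2), round(r, 2), x)  # -> 0.67 0.67 1
    ```

    A paper's reported mean crossing bracket rate is then the crossing count averaged over sentences, which is how a figure like 0.71 per sentence arises.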