
    Acta Cybernetica : Volume 19. Number 4.


    Clausal tripartition, anti-locality and preliminary considerations of a formal approach to clause types

    We will see how it is reasonable to speak of a minimum distance that an element must cross in order to enter into a well-formed movement dependency. In the course of the discussion of this notion of anti-locality, a theoretical framework unfolds which is compatible with recent thoughts on syntactic computation regarding local economy and phrase structure, as well as the view that certain pronouns are grammatical formatives, rather than fully lexical expressions. The upshot will be that if an element does not move a certain distance, the derivation crashes at PF, unless the lower copy is spelled out as a pronominal element. The framework presented has a number of implications for the study of clause-typing, some of which will be discussed towards the end.

    General methods for fine-grained morphological and syntactic disambiguation

    We present methods for improved handling of morphologically rich languages (MRLS) where we define MRLS as languages that are morphologically more complex than English. Standard algorithms for language modeling, tagging and parsing have problems with the productive nature of such languages. Consider for example the possible forms of a typical English verb like work that generally has four different forms: work, works, working and worked. Its Spanish counterpart trabajar has six different forms in the present tense: trabajo, trabajas, trabaja, trabajamos, trabajáis and trabajan and more than 50 different forms when including the different tenses, moods (indicative, subjunctive and imperative) and participles. Such a high number of forms leads to sparsity issues: in a recent Wikipedia dump of more than 400 million tokens we find that 20 of these forms occur at most twice and that 10 forms do not occur at all. This means that even if we only need unlabeled data to estimate a model and even when looking at a relatively common and frequent verb, we do not have enough data to make reasonable estimates for some of its forms. However, if we decompose an unseen form such as trabajaréis 'you will work', we find that it is trabajar in future tense and second person plural. This allows us to make the predictions that are needed to decide on the grammaticality (language modeling) or syntax (tagging and parsing) of a sentence. In the first part of this thesis, we develop a morphological language model. A language model estimates the grammaticality and coherence of a sentence. Most language models used today are word-based n-gram models, which means that they estimate the transitional probability of a word following a history, the sequence of the (n - 1) preceding words. The probabilities are estimated from the frequencies of the history and the history followed by the target word in a huge text corpus. 
If either of the sequences is unseen, the length of the history has to be reduced. This leads to a less accurate estimate as less context is taken into account. Our morphological language model estimates an additional probability from the morphological classes of the words. These classes are built automatically by extracting morphological features from the word forms. To this end, we use unsupervised segmentation algorithms to find the suffixes of word forms. Such an algorithm might for example segment trabajaréis into trabaja and réis and we can then estimate the properties of trabajaréis from other word forms with the same or similar morphological properties. The data-driven nature of the segmentation algorithms allows them to not only find inflectional suffixes (such as -réis), but also more derivational phenomena such as the head nouns of compounds or even endings such as -tec, which identify technology-oriented companies such as Vortec, Memotec and Portec and would not be regarded as a morphological suffix by traditional linguistics. Additionally, we extract shape features such as whether a form contains digits or capital characters. This is important because many rare or unseen forms are proper names or numbers and often do not have meaningful suffixes. Our class-based morphological model is then interpolated with a word-based model to combine the generalization capabilities of the first and the high accuracy in case of sufficient data of the second. We evaluate our model across 21 European languages and find improvements between 3% and 11% in perplexity, a standard language modeling evaluation measure. Improvements are highest for languages with more productive and complex morphology such as Finnish and Estonian, but also visible for languages with a relatively simple morphology such as English and Dutch. We conclude that a morphological component yields consistent improvements for all the tested languages and argue that it should be part of every language model. 
Dependency trees represent the syntactic structure of a sentence by attaching each word to its syntactic head, the word it is directly modifying. Dependency parsing is usually tackled using heavily lexicalized (word-based) models and a thorough morphological preprocessing is important for optimal performance, especially for MRLS. We investigate if the lack of morphological features can be compensated by features induced using hidden Markov models with latent annotations (HMM-LAs) and find this to be the case for German. HMM-LAs were proposed as a method to increase part-of-speech tagging accuracy. The model splits the observed part-of-speech tags (such as verb and noun) into subtags. An expectation maximization algorithm is then used to fit the subtags to different roles. A verb tag for example might be split into an auxiliary verb and a full verb subtag. Such a split is usually beneficial because these two verb classes have different contexts. That is, a full verb might follow an auxiliary verb, but usually not another full verb. For German and English, we find that our model leads to consistent improvements over a parser not using subtag features. Looking at the labeled attachment score (LAS), the number of words correctly attached to their head, we observe an improvement from 90.34 to 90.75 for English and from 87.92 to 88.24 for German. For German, we additionally find that our model achieves almost the same performance (88.24) as a model using tags annotated by a supervised morphological tagger (LAS of 88.35). We also find that the German latent tags correlate with morphology. Articles for example are split by their grammatical case. We also investigate the part-of-speech tagging accuracies of models using the traditional treebank tagset and models using induced tagsets of the same size and find that the latter outperform the former, but are in turn outperformed by a discriminative tagger. 
Furthermore, we present a method for fast and accurate morphological tagging. While part-of-speech tagging annotates tokens in context with their respective word categories, morphological tagging produces a complete annotation containing all the relevant inflectional features such as case, gender and tense. A complete reading is represented as a single tag. As a reading might consist of several morphological features, the resulting tagset usually contains hundreds or even thousands of tags. This is an issue for many decoding algorithms such as Viterbi, whose runtime depends quadratically on the number of tags. In the case of morphological tagging, the problem can be avoided by using a morphological analyzer. A morphological analyzer is a manually created finite-state transducer that produces the possible morphological readings of a word form. This analyzer can be used to prune the tagging lattice and to allow for the application of standard sequence labeling algorithms. The downside of this approach is that such an analyzer is not available for every language or might not have the coverage required for the task. Additionally, the output tags of some analyzers are not compatible with the annotations of the treebanks, which might require some manual mapping of the different annotations or even a reduction in the complexity of the annotation. To avoid this problem, we propose to use the posterior probabilities of a conditional random field (CRF) lattice to prune the space of possible taggings. At the zero-order level the posterior probabilities of a token can be calculated independently from the other tokens of a sentence. The necessary computations can thus be performed in linear time. The features available to the model at this time are similar to the features used by a morphological analyzer (essentially the word form and features based on it), but also include the immediate lexical context. 
As the ambiguity of word types varies substantially, we just fix the average number of readings after pruning by dynamically estimating a probability threshold. Once we obtain the pruned lattice, we can add tag transitions and convert it into a first-order lattice. The quadratic forward-backward computations are now executed on the remaining plausible readings and are thus efficient. We can now continue pruning and extending the lattice order at a relatively low additional runtime cost (depending on the pruning thresholds). The training of the model can be implemented efficiently by applying stochastic gradient descent (SGD). The CRF gradient can be calculated from a lattice of any order as long as the correct reading is still in the lattice. During training, we thus run the lattice pruning until we either reach the maximal order or until the correct reading is pruned. If the reading is pruned, we perform the gradient update with the highest-order lattice still containing the reading. This approach is similar to early updating in the structured perceptron literature and forces the model to learn how to keep the correct readings in the lower-order lattices. In practice, we observe a high number of lower-order updates during the first training epoch and almost exclusively higher-order updates during later epochs. We evaluate our CRF tagger on six languages with different morphological properties. We find that for languages with a high word form ambiguity such as German, the pruning results in a moderate drop in tagging accuracy while for languages with less ambiguity such as Spanish and Hungarian the loss due to pruning is negligible. However, our pruning strategy allows us to train higher-order models (order > 1), which give substantial improvements for all languages and also outperform unpruned first-order models. That is, the model might lose some of the correct readings during pruning, but is also able to solve more of the harder cases that require more context. 
We also find our model to substantially and significantly outperform a number of frequently used taggers such as Morfette and SVMTool. Based on our morphological tagger we develop a simple method to increase the performance of a state-of-the-art constituency parser. A constituency tree describes the syntactic properties of a sentence by assigning spans of text to a hierarchical bracket structure. Prior work developed a language-independent approach for the automatic annotation of accurate and compact grammars. Its implementation, known as the Berkeley parser, gives state-of-the-art results for many languages such as English and German. For some MRLS such as Basque and Korean, however, the parser gives unsatisfactory results because of its simple unknown word model. This model maps unknown words to a small number of signatures (similar to our morphological classes). These signatures do not seem expressive enough for many of the subtle distinctions made during parsing. We propose to replace rare words by the morphological reading generated by our tagger instead. The motivation is twofold. First, our tagger has access to a number of lexical and sublexical features not available during parsing. Second, we expect the morphological readings to contain most of the information required to make the correct parsing decision even though we know that things such as the correct attachment of prepositional phrases might require some notion of lexical semantics. In experiments on the SPMRL 2013 dataset of nine MRLS we find our method to give improvements for all languages except French, for which we observe a minor drop in the Parseval score of 0.06. For Hebrew, Hungarian and Basque we find substantial absolute improvements of 5.65, 11.87 and 15.16, respectively. We also performed an extensive evaluation on the utility of word representations for morphological tagging. 
Our goal was to reduce the drop in performance that is caused when a model trained on a specific domain is applied to some other domain. This problem is usually addressed by domain adaptation (DA). DA adapts a model towards a specific domain using a small amount of labeled or a huge amount of unlabeled data from that domain. However, this procedure requires us to train a model for every target domain. Instead we are trying to build a robust system that is trained on domain-specific labeled and domain-independent or general unlabeled data. We believe word representations to be key in the development of such models because they allow us to leverage unlabeled data efficiently. We compare data-driven representations to manually created morphological analyzers. We understand data-driven representations as models that cluster word forms or map them to a vectorial representation. Examples heavily used in the literature include Brown clusters, singular value decompositions of count vectors and neural-network-based embeddings. We create a test suite of six languages consisting of in-domain and out-of-domain test sets. To this end we converted annotations for Spanish and Czech and annotated the German part of the Smultron treebank with a morphological layer. In our experiments on these data sets we find Brown clusters to outperform the other data-driven representations. Regarding the comparison with morphological analyzers, we find Brown clusters to give slightly better performance in part-of-speech tagging, but to be substantially outperformed in morphological tagging.

    Az adverbiumok mondattani és jelentéstani kérdései = The syntax and syntax-semantics interface of adverbial modification

    We investigated morphological, syntactic, and functional questions concerning adverbs and adverbials within the framework of generative linguistic theory, mainly on the basis of Hungarian data. We aimed for a description from which the syntactic behavior, scope, and stress of the various adverbial types all follow. We demonstrated that the various adverbial types can be analyzed as PPs. Regarding the placement of adverbials in the sentence, we argued against the specifier-position analysis (Cinque 1999) and for the adjunction analysis (Ernst 2002). We showed that deriving the word order of adverbials requires assuming both left- and right-adjunction. We explained the word-order position of the various adverbial types through the interplay of syntactic, semantic, and prosodic factors. Among the semantic factors, we examined, for example, the type restriction limiting the incorporability of adverbials, the scalar constraint that triggers obligatory focusing of negative adverbs, and the incompatibility between the complex event structures of certain adverbial types and verb types. One prosodic factor influencing the order of postverbal adverbials is the Law of Growing Constituents. We also observed the conditions that trigger intonation-phrase restructuring and its semantic consequences. We analyzed one type of locative verbal particle as a special phonological realization of movement chains (as the incorporation of the phonologically reduced copy). We published about 60 studies on the topic. Our book Adverbs and Adverbial Adjuncts at the Interfaces (489 pp.) is published by Mouton de Gruyter (Berlin). | This project has aimed to clarify (on the basis of mainly Hungarian data) basic issues concerning the category "adverb", the function "adverbial", and the grammar of adverbial modification. We have argued for the PP analysis of adverbials, and have claimed that they enter the derivation via left- and right-adjunction. Their merge-in position is determined by the interplay of syntactic, semantic, and prosodic factors. 
The semantically motivated constraints discussed also include a type restriction affecting adverbials semantically incorporated into the verbal predicate, an obligatory focus position for scalar adverbs representing negative values of bidirectional scales, cooccurrence restrictions between verbs and adverbials involving incompatible subevents, etc. The order and interpretation of adverbials in the postverbal domain are shown to be affected by such phonologically motivated constraints as the Law of Growing Constituents, and by intonation-phrase restructuring. The shape of the light-headed chain arising in the course of locative PP incorporation is determined by morpho-phonological requirements. The types of adverbs and adverbials analyzed include locatives, temporals, comitatives, epistemic adverbs, adverbs of degree, manner, counting, and frequency, quantificational adverbs, and adverbial participles. We have published about 60 studies; our book Adverbs and Adverbial Adjuncts at the Interfaces (489 pp.) is published in the series Interface Explorations of Mouton de Gruyter, Berlin.

    Acta Cybernetica : Volume 18. Number 2.


    Information structure and the referential status of linguistic expression : workshop as part of the 23th annual meetings of the Deutsche Gesellschaft für Sprachwissenschaft in Leipzig, Leipzig, February 28 - March 2, 2001

    This volume comprises papers that were given at the workshop Information Structure and the Referential Status of Linguistic Expressions, which we organized during the Deutsche Gesellschaft für Sprachwissenschaft (DGfS) Conference in Leipzig in February 2001. At this workshop we discussed the connection between information structure and the referential interpretation of linguistic expressions, a topic mostly neglected in current linguistics research. One common aim of the papers is to find out to what extent the focus-background as well as the topic-comment structuring determine the referential interpretation of simple arguments like definite and indefinite NPs on the one hand and sentences on the other

    Moving-Time and Moving-Ego Metaphors from a Translational and Contrastive-linguistic Perspective

    This article is concerned with some cross-linguistic asymmetries in the use of two types of time metaphors, the Moving-Time and the Moving-Ego metaphor. The latter metaphor appears to be far less well-entrenched in languages such as Croatian or Hungarian, i.e. some of its lexicalizations are less natural than their alternatives based on the Moving-Time metaphor, while some others are, unlike their English models, downright unacceptable. It is argued that some of the differences can be related to the status of the fictive motion construction and some restrictions on the choice of verbs in that construction.

    Word learning in the first year of life

    In the first part of this thesis, we ask whether 4-month-old infants can represent objects and movements after a short exposure in such a way that they recognize either a repeated object or a repeated movement when they are presented simultaneously with a new object or a new movement. If they do, we ask whether the way they observe the visual input is modified when auditory input is presented. We investigate whether infants react to the familiarization labels and to novel labels in the same manner. If the labels as well as the referents are matched for saliency, any difference should be due to processes that are not limited to sensorial perception. We hypothesize that infants will, if they map words to the objects or movements, change their looking behavior whenever they hear a familiar label, a novel label, or no label at all. In the second part of this thesis, we assess the problem of word learning from a different perspective. If infants reason about possible label-referent pairs and are able to make inferences about novel pairs, are the same processes involved in all intermodal learning? We compared the task of learning to associate auditory regularities to visual stimuli (reinforcers), and the word-learning task. We hypothesized that even if infants succeed in learning more than one label during one single event, learning the intermodal connection between auditory and visual regularities might present a more demanding task for them. The third part of this thesis addresses the role of associative learning in word learning. In the last decades, it was repeatedly suggested that co-occurrence probabilities can play an important role in word segmentation. 
However, the vast majority of studies test infants with artificial streams that do not resemble a natural input: most studies use words of equal length and with unambiguous syllable sequences within words, where the only point of variability is at the word boundaries (Aslin et al., 1998; Saffran, Johnson, Aslin, & Newport, 1999; Saffran et al., 1996; Thiessen et al., 2005; Thiessen & Saffran, 2003). Even if the input is modified to resemble the natural input more faithfully, the words with which infants are tested are always unambiguous – within words, each syllable predicts its adjacent syllable with the probability of 1.0 (Pelucchi, Hay, & Saffran, 2009; Thiessen et al., 2005). We therefore tested 6-month-old infants with such statistically ambiguous words. Before doing that, we also verified on a large sample of languages whether statistical information in the natural input, where the majority of the words are statistically ambiguous, is indeed useful for segmenting words. Our motivation was partly due to the fact that studies that modeled the segmentation process with a natural language input often yielded ambivalent results about the usefulness of such computation (Batchelder, 2002; Gambell & Yang, 2006; Swingley, 2005). We conclude this introduction with a small remark about the term word. It will be used throughout this thesis without questioning its descriptive value: the common-sense meaning of the term word is unambiguous enough, since all people know what we are referring to when we say or think of the term word. However, the term word is not unambiguous at all (Di Sciullo & Williams, 1987). To mention only some of the classical examples: (1) Do jump and jumped, or go and went, count as one word or as two? This example might seem all too trivial, especially in languages with weak overt morphology such as English, but in some languages, each basic form of the word has tens of inflected variants. 
(2) A similar question arises with all the words that are morphological derivations of other words, such as evict and eviction, examine and reexamine, unhappy and happily, and so on. (3) And finally, each language contains many phrases and idioms: Do air conditioner and give up count as one word, or two? Statistical word segmentation studies in general neglect the issue of the definition of words, assuming that phrases and idioms have strong internal statistics and will therefore be selected as one word (Cutler, 2012). But because compounds or phrases are usually composed of smaller meaningful chunks, it is unclear how infants would extract these smaller units of speech if they were using predominantly statistical information. We will address the problem of over-segmentations shortly in the third part of the thesis.