
    Weakly supervised POS tagging without disambiguation

    Weakly supervised part-of-speech (POS) tagging learns to predict the POS tag of a given word in context from partially annotated data rather than fully tagged corpora. It would benefit natural language processing applications in languages where tagged corpora are largely unavailable. In this article, we propose a novel framework for weakly supervised POS tagging based on a dictionary of words with their possible POS tags. In the constrained error-correcting output codes (ECOC)-based approach, a unique L-bit vector is assigned to each POS tag. The set of bit vectors is referred to as a coding matrix with values {+1, -1}. Each column of the coding matrix specifies a dichotomy over the tag space, from which a binary classifier is learned. The training data for each binary classifier is generated as follows: a pair of a word and its set of possible POS tags is taken as a positive training example only if the whole set of possible tags falls into the positive dichotomy specified by the column coding, and analogously for negative training examples. Given a word in context, its POS tag is predicted by concatenating the predictive outputs of the L binary classifiers and choosing the tag whose codeword is closest according to some distance measure. By incorporating the ECOC strategy, the set of all possible tags for each word is treated as a whole, with no need to perform disambiguation. Moreover, instead of the manual feature engineering employed in most previous POS tagging approaches, features for training and testing in the proposed framework are generated automatically using neural language modeling. The proposed framework has been evaluated on three corpora for English, Italian, and Malagasy POS tagging, achieving accuracies of 93.21%, 90.9%, and 84.5%, respectively, a significant improvement over state-of-the-art approaches.
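The ECOC decoding step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the coding matrix, tag set, and function names are all assumptions.

```python
# Sketch of ECOC decoding for POS tagging: each tag has a unique L-bit
# codeword in {+1, -1}; a word's tag is the codeword closest (here, in
# Hamming distance) to the concatenated binary-classifier outputs.
import numpy as np

def nearest_tag(outputs, coding_matrix, tags):
    """outputs: length-L vector of {+1, -1} classifier predictions.
    coding_matrix: (num_tags, L) array with entries in {+1, -1}.
    tags: list of tag names, one per matrix row."""
    distances = np.sum(coding_matrix != np.asarray(outputs), axis=1)
    return tags[int(np.argmin(distances))]

# Hypothetical 3-bit coding matrix over a toy tag set.
coding = np.array([[ 1,  1, -1],
                   [ 1, -1,  1],
                   [-1,  1,  1]])
tags = ["NOUN", "VERB", "ADJ"]
print(nearest_tag([1, -1, 1], coding, tags))  # exact match for row 2 → VERB
```

In practice one classifier output per column is produced for the word in context, and the distance measure can be weighted rather than plain Hamming.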

    Adaptive scheduling for adaptive sampling in POS taggers construction

    We introduce adaptive scheduling for adaptive sampling as a novel machine learning strategy in the construction of part-of-speech taggers. The goal is to speed up training on large data sets without significant loss of performance relative to an optimal configuration. In contrast to previous methods, which use a random, fixed, or regularly rising spacing between instances, ours analyzes the shape of the learning curve geometrically, in conjunction with a functional model, to increase or decrease the spacing at any time. The algorithm proves to be formally correct under our working hypotheses: given an instance, the next one selected is the nearest that ensures a net gain in learning ability over the former, and the level of requirement for this condition can be modulated. We also improve the robustness of sampling by paying greater attention to those regions of the training database subject to a temporary inflation in performance, thus preventing learning from stopping prematurely. The proposal has been evaluated on the basis of its reliability in identifying the convergence of models, corroborating our expectations. While a concrete halting condition is used for testing, users can choose any condition whatsoever to suit their specific needs.
    Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-1-R; Agencia Estatal de Investigación | Ref. TIN2017-85160-C2-2-R; Xunta de Galicia | Ref. ED431C 2018/50; Xunta de Galicia | Ref. ED431D 2017/1
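The core idea of adapting the spacing from the learning curve's local shape can be illustrated with a toy sketch. This is my own simplification, not the authors' algorithm: a single slope threshold stands in for their geometric and functional analysis, and all names and constants are hypothetical.

```python
# Toy sketch of slope-driven spacing: take larger steps through the
# training data where the learning curve has flattened, smaller steps
# where accuracy is still climbing.
def next_sample_size(sizes, accs, step, min_step=250, max_step=8000,
                     threshold=1e-5):
    """sizes/accs: history of (training size, accuracy) observed so far.
    Returns the next training-set size to evaluate and the updated step."""
    if len(sizes) >= 2:
        slope = (accs[-1] - accs[-2]) / (sizes[-1] - sizes[-2])
        if slope < threshold:              # curve flat: accelerate
            step = min(step * 2, max_step)
        else:                              # curve still rising: slow down
            step = max(step // 2, min_step)
    return sizes[-1] + step, step

# Rising curve: spacing shrinks; flat curve: spacing grows.
print(next_sample_size([1000, 2000], [0.80, 0.90], 1000))      # (2500, 500)
print(next_sample_size([1000, 2000], [0.90, 0.9000001], 1000)) # (4000, 2000)
```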

    Modeling of learning curves with applications to POS tagging

    An algorithm is introduced that estimates the evolution of learning curves over an entire training database from the results obtained on a portion of it, using a functional strategy. We iteratively approximate the sought value at the desired time, independently of the learning technique used, once a point in the process called the prediction level has been passed. The proposal proves to be formally correct with respect to our working hypotheses and includes a reliable proximity condition. This allows the user to fix a convergence threshold with respect to the accuracy finally achievable, which extends the concept of a stopping criterion and appears effective even in the presence of distorting observations. Our aim is to evaluate the training effort, supporting decision making in order to reduce the need for both human and computational resources during the learning process. The proposal is of interest in at least three operational procedures. The first is the anticipation of accuracy gain, with the purpose of measuring how much work is needed to achieve a certain degree of performance. The second concerns the comparison of efficiency between systems at training time, with the objective of completing this task only for the one that best suits our requirements. The third, the prediction of accuracy, is also a valuable item of information for customizing systems, since we can estimate in advance the impact of settings on both performance and development costs. Using the generation of part-of-speech taggers as an example application, the experimental results are consistent with our expectations.
    Ministerio de Economía y Competitividad | Ref. FFI2014-51978-C2-1-
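As an illustration of the general idea (not the paper's functional model, which is not specified in the abstract), one can fit an inverse-power curve acc(n) ≈ a - b·n^(-c) to early observations and extrapolate. The fixed exponent, the data, and the function names below are assumptions.

```python
# Hedged sketch: fit acc(n) = a - b * n**(-c) with c held fixed, by linear
# least squares in x = n**(-c), then extrapolate accuracy at a larger size.
import numpy as np

def fit_learning_curve(sizes, accs, c=0.5):
    x = np.asarray(sizes, dtype=float) ** (-c)
    A = np.column_stack([np.ones_like(x), -x])    # acc = a*1 + b*(-x)
    (a, b), *_ = np.linalg.lstsq(A, np.asarray(accs, dtype=float), rcond=None)
    return a, b

def predict(n, a, b, c=0.5):
    return a - b * n ** (-c)

sizes = [1000, 2000, 4000, 8000]    # hypothetical partial training results
accs = [0.80, 0.86, 0.90, 0.93]
a, b = fit_learning_curve(sizes, accs)
print(round(predict(100_000, a, b), 3))   # extrapolated accuracy estimate
```

Here `a` plays the role of the asymptotically achievable accuracy, which is what a convergence threshold in the paper's sense would be set against.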

    Unsupervised part-of-speech tagging employing efficient graph clustering

    An unsupervised part-of-speech (POS) tagging system that relies on graph clustering methods is described. Unlike current state-of-the-art approaches, the kind and number of different tags are determined by the method itself. We compute and merge two partitionings of word graphs: one based on context similarity of high-frequency words, the other on log-likelihood statistics for words of lower frequency. Using the resulting word clusters as a lexicon, a Viterbi POS tagger is trained and then refined by a morphological component. The approach is evaluated on three different languages by measuring agreement with existing taggers.
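An efficient graph-clustering step of the kind the abstract refers to can be sketched with a Chinese-Whispers-style label propagation, which notably does not fix the number of clusters in advance. The graph, weights, and names here are toy assumptions, not the paper's data or exact algorithm.

```python
# Toy sketch: each word repeatedly adopts the highest-weighted cluster
# label among its graph neighbours; word classes emerge without the
# number of tags being specified beforehand.
import random
from collections import defaultdict

def cluster_words(edges, iterations=10, seed=0):
    """edges: dict word -> {neighbouring word: similarity weight}."""
    rng = random.Random(seed)
    labels = {w: w for w in edges}    # every word starts as its own cluster
    nodes = list(edges)
    for _ in range(iterations):
        rng.shuffle(nodes)
        for w in nodes:
            scores = defaultdict(float)
            for v, weight in edges[w].items():
                scores[labels[v]] += weight
            if scores:
                labels[w] = max(scores, key=scores.get)
    return labels

# Two tight context-similarity cliques: determiner-like vs noun-like words.
graph = {
    "the": {"a": 1.0, "this": 1.0}, "a": {"the": 1.0, "this": 1.0},
    "this": {"the": 1.0, "a": 1.0},
    "cat": {"dog": 1.0, "bird": 1.0}, "dog": {"cat": 1.0, "bird": 1.0},
    "bird": {"cat": 1.0, "dog": 1.0},
}
labels = cluster_words(graph)
print(labels)
```

The resulting clusters would then serve as the lexicon from which the Viterbi tagger is trained.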

    Statistical language models for alternative sequence selection


    An investigation into deviant morphology: issues in the implementation of a deep grammar for Indonesian

    This thesis investigates deviant morphology in Indonesian for the implementation of a deep grammar. In particular, we focus on the implementation of the verbal suffix -kan. This suffix has been described as having many functions, which alter the kinds and number of arguments the verb takes (Dardjowidjojo 1971; Chung 1976; Arka 1993; Vamarasi 1999; Kroeger 2007; Son and Cole 2008). Deep grammars, or precision grammars (Butt et al. 1999a; Butt et al. 2003; Bender et al. 2011), have been shown to be useful for natural language processing (NLP) tasks such as machine translation and generation (Oepen et al. 2004; Cahill and Riester 2009; Graham 2011) and information extraction (MacKinlay et al. 2012), demonstrating the need for linguistically rich information to aid NLP tasks. Although these linguistically motivated grammars are invaluable resources to the NLP community, their biggest drawback is the time required for the manual creation and curation of the lexicon. Our work aims to expedite this process by applying methods to assign syntactic information to kan-affixed verbs automatically. The method we employ exploits the hypothesis that semantic similarity is tightly connected with syntactic behaviour (Levin 1993). Our endeavour to automatically acquire verbal information for an Indonesian deep grammar poses a number of linguistic challenges. First of all, Indonesian verbs exhibit voice marking that is characteristic of the subgrouping of its language family. In order to characterise verbal behaviour in Indonesian, we first need to devise a detailed analysis of voice for implementation. Another challenge we face is the claim that open-class words in Indonesian, at least as it is spoken in some varieties (Gil 1994; Gil 2010), cannot linguistically be analysed as distinct from each other. That is, there is no distinction between nouns, verbs, or adjectives in Indonesian, and all words from the open-class categories should be analysed uniformly. This poses difficulties for implementing a grammar in a linguistically motivated way, as well as for discovering the syntactic behaviour of verbs, if verbs cannot be distinguished from nouns. As part of our investigation we conduct experiments to verify the need for word class categories, and we find that these are indeed linguistically motivated labels in Indonesian. Through our investigation into deviant morphological behaviour, we gain a better characterisation of the morphosyntactic effects of -kan, and we discover that, although Indonesian has been labelled a language with no open word class distinctions, word classes can be established as linguistically motivated.

    The Distributional Learning of Multi-Word Expressions: A Computational Approach

    There has been much recent research in corpus and computational linguistics on distributional learning algorithms—computer code that induces latent linguistic structures in corpus data based on co-occurrences of transcribed units in that data. These algorithms have varied applications, from the investigation of human cognitive processes to the corpus extraction of relevant linguistic structures for lexicographic, second language learning, or natural language processing applications, among others. They also operate at various levels of linguistic structure, from phonetics to syntax. One area of research on distributional learning algorithms in which there remains relatively little work is the learning of multi-word, memorized, formulaic sequences, based on the co-occurrences of words. Examples of such multi-word expressions (MWEs) include kick the bucket, New York City, sit down, and as a matter of fact. In this dissertation, I present a novel computational approach to the distributional learning of such sequences in corpora. Entitled MERGE (Multi-word Expressions from the Recursive Grouping of Elements), my algorithm iteratively works by (1) assigning a statistical ‘attraction’ score to each two-word sequence (bigram) in a corpus, based on the individual and co-occurrence frequencies of these two words in that corpus; and (2) merging the highest-scoring bigram into a single, lexicalized unit. These two steps then repeat until some maximum number of iterations or minimum score threshold is reached (since, broadly speaking, the winning score progressively decreases with increasing iterations). Because one (or both) of the ‘words’ making up a winning bigram may be an output merged item from a previous iteration, the algorithm is able to learn MWEs that are in principle of any length (e.g., apple pie versus I’ll believe it when I see it). 
    Moreover, these MWEs may contain one or more discontinuities of different sizes, up to some maximum size threshold (measured in words) specified by the user (e.g., as _ as in as tall as and as big as). Typically, the extraction of MWEs has been handled by algorithms that identify only continuous sequences and require the user to specify the length(s) of the sequences to be extracted beforehand; MERGE thus offers a bottom-up, distribution-based approach that addresses these issues.
    In the present dissertation, in addition to describing the algorithm, I report three rating experiments and one corpus-based early child language study that validate the efficacy of MERGE in identifying MWEs. In one experiment, participants rate sequences extracted from a corpus by the algorithm for how well they instantiate true MWEs. As expected, the results reveal that the high-scoring output items that MERGE identifies early in its iterative process are rated as ‘good’ MWEs by participants (based on certain subjective criteria), with the quality of these ratings decreasing for output from later iterations (i.e., output items that were scored lower by the algorithm). In the other two experiments, participants rate high-ranking output both from MERGE and from an existing algorithm from the literature that also learns MWEs of various lengths: the Adjusted Frequency List (Brook O’Donnell 2011). Comparison of participant ratings reveals that the items MERGE acquires are rated more highly than those acquired by the Adjusted Frequency List, suggesting that MERGE is a performance frontrunner among distributional learning algorithms for MWEs.
    More broadly, together the experiments suggest that MERGE acquires representations that are compatible with adult knowledge of formulaic language, and thus may be useful for any number of research applications that rely on such formulaic language as a unit of analysis.
    Finally, in a study using two corpora of caregiver-child interactions, I run MERGE on caregiver utterances and then show that, of the MWEs induced by the algorithm, those that go on to be acquired by the children receive higher scores from the algorithm than those that do not. These results suggest that, when applied to acquisition data, the algorithm is useful for identifying the structures of statistical co-occurrence in the caregiver input that are relevant to children in their acquisition of early multi-word knowledge.
    Overall, MERGE is shown to be a powerful computational approach to the distributional learning and extraction of MWEs, both when modeling adult knowledge of formulaic language and when accounting for the early multi-word structures acquired by children.
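The two-step loop described in the abstract can be sketched in a simplified form: contiguous bigrams only, no discontinuities, and a frequency-weighted PMI standing in for whatever attraction statistic the dissertation actually uses. All names here are illustrative.

```python
# Simplified sketch of MERGE's iterative loop: score every adjacent word
# pair, merge the highest-scoring pair into one lexicalized token, repeat.
# A merged token can itself be half of a later winning bigram, so MWEs of
# any length can emerge.
import math
from collections import Counter

def best_bigram(tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    def attraction(pair):
        w1, w2 = pair
        count = bigrams[pair]
        # Frequency-weighted pointwise mutual information (a stand-in score).
        return count * math.log(count * n / (unigrams[w1] * unigrams[w2]))
    return max(bigrams, key=attraction)

def merge_pass(tokens, iterations=1):
    for _ in range(iterations):
        w1, w2 = best_bigram(tokens)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (w1, w2):
                merged.append(w1 + "_" + w2)    # winner becomes a single unit
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

corpus = "new york city is in new york state".split()
print(merge_pass(corpus))  # ['new_york', 'city', 'is', 'in', 'new_york', 'state']
```

A later iteration could then merge the lexicalized `new_york` with `city`, illustrating how longer MWEs are built recursively from earlier winners.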

    Quantitative Methods in Language Typology: Using Corpus-Based Statistics

    This thesis examines various aspects of the use of corpus-based statistics in quantitative typological studies. Its individual sections can be viewed as stages of a language-independent processing pipeline that permits comprehensive investigations across the world's languages. The steps considered range from the automated creation of the basic resources, through mathematically grounded methods, to the final results of the various typological analyses. The investigation first focuses on the text corpora underlying the analysis, in particular their acquisition and processing from a technical point of view. This is followed by treatments of the use of the corpora for lexical language comparison, in which a quantification of linguistic relationships is achieved by empirical means. Beyond that, the corpora serve as the basis for automated measurements of linguistic parameters: such measurable properties are presented, and their usability for typological studies is examined systematically. Finally, the relationships of these measurements to one another and to typological parameters are investigated using quantitative methods.