53 research outputs found

    Multilingual collocation extraction with a syntactic parser

    An impressive amount of work was devoted over the past few decades to collocation extraction. The state of the art shows that there is a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result bears a high importance, especially in the perspective of the subsequent integration of extraction results in other NLP application

    A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus

    This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e. it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an ngram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested ngrams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n = 1000

    Extracção de palavras compostas por bootstrapping (Compound-word extraction by bootstrapping)

    This dissertation proposes a new method that modifies the operation of an existing system for extracting compound words. This system, SENTA, has a flaw, and the new method aims to correct it, thereby extracting compound words that the standard SENTA would not extract. The method uses a bootstrapping algorithm to make the SENTA system work recursively, altering the corpus at each iteration

    Promoting Flexible Translations in Statistical Machine Translation

    While SMT systems can learn to translate multiword expressions (MWEs) from parallel text, they typically have no notion of non-compositionality, and thus overgeneralise translations that are only used in certain contexts. This paper describes a novel approach to measure the flexibility of a phrase pair, i.e. its tendency to occur in many contexts, in contrast to phrase pairs that are only valid in one or a few fixed expressions. The measure learns from the parallel training text, is simple to implement and language independent. We argue that flexible phrase pairs should be preferred over inflexible ones, and present experiments with phrase-based and hierarchical translation models in which we observe performance gains of up to 0.9 BLEU points

    Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop

    The presence, nature and role of formulaic sequences in English advanced learners of French : a longitudinal study

    PhD Thesis. The present study is a longitudinal investigation of the presence, nature, and role of formulaic sequences (FS) in advanced English learners of French. The learners investigated are in their second year of an undergraduate degree in French at the onset of the study, and are tested before and after a seven-month stay in France. FS are defined psycholinguistically as multiword units which present a processing advantage for a given speaker, either because they are stored whole in his/her mental lexicon (Wray 2002) or because they are highly automatised. The construct of FS is particularly relevant to investigate key linguistic issues such as the dynamism of linguistic representations, their idiosyncratic nature as well as the relationship between the lexicon and grammar. FS have been shown to be frequent in the oral productions of native speakers. They also play an important role in first language acquisition as well as in the initial stages of instructed second language (L2) acquisition. However, very little is known about their presence and role in advanced L2 learners, as most studies dealing with them have not adopted a psycholinguistic approach and have focused on L2 learners’ knowledge and use of idioms and idiomatic expressions. Conversely, this study seeks to evaluate and characterise the presence of psycholinguistically-defined FS in advanced learners as well as examine their longitudinal development in relation to the development of the learners’ fluency and lexical diversity. It seeks to determine whether FS use can be said to play a role in the development of fluency and lexical diversity and if it does, describe the underlying mechanisms that account for this role. Data from five learners performing five oral tasks (an interview, a story retell and 3 discussion tasks), repeated before and after their stay in France, was elicited and transcribed.
FS were identified through the hierarchical application of a range of criteria aiming to capture the holistic nature of the sequences. The necessary criterion used for identification was fluent pronunciation of the sequence, and additional criteria were applied such as irregularity, holistic mapping of form to meaning or holistic status of the sequence in the input. Fluency was operationalised through 4 measures (phonation-time ratio, speaking rate, mean length of runs and articulation rate) and lexical diversity was measured using D. The results show that psycholinguistically-defined FS represent about 27% of the language of advanced learners of French. The typology of the identified sequences shows that they are mostly grammatically regular but that despite the advanced level of the participants, some present non-nativelike characteristics. Individual differences in the learners’ repertoires of FS as well as task effects are also found. Between time 1 and time 2, across the group of 5 subjects, there is a general and statistically significant increase in FS use, fluency and lexical diversity. Significant correlations are found between FS use, fluency and lexical diversity. The qualitative analysis suggests that FS use plays a role in increasing fluency by allowing longer speech runs, contributing to the reduction of pausing time as well as the speeding up of the articulation rate. At the internal level of processing mechanisms, the results suggest that FS play a facilitating role not only in the formulation stage of speech production but also in the conceptualisation and articulation stages. Significant correlations are also found between FS use and lexical diversity, which suggests that FS, by lightening the processing burden and freeing some attentional resources, might facilitate the acquisition of new vocabulary. 
The analysis of the development of the learners across all variables shows a single developmental path with similar processes of automatisation but with different rates of acquisition, as the learners vary in how efficient they are at proceduralising their language. Because of this, it is suggested that the year abroad is more likely to be beneficial for a given subject if their language has already reached a certain level of automatisation pre-time abroad. Arts and Humanities Research Counci

    Multi-word unit processing in machine translation. Developing and using language resources for multi-word unit processing in machine translation

    2011 - 2012XI n.s

    Corpus linguistics: A guide to the methodology

    Corpora are widely used in linguistics, but not always wisely. This book attempts to frame corpus linguistics systematically as a variant of the observational method. The first part introduces the reader to the general methodological discussions surrounding corpus data as well as the practice of doing corpus linguistics, including issues such as the scientific research cycle, research design, extraction of corpus data and statistical evaluation. The second part consists of a number of case studies from the main areas of corpus linguistics (lexical associations, morphology, grammar, text and metaphor), surveying the range of issues studied in corpus linguistics while at the same time showing how they fit into the methodology outlined in the first part