53 research outputs found
Multilingual collocation extraction with a syntactic parser
An impressive amount of work was devoted over the past few decades to collocation extraction. The state of the art shows that there is a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; however, the treatment performed is, in most cases, limited (lemmatization, POS-tagging, or shallow parsing). This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result bears a high importance, especially in the perspective of the subsequent integration of extraction results in other NLP application
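The mobile-window baseline mentioned in this abstract can be illustrated with a minimal sketch. It is not the article's system: the function names, the window size, the frequency threshold, and the choice of pointwise mutual information as the association measure are all assumptions made for illustration.

```python
import math
from collections import Counter


def window_pairs(tokens, window=5):
    """Yield every ordered pair (w1, w2) where w2 follows w1 within `window` tokens."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + window]:
            yield (w1, w2)


def rank_by_pmi(tokens, window=5, min_count=2):
    """Rank window co-occurrences by pointwise mutual information (assumed measure)."""
    word_freq = Counter(tokens)
    pair_freq = Counter(window_pairs(tokens, window))
    n_tokens = len(tokens)
    n_pairs = sum(pair_freq.values())
    scored = []
    for (w1, w2), count in pair_freq.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_pairs
        p_w1 = word_freq[w1] / n_tokens
        p_w2 = word_freq[w2] / n_tokens
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    # Highest-scoring candidates first
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A window of this kind pairs words purely by distance, so it also returns pairs with no syntactic relation between them; this is the kind of noise a parser-based extractor is meant to reduce.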
A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus
This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e. it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an ngram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested ngrams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n = 1000
Extracção de palavras compostas por bootstrapping [Extraction of compound words by bootstrapping]
This dissertation proposes a new method that modifies the operation of an existing system for extracting compound words. This system, SENTA, has a flaw, and the new method aims to correct it, thereby extracting compound words that the standard SENTA would not extract. It uses a bootstrapping algorithm to make the SENTA system run recursively, altering the corpus at each iteration
Promoting Flexible Translations in Statistical Machine Translation
While SMT systems can learn to translate multiword expressions (MWEs) from parallel text, they typically have no notion of non-compositionality, and thus overgeneralise translations that are only used in certain contexts. This paper describes a novel approach to measure the flexibility of a phrase pair, i.e. its tendency to occur in many contexts, in contrast to phrase pairs that are only valid in one or a few fixed expressions. The measure learns from the parallel training text, is simple to implement and language independent. We argue that flexible phrase pairs should be preferred over inflexible ones, and present experiments with phrase-based and hierarchical translation models in which we observe performance gains of up to 0.9 BLEU points
The presence, nature and role of formulaic sequences in English advanced learners of French : a longitudinal study
PhD Thesis. The present study is a longitudinal investigation of the presence, nature, and role of
formulaic sequences (FS) in advanced English learners of French. The learners
investigated are in their second year of an undergraduate degree in French at the onset
of the study, and are tested before and after a seven-month stay in France. FS are
defined psycholinguistically as multiword units which present a processing advantage
for a given speaker, either because they are stored whole in his/her mental lexicon
(Wray 2002) or because they are highly automatised.
The construct of FS is particularly relevant to investigate key linguistic issues such as
the dynamism of linguistic representations, their idiosyncratic nature as well as the
relationship between the lexicon and grammar. FS have been shown to be frequent in
the oral productions of native speakers. They also play an important role in first
language acquisition as well as in the initial stages of instructed second language (L2)
acquisition. However, very little is known about their presence and role in advanced L2
learners, as most studies dealing with them have not adopted a psycholinguistic
approach and have focused on L2 learners’ knowledge and use of idioms and idiomatic
expressions.
Conversely, this study seeks to evaluate and characterise the presence of
psycholinguistically-defined FS in advanced learners as well as examine their
longitudinal development in relation to the development of the learners’ fluency and
lexical diversity. It seeks to determine whether FS use can be said to play a role in the
development of fluency and lexical diversity and if it does, describe the underlying
mechanisms that account for this role.
Data from five learners performing five oral tasks (an interview, a story retell and 3
discussion tasks), repeated before and after their stay in France, was elicited and
transcribed. FS were identified through the hierarchical application of a range of criteria
aiming to capture the holistic nature of the sequences. The necessary criterion used for
identification was fluent pronunciation of the sequence, and additional criteria were
applied such as irregularity, holistic mapping of form to meaning or holistic status of the
sequence in the input. Fluency was operationalised through 4 measures (phonation-time
ratio, speaking rate, mean length of runs and articulation rate) and lexical diversity was
measured using D.
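The four fluency measures named above can be sketched as follows. This is a minimal sketch using definitions that are common in the L2 fluency literature, not the thesis's own operationalisation: the function name, the input format, the use of syllables per second, and the assumption that pauses have already been identified are all illustrative.

```python
def fluency_measures(runs, pauses):
    """Compute four temporal fluency measures (assumed, commonly used definitions).

    runs   -- list of (syllables, seconds) for each stretch of speech between pauses
    pauses -- list of pause durations in seconds
    """
    syllables = sum(s for s, _ in runs)
    phonation_time = sum(t for _, t in runs)      # time spent speaking
    total_time = phonation_time + sum(pauses)     # speaking time plus pauses
    return {
        # Share of total time spent speaking
        "phonation_time_ratio": phonation_time / total_time,
        # Syllables per second, pauses included
        "speaking_rate": syllables / total_time,
        # Mean number of syllables produced between pauses
        "mean_length_of_runs": syllables / len(runs),
        # Syllables per second, pauses excluded
        "articulation_rate": syllables / phonation_time,
    }
```

For example, two runs of 10 and 6 syllables lasting 2 seconds each, separated by a 1-second pause, give a phonation-time ratio of 0.8 and a mean length of runs of 8 syllables.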
The results show that psycholinguistically-defined FS represent about 27% of the
language of advanced learners of French. The typology of the identified sequences
shows that they are mostly grammatically regular but that despite the advanced level of
the participants, some present non-nativelike characteristics. Individual differences in
the learners’ repertoires of FS as well as task effects are also found.
Between time 1 and time 2, across the group of 5 subjects, there is a general and
statistically significant increase in FS use, fluency and lexical diversity. Significant
correlations are found between FS use, fluency and lexical diversity. The qualitative
analysis suggests that FS use plays a role in increasing fluency by allowing longer
speech runs, contributing to the reduction of pausing time as well as the speeding up of
the articulation rate. At the internal level of processing mechanisms, the results suggest
that FS play a facilitating role not only in the formulation stage of speech production but
also in the conceptualisation and articulation stages. Significant correlations are also
found between FS use and lexical diversity, which suggests that FS, by lightening the
processing burden and freeing some attentional resources, might facilitate the
acquisition of new vocabulary.
The analysis of the development of the learners across all variables shows a single
developmental path with similar processes of automatisation but with different rates of
acquisition, as the learners vary in how efficient they are at proceduralising their
language. Because of this, it is suggested that the year abroad is more likely to be
beneficial for a given subject if their language has already reached a certain level of
automatisation pre-time abroad. Arts and Humanities Research Council
Multi-word unit processing in machine translation. Developing and using language resources for multi-word unit processing in machine translation
2011 - 2012, XI n.s.
Corpus linguistics: A guide to the methodology
Corpora are widely used in linguistics, but not always wisely. This book attempts to frame corpus linguistics systematically as a variant of the observational method. The first part introduces the reader to the general methodological discussions surrounding corpus data as well as the practice of doing corpus linguistics, including issues such as the scientific research cycle, research design, extraction of corpus data and statistical evaluation. The second part consists of a number of case studies from the main areas of corpus linguistics (lexical associations, morphology, grammar, text and metaphor), surveying the range of issues studied in corpus linguistics while at the same time showing how they fit into the methodology outlined in the first part