2 research outputs found
A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus
This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, ie extracts MWEs that contain the item specified by the user, using a fixed window-size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an ngram that constitutes a MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being a MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested ngrams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n= 1000
The automatic processing of multiword expressions in Irish
It is well-documented that Multiword Expressions (MWEs) pose a unique challenge
to a variety of NLP tasks such as machine translation, parsing, information retrieval,
and more. For low-resource languages such as Irish, these challenges can be exacerbated by the scarcity of data, and a lack of research in this topic. In order to
improve handling of MWEs in various NLP tasks for Irish, this thesis will address
both the lack of resources specifically targeting MWEs in Irish, and examine how
these resources can be applied to said NLP tasks.
We report on the creation and analysis of a number of lexical resources as part
of this PhD research. Ilfhocail, a lexicon of Irish MWEs, is created through extract-
ing MWEs from other lexical resources such as dictionaries. A corpus annotated
with verbal MWEs in Irish is created for the inclusion of Irish in the PARSEME
Shared Task 1.2. Additionally, MWEs were tagged in a bilingual EN-GA corpus
for inclusion in experiments in machine translation. For the purposes of annotation, a categorisation scheme for nine categories of MWEs in Irish is created, based
on combining linguistic analysis on these types of constructions and cross-lingual
frameworks for defining MWEs.
A case study in applying MWEs to NLP tasks is undertaken, with the exploration of incorporating MWE information while training Neural Machine Translation
systems. Finally, the topic of automatic identification of Irish MWEs is explored,
documenting the training of a system capable of automatically identifying Irish
MWEs from a variety of categories, and the challenges associated with developing
such a system.
This research contributes towards a greater understanding of Irish MWEs and
their applications in NLP, and provides a foundation for future work in exploring
other methods for the automatic discovery and identification of Irish MWEs, and
further developing the MWE resources described above