956 research outputs found
Extending, trimming and fusing WordNet for technical documents
This paper describes a tool for the automatic
extension and trimming of a multilingual
WordNet database for cross-lingual retrieval
and multilingual ontology building in
intranets and domain-specific document
collections. Hierarchies, built from
automatically extracted terms and combined
with the WordNet relations, are trimmed
with a disambiguation method based on the
document salience of the words in the
glosses. The disambiguation is tested in a
cross-lingual retrieval task, showing
considerable improvement (7%-11%). The
condensed hierarchies can be used as
browse-interfaces to the documents
complementary to retrieval
Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are dened for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of dierent measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy
A Computational Lexicon and Representational Model for Arabic Multiword Expressions
The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations.
This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions.
This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena
Multiword expressions at length and in depth
The annual workshop on multiword expressions takes place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point point of reference for future work
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
separate from the strength of their semantic dependence. E.g. "red tape" might
be overall less frequent than "tape measure" in some corpus, but this does not
mean that "red"+"tape" are less dependent than "tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR
Current trends
Deep parsing is the fundamental process aiming at the representation of the syntactic
structure of phrases and sentences. In the traditional methodology this process is
based on lexicons and grammars representing roughly properties of words and interactions
of words and structures in sentences. Several linguistic frameworks, such as Headdriven
Phrase Structure Grammar (HPSG), Lexical Functional Grammar (LFG), Tree Adjoining
Grammar (TAG), Combinatory Categorial Grammar (CCG), etc., offer different
structures and combining operations for building grammar rules. These already contain
mechanisms for expressing properties of Multiword Expressions (MWE), which, however,
need improvement in how they account for idiosyncrasies of MWEs on the one
hand and their similarities to regular structures on the other hand. This collaborative
book constitutes a survey on various attempts at representing and parsing MWEs in the
context of linguistic theories and applications
- …