A Computational Lexicon and Representational Model for Arabic Multiword Expressions
The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations.
This research aims to remedy this deficiency by extending knowledge of AMWEs and contributing to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that describes the adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for the AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods to discover multiple types of AMWEs. Third, the thesis presents a representational system for AMWEs consisting of multilayer encodings of extensive linguistic descriptions.
This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics, offering new insights into this complicated phenomenon in Standard Arabic. The implications of this research concern the vital role of the AMWE lexicon, as a new lexical resource, in improving various ANLP tasks, and the opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.
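The abstract does not spell out the extraction method in detail. Purely as an illustration of the data-driven side of such hybrid pipelines, the sketch below scores adjacent word pairs by pointwise mutual information (PMI) to propose MWE candidates; a knowledge-based step (POS patterns, lexicon lookup) would then filter them. The function name and threshold are assumptions, not the thesis's actual procedure.

```python
import math
from collections import Counter
from typing import Iterable

def pmi_bigram_candidates(sentences: Iterable[list[str]], min_count: int = 5):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    High-PMI pairs are candidate multiword expressions; in a hybrid pipeline
    a knowledge-based filter would normally be applied afterwards.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_pair / (p1 * p2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

This is a minimal sketch of one standard association measure; comparable pipelines typically combine several measures with morphosyntactic patterns before candidates enter a lexicon.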
Wide-coverage parsing for Turkish
Wide-coverage parsing attracts much attention in natural language processing research because it is the first step towards many other applications in natural language understanding, such as question answering. Supervised learning from human-labelled data is currently the best-performing method, so there is great demand for annotated data. However, human annotation is very expensive, and the amount of annotated data is always much less than is needed to train well-performing parsers. This motivates making the best use of the data available. Turkish presents a challenge both because syntactically annotated Turkish data is relatively scarce and because Turkish is highly agglutinative, hence unusually sparse at the whole-word level.
METU-Sabancı Treebank is a dependency treebank of 5620 sentences with surface dependency relations and morphological analyses for words. We show that including even the crudest forms of morphological information extracted from the data boosts the performance of both generative and discriminative parsers, contrary to received opinion concerning English.
We induce word-based and morpheme-based CCG grammars from the Turkish dependency treebank. We use these grammars to train a state-of-the-art CCG parser that predicts long-distance dependencies in addition to those that other parsers are capable of predicting. We also use the correct CCG categories as simple features in a graph-based dependency parser and show that this improves the parsing results.
We show that a morpheme-based CCG lexicon for Turkish is able to solve many problems, such as conflicts of semantic scope, recovering long-range dependencies, and obtaining smoother statistics from the models. CCG handles linguistic phenomena such as local and long-range dependencies more naturally and effectively than other linguistic theories, while potentially supporting semantic interpretation in parallel. Using morphological information and a morpheme-cluster-based lexicon improves performance for Turkish both quantitatively and qualitatively.
We also provide an improved version of the treebank, which will be released by kind permission of METU and Sabancı.
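The abstract reports that even crude morphological information helps Turkish parsing but does not list the features. As an illustration only, the sketch below builds word-final character n-gram features as a rough stand-in for morphological analysis; the feature templates and example are assumptions, not the ones used in the thesis.

```python
def crude_morph_features(token: str, suffix_lens=(2, 3, 4)) -> dict[str, str]:
    """A crude, language-agnostic stand-in for morphological information:
    word-final character n-grams, which in an agglutinative language like
    Turkish often correlate with case and agreement suffixes.

    Illustrative feature templates only, not the thesis's feature set.
    """
    feats = {"form": token.lower()}
    for n in suffix_lens:
        if len(token) > n:
            feats[f"suffix{n}"] = token[-n:].lower()
    return feats

# Example: features for one Turkish word, "evlerinde" ("in their houses")
print(crude_morph_features("evlerinde"))
# {'form': 'evlerinde', 'suffix2': 'de', 'suffix3': 'nde', 'suffix4': 'inde'}
```

Features like these can be appended to the feature vectors of a generative or discriminative parser; richer alternatives would use the treebank's gold morphological analyses directly.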
Person-based Prominence in Ojibwe
This dissertation develops a formal and psycholinguistic theory of person-based prominence effects, the finding that certain categories of person such as first and second (the local persons) are privileged by the grammar. The thesis takes on three questions: (i) What are the possible categories related to person? (ii) What are the possible prominence relationships between these categories? And (iii) how is prominence information used to parse and interpret linguistic input in real time?
The empirical through-line is understanding obviation, a "spotlighting" system, found most prominently in the Algonquian family of languages, that splits the (animate) third persons into two categories: proximate, the person who is in the spotlight, and obviative, the persons who are introduced into the discourse but are not in the spotlight. I provide a semantics for the feature [proximate], and detail a lattice-based theory of feature composition to derive the categories related to obviation in Border Lakes Ojibwe and beyond. This leads to insights about the syntactic and semantic relationships between person, animacy-based noun classification, number, and obviation.
The novel contribution to the theory of person-based prominence effects is to decompose person features into sets of primitives. This proposal allows the stipulated entailment relationships between categories and features, as encoded in prominence hierarchies and feature geometries, to be derived from the first principles of set theory. I further motivate the account by showing that it has increased empirical coverage, and apply it to capture patterns of agreement and word order in Border Lakes Ojibwe.
Finally, I present a psycholinguistic study on how obviation is used to process filler-gap dependencies in Border Lakes Ojibwe. I show that obviation, and by extension prominence information more generally, is used immediately to predictively encode movement chains, prior to bottom-up information from voice marking about the argument structure of the clause. I argue for a modular and syntax-first model of parsing, revising the Active Filler Strategy to be guided by pressures to minimize syntactic distance and maximize the expected well-formedness of each link in the chain. These pressures compete, accounting for effects of prediction, integration, and reanalysis in long-distance dependency formation.
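To make the set-theoretic idea concrete: if each person category is a set of primitive features, entailment between categories reduces to the subset relation, with no stipulated hierarchy needed. The sketch below illustrates this with a hypothetical primitive inventory; the actual labels and feature sets in the dissertation may differ.

```python
# Hypothetical primitive inventory for illustration only; the dissertation's
# actual primitives and category definitions may differ.
PERSONS = {
    "1":      frozenset({"participant", "speaker"}),
    "2":      frozenset({"participant", "addressee"}),
    "3-prox": frozenset({"proximate"}),
    "3-obv":  frozenset(),  # unmarked: the empty set of primitives
}

def entails(a: str, b: str) -> bool:
    """Category a entails category b iff b's primitives are a subset of a's."""
    return PERSONS[b] <= PERSONS[a]

# First person entails any category built only from "participant",
# so local persons come out as more specified than third persons.
assert PERSONS["1"] >= frozenset({"participant"})
assert entails("1", "3-obv")  # the unmarked category is entailed by everything
```

On this kind of encoding, prominence relations fall out of relative specification (set size and containment) rather than being stipulated as a separate hierarchy.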
Constraints on Language Learning: behavioral and neurocognitive studies with adults and children
This thesis will contribute to a body of experimental work addressing the question of whether language learning plays a role in certain fundamental design properties of natural languages. Methodologically, this thesis seeks to extend the artificial language learning paradigm, investigating whether learners are sensitive to the constraints embodied by key properties of languages. For example, we will explore whether communicative pressure influences the final outcome of language learning, namely how the structures that are acquired by individuals are transmitted to downstream generations.
We will also explore how basic language-learning constraints operate in different age groups and, importantly, cross-linguistically. In addition to the behavioral experiments focusing on learning and its outcomes, we will examine preliminary electrophysiological correlates of basic compositional processing in the early stages of learning a miniature artificial language, using electroencephalography (EEG). In this general introduction, I will briefly discuss some of the relevant concepts and methods used in the three studies that constitute this thesis.
Towards More Human-Like Text Summarization: Story Abstraction Using Discourse Structure and Semantic Information.
With the massive amount of textual data being produced every day, the ability to effectively summarise text documents is becoming increasingly important. Automatic text summarization entails the selection and generalisation of the most salient points of a text in order to produce a summary. Approaches to automatic text summarization fall into one of two categories: abstractive or extractive. Extractive approaches involve the selection and concatenation of spans of text from a given document. Research in automatic text summarization began with extractive approaches, scoring and selecting sentences based on the frequency and proximity of words. In contrast, abstractive approaches are based on a process of interpretation, semantic representation, and generalisation. This is closer to the processes that psycholinguistics tells us humans perform when reading, remembering, and summarizing. However, in the sixty years since its inception, the field has largely remained focused on extractive approaches.
This thesis aims to answer the following questions. Does knowledge about the discourse structure of a text aid the recognition of summary-worthy content? If so, which specific aspects of discourse structure provide the greatest benefit? Can this structural information be used to produce abstractive summaries, and are these more informative than extractive summaries? To thoroughly examine these questions, they are each considered in isolation, and as a whole, on the basis of both manual and automatic annotations of texts. Manual annotations facilitate an investigation into the upper bounds of what can be achieved by the approach described in this thesis. Results based on automatic annotations show how this same approach is impacted by the current performance of imperfect preprocessing steps, and indicate its feasibility.
Extractive approaches to summarization are intrinsically limited by the surface text of the input document, in terms of both content selection and summary generation. Beginning with a motivation for moving away from these commonly used methods of producing summaries, I set out my methodology for a more human-like approach to automatic summarization which examines the benefits of using discourse-structural information. The potential benefit of this is twofold: moving away from a reliance on the wording of a text in order to detect important content, and generating concise summaries that are independent of the input text. The importance of discourse structure in signalling key textual material has previously been recognised; however, it has seen little applied use in the field of automatic summarization. A consideration of evaluation metrics also features significantly in the proposed methodology. These play a role both in preprocessing steps and in the evaluation of the final summary product. I provide evidence which indicates a disparity between the performance of coreference resolution systems as indicated by their standard evaluation metrics, and their performance in extrinsic tasks. Additionally, I point out a range of problems for the most commonly used metric, ROUGE, and suggest that at present summary evaluation should not be automated.
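One well-known property underlying such criticisms is that ROUGE scores only surface n-gram overlap with reference summaries. The minimal ROUGE-1 recall sketch below illustrates this; it is a simplified version of the standard metric, not the thesis's evaluation code, and the example sentences are invented.

```python
from collections import Counter

def rouge_1_recall(candidate: str, reference: str) -> float:
    """Minimal ROUGE-1 recall: clipped unigram overlap divided by the number
    of unigrams in the reference summary.

    Because the score rewards only surface overlap, a faithful abstractive
    summary phrased in different words can still score poorly.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_1_recall("the hero defeats the villain",
                     "the protagonist overcomes the antagonist"))
# 0.4, despite the two summaries expressing essentially the same content
```

Full ROUGE also includes bigram and longest-common-subsequence variants, but they share the same reliance on lexical overlap.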
To illustrate the general solutions proposed to the questions raised in this thesis, I use Russian Folk Tales as an example domain. This genre of text has been studied in depth and, most importantly, it has a rich narrative structure that has been recorded in detail. The rules of this formalism are suitable for the narrative structure reasoning system presented as part of this thesis. The specific discourse-structural elements considered cover the narrative structure of a text, coreference information, and the story-roles fulfilled by different characters.
The proposed narrative structure reasoning system produces high-level interpretations of a text according to the rules of a given formalism. For the example domain of Russian Folktales, a system is implemented which constructs such interpretations of a tale according to an existing set of rules and restrictions. I discuss how this process of detecting narrative structure can be transferred to other genres, and a key factor in the success of this process: how constrained the rules of the formalism are. The system enumerates all possible interpretations according to a set of constraints, meaning a less restricted rule set leads to a greater number of interpretations.
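The relationship between rule restrictiveness and the number of interpretations can be illustrated with a highly simplified enumeration: each sentence is assigned one narrative function, and an assignment counts as an interpretation only if it satisfies every ordering constraint. The function labels and constraints below are invented for illustration and are not the formalism used in the thesis.

```python
from itertools import permutations

# Invented narrative functions and ordering constraints, for illustration only.
FUNCTIONS = ["villainy", "departure", "struggle", "victory", "return"]
ORDER = {("villainy", "struggle"), ("struggle", "victory"), ("victory", "return")}

def consistent(assignment) -> bool:
    """Check that every constrained pair whose members both occur
    appears in the required relative order."""
    pos = {func: i for i, func in enumerate(assignment)}
    return all(pos[a] < pos[b] for a, b in ORDER if a in pos and b in pos)

def interpretations(n_sentences: int):
    """Enumerate all consistent assignments of distinct functions to sentences."""
    return [a for a in permutations(FUNCTIONS, n_sentences) if consistent(a)]

# Removing constraints from ORDER admits more interpretations,
# mirroring the point about less restricted rule sets made above.
print(len(interpretations(3)))
```

A real system would of course ground the candidate functions in textual evidence rather than enumerating freely, but the combinatorial effect of loosening the rule set is the same.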
For the example domain, sentence-level discourse-structural annotations are then used to predict summary-worthy content. The results of this study are analysed in three parts. First, I examine the relative utility of individual discourse features and provide a qualitative discussion of these results. Second, the predictive abilities of these features are compared when they are manually annotated and when they are annotated with varying degrees of automation. Third, these results are compared to the predictive capabilities of classic extractive algorithms. I show that discourse features can be used to predict summary-worthy content more accurately than classic extractive algorithms. This holds true for automatically obtained annotations, but with a much clearer difference when using manual annotations.
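As an illustration of how sentence-level discourse features could feed a summary-worthiness predictor, the sketch below trains a simple classifier over feature dictionaries. The feature names, toy data, and choice of scikit-learn logistic regression are assumptions made for the example; the thesis's actual features and classifier may differ.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical sentence-level discourse features of the kind discussed above
# (narrative function, coreference chain length, character story-role).
train_features = [
    {"narrative_function": "villainy", "coref_chain_len": 4, "role": "villain"},
    {"narrative_function": "departure", "coref_chain_len": 1, "role": "hero"},
    {"narrative_function": "struggle", "coref_chain_len": 5, "role": "hero"},
]
train_labels = [1, 0, 1]  # 1 = summary-worthy sentence

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(train_features, train_labels)

new_sentence = {"narrative_function": "victory", "coref_chain_len": 3, "role": "hero"}
print(model.predict_proba([new_sentence])[0][1])  # probability of being summary-worthy
```

Swapping manually annotated features for automatically derived ones in such a setup is what exposes the sensitivity to preprocessing quality described in the surrounding paragraphs.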
The classifiers learned in the prediction of summary-worthy sentences are subsequently used to inform the production of both extractive and abstractive summaries to a given length. A human-based evaluation is used to compare these summaries, as well as the outputs of a classic extractive summarizer. I analyse the impact of knowledge about discourse structure, obtained both manually and automatically, on summary production. This allows for some insight into the knock-on effects on summary production that can occur from inaccurate discourse information (narrative structure and coreference information). My analyses show that even given inaccurate discourse information, the resulting abstractive summaries are considered more informative than their extractive counterparts. With human-level knowledge about discourse structure, these results are even clearer.
In conclusion, this research provides a framework which can be used to detect the narrative structure of a text, and shows its potential to provide a more human-like approach to automatic summarization. I show the limit of what is achievable with this approach both when manual annotations are obtainable, and when only automatic annotations are feasible. Nevertheless, this thesis supports the suggestion that the future of summarization lies with abstractive and not extractive techniques.