Treebanking user-generated content: A proposal for a unified representation in universal dependencies
The paper presents a discussion on the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given on the one hand the increasing number of treebanks featuring user-generated content, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.
Transition-based combinatory categorial grammar parsing for English and Hindi
Given a natural language sentence, parsing is the task of assigning it a grammatical
structure, according to the rules within a particular grammar formalism. Different
grammar formalisms like Dependency Grammar, Phrase Structure Grammar, Combinatory
Categorial Grammar, Tree Adjoining Grammar are explored in the literature for
parsing. For example, given a sentence like “John ate an apple”, parsers based on the
widely used dependency grammars find grammatical relations, such as that ‘John’ is
the subject and ‘apple’ is the object of the action ‘ate’. We mainly focus on Combinatory
Categorial Grammar (CCG) in this thesis.
In this thesis, we present an incremental algorithm for parsing CCG for two diverse
languages: English and Hindi. English is a fixed word order, SVO (Subject-Verb-Object),
and morphologically simple language, whereas Hindi, though predominantly
an SOV (Subject-Object-Verb) language, is a free word order and morphologically rich
language. Developing an incremental parser for Hindi is particularly challenging since the
predicate needed to resolve dependencies comes at the end. As previously available
shift-reduce CCG parsers use English CCGbank derivations which are mostly right
branching and non-incremental, we design our algorithm based on the dependencies
resolved rather than the derivation. Our novel algorithm builds a dependency graph in
parallel to the CCG derivation which is used for revealing the unbuilt structure without
backtracking. Though we use dependencies for meaning representation and CCG for
parsing, our revealing technique can be applied to other meaning representations like
lambda expressions and for non-CCG parsing like phrase structure parsing.
Any statistical parser requires three major modules: data, parsing algorithm and
learning algorithm. This thesis is broadly divided into three parts each dealing with
one major module of the statistical parser. In Part I, we design a novel algorithm
for converting a dependency treebank to a CCGbank. We create a Hindi CCGbank with
coverage of 96% using this algorithm. We also do a cross-formalism experiment
where we show that CCG supertags can improve widely used dependency parsers.
We experiment with two popular dependency parsers (Malt and MST) for two diverse
languages: English and Hindi. For both languages, CCG categories improve the overall
accuracy of both parsers by around 0.3-0.5% in all experiments. For both parsers,
we see larger improvements specifically on dependencies at which they are known
to be weak: long distance dependencies for Malt, and verbal arguments for MST.
The result is particularly interesting in the case of the fast greedy parser (Malt), since
improving its accuracy without significantly compromising speed is relevant for large
scale applications such as parsing the web.
We present a novel algorithm for incremental transition-based CCG parsing for
English and Hindi, in Part II. Incremental parsers have potential advantages for applications
like language modeling for machine translation and speech recognition. We
introduce two new actions in the shift-reduce paradigm for revealing the required information
during parsing. We also analyze the impact of a beam and look-ahead for
parsing. In general, using a beam and/or look-ahead gives better results than not using
them. We also show that the incremental CCG parser is more useful than a non-incremental
version for predicting relative sentence complexity. Given a pair of sentences
from Wikipedia and Simple Wikipedia, we build a classifier which predicts whether one
sentence is simpler or more complex than the other. We show that features from a CCG parser
in general and an incremental CCG parser in particular are more useful than those from a chart-based
phrase structure parser both in terms of speed and accuracy.
In Part III, we develop the first neural network based training algorithm for parsing
CCG. We also study the impact of neural network based tagging models, and greedy
versus beam-search parsing, by using a structured neural network model. In greedy
settings, neural network models give significantly better results than the perceptron
models and are also over three times faster. Using a narrow beam, the structured neural
network model gives consistently better results than the basic neural network model.
For English, the structured neural network gives similar performance to the structured perceptron
parser. For Hindi, however, the structured perceptron still performs best.
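The shift-reduce paradigm that the abstract above builds on can be illustrated with a minimal sketch. It is not the thesis's algorithm: it omits the two revealing actions, the dependency graph, beam search and look-ahead, and it assumes only forward and backward application with a greedy "reduce whenever possible" policy. The category encoding and the names `combine` and `parse` are hypothetical, chosen for illustration; the lexical categories for "John ate an apple" are the standard textbook ones.

```python
# Categories: an atomic category is a string ("NP"); a complex category is a
# tuple (result, slash, argument), e.g. ("NP", "/", "N") stands for NP/N.
# This encoding is an assumption made for the sketch.

def combine(left, right):
    """Try to combine two adjacent categories; return None if no rule applies."""
    # Forward application:  X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    # Backward application:  Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None

def parse(tagged_sentence):
    """Greedy shift-reduce: shift each word, then reduce while a rule applies."""
    stack = []
    for _word, category in tagged_sentence:
        stack.append(category)              # shift
        while len(stack) >= 2:
            result = combine(stack[-2], stack[-1])
            if result is None:
                break
            stack[-2:] = [result]           # reduce the top two items
    return stack

SENTENCE = [
    ("John", "NP"),
    ("ate", (("S", "\\", "NP"), "/", "NP")),   # (S\NP)/NP
    ("an", ("NP", "/", "N")),                  # NP/N
    ("apple", "N"),
]

print(parse(SENTENCE))  # ['S']
```

In this sketch the subject "John" stays on the stack until the whole verb phrase has been built, which is the non-incremental behaviour the abstract sets out to avoid.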
Discourse Structure in Machine Translation Evaluation
In this article, we explore the potential of using sentence-level discourse
structure for machine translation evaluation. We first design discourse-aware
similarity measures, which use all-subtree kernels to compare discourse parse
trees in accordance with the Rhetorical Structure Theory (RST). Then, we show
that a simple linear combination with these measures can help improve various
existing machine translation evaluation metrics regarding correlation with
human judgments both at the segment- and at the system-level. This suggests
that discourse information is complementary to the information used by many of
the existing evaluation metrics, and thus it could be taken into account when
developing richer evaluation metrics, such as the WMT-14 winning combined
metric DiscoTKparty. We also provide a detailed analysis of the relevance of
various discourse elements and relations from the RST parse trees for machine
translation evaluation. In particular we show that: (i) all aspects of the RST
tree are relevant, (ii) nuclearity is more useful than relation type, and (iii)
the similarity of the translation RST tree to the reference tree is positively
correlated with translation quality.
Comment: machine translation, machine translation evaluation, discourse
analysis. Computational Linguistics, 201
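The "simple linear combination" mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the function name `combine_scores`, the single interpolation weight and the example scores are all hypothetical, and the article's actual weighting scheme and tuning procedure are not given in the abstract.

```python
def combine_scores(base_scores, discourse_scores, weight=0.5):
    """Interpolate an existing metric's scores with discourse-similarity scores.

    `weight` is the share given to the discourse measure (an assumed
    parameterisation; in practice it would be tuned against human judgments).
    """
    if len(base_scores) != len(discourse_scores):
        raise ValueError("both score lists must cover the same segments")
    return [
        (1.0 - weight) * base + weight * discourse
        for base, discourse in zip(base_scores, discourse_scores)
    ]

# Made-up segment-level scores for two translations.
combined = combine_scores([0.2, 0.8], [0.6, 0.4], weight=0.5)
```

The combined scores would then be correlated with human judgments in the same way as the original metric's scores.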
A history of the Bhojpuri (or "Hindi") language in South Africa
Bibliography: pages 308-318. Although Indian languages have existed in South Africa for the last 125 years, there are no academic studies of any of them - of their use in South Africa, their evolution and current decline. Many misconceptions persist concerning their names, their structure, and status as 'proper' languages. This thesis deals with the history of one such language, Bhojpuri (more usually, but incorrectly, referred to as "Hindi"). I attempt to trace the origins of the South African variety of this language by examining the places of origin of the original indentured migrants who brought it to South Africa. A complex sociolinguistic picture emerges, since these immigrants came from a very wide area in North India spanning several languages. I also attempt to describe the early history of Bhojpuri in South Africa as a 'plantation' language. Subsequent changing patterns of usage are then detailed, including phonetic, syntactic, lexical and semantic change. The influence of other South African languages - chiefly English, but also Zulu, Fanagalo, and other Indian languages - is described in detail, as well as changes not directly attributable to language contact. A final section focusses on the decline of the language and the process of language death. From another (more international) perspective this study lays the foundation for comparisons between Bhojpuri in South Africa and other 'overseas' varieties of it, spawned under very similar conditions, in ex-colonies like Surinam, Fiji, Mauritius, Guyana, Trinidad and others. Such a comparative study could well make as great a contribution to general and socio-linguistics as the study of creoles has in the recent past. Information concerning this unwritten language was gathered by field-work throughout Natal. This involved informal interviews with over two hundred fluent speakers, including four who had been born in India during the time of immigrations.
The study also draws upon the author's observations on language practices as an 'inside' member of the community under study.
Sentence Simplification for Text Processing
A thesis submitted in partial fulfilment of the requirement of the University of Wolverhampton for the degree of Doctor of Philosophy.
Propositional density and syntactic complexity are two features of sentences which
affect the ability of humans and machines to process them effectively. In this
thesis, I present a new approach to automatic sentence simplification which processes
sentences containing compound clauses and complex noun phrases (NPs)
and converts them into sequences of simple sentences which contain fewer of these
constituents and have reduced per sentence propositional density and syntactic
complexity.
My overall approach is iterative and relies on both machine learning and handcrafted
rules. It implements a small set of sentence transformation schemes, each
of which takes one sentence containing compound clauses or complex NPs and
converts it into one or two simplified sentences containing fewer of these constituents
(Chapter 5). The iterative algorithm applies the schemes repeatedly and is able
to simplify sentences which contain arbitrary numbers of compound clauses and
complex NPs. The transformation schemes rely on automatic detection of these
constituents, which may take a variety of forms in input sentences. In the thesis, I
present two new shallow syntactic analysis methods which facilitate the detection
process.
The first of these identifies various explicit signs of syntactic complexity in
input sentences and classifies them according to their specific syntactic linking and bounding functions. I present the annotated resources used to train and
evaluate this sign tagger (Chapter 2) and the machine learning method used to
implement it (Chapter 3). The second syntactic analysis method exploits the sign
tagger and identifies the spans of compound clauses and complex NPs in input
sentences. In Chapter 4 of the thesis, I describe the development and evaluation
of a machine learning approach performing this task. This chapter also presents
a new annotated dataset supporting this activity.
In the thesis, I present two implementations of my approach to sentence simplification.
One of these exploits handcrafted rule activation patterns to detect
different parts of input sentences which are relevant to the simplification process.
The other implementation uses my machine learning method to identify
compound clauses and complex NPs for this purpose.
Intrinsic evaluation of the two implementations is presented in Chapter 6 together
with a comparison of their performance with several baseline systems. The
evaluation includes comparisons of system output with human-produced simplifications,
automated estimations of the readability of system output, and surveys
of human opinions on the grammaticality, accessibility, and meaning of automatically
produced simplifications.
Chapter 7 presents extrinsic evaluation of the sentence simplification method
exploiting handcrafted rule activation patterns. The extrinsic evaluation involves
three NLP tasks: multidocument summarisation, semantic role labelling, and information
extraction. Finally, in Chapter 8, conclusions are drawn and directions
for future research considered.
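The iterative application of transformation schemes described in the abstract above can be sketched as a worklist loop. This is a toy illustration, not the thesis's method: the single scheme `split_compound` is hypothetical and simply splits at the string ", and ", whereas the thesis relies on a sign tagger and detected clause and NP spans. The names `split_compound` and `simplify` are assumptions.

```python
def split_compound(sentence):
    """Toy scheme (hypothetical): split one sentence at its first ', and '.

    Returns two shorter sentences, or None if the scheme does not apply.
    """
    marker = ", and "
    idx = sentence.find(marker)
    if idx == -1:
        return None
    rest = sentence[idx + len(marker):]
    if not rest:
        return None
    left = sentence[:idx].rstrip(".") + "."
    right = rest[0].upper() + rest[1:]
    return [left, right]

def simplify(sentence, schemes):
    """Apply schemes repeatedly until no scheme applies to any sentence."""
    pending = [sentence]
    done = []
    while pending:
        current = pending.pop(0)
        for scheme in schemes:
            pieces = scheme(current)
            if pieces:
                # Re-queue the outputs in order; they may need more passes.
                pending = pieces + pending
                break
        else:
            done.append(current)   # no scheme applied: sentence is final
    return done

result = simplify(
    "John ate an apple, and Mary drank tea, and Sam slept.",
    [split_compound],
)
print(result)  # ['John ate an apple.', 'Mary drank tea.', 'Sam slept.']
```

Because each output is re-queued, a sentence with any number of coordinated clauses is reduced one split at a time, which mirrors the abstract's claim about arbitrary numbers of compound clauses.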