Larger-first partial parsing
Larger-first partial parsing is a primarily top-down approach to partial parsing that is opposite to current easy-first, or primarily bottom-up, strategies. A rich partial tree structure is captured by an algorithm that assigns a hierarchy of structural tags to each of the input tokens in a sentence. Part-of-speech tags are first assigned to the words in a sentence by a part-of-speech tagger. A cascade of deterministic finite-state automata then uses this part-of-speech information to identify syntactic relations, primarily in descending order of their size. The cascade is divided into four specialized sections: (1) a Comma Network, which identifies syntactic relations associated with commas; (2) a Conjunction Network, which partially disambiguates phrasal conjunctions and fully disambiguates clausal conjunctions; (3) a Clause Network, which identifies non-comma-delimited clauses; and (4) a Phrase Network, which identifies the remaining base phrases in the sentence. Each automaton is capable of adding one or more levels of structural tags to the tokens in a sentence. The larger-first approach is compared against a well-known easy-first approach. The results indicate that this larger-first approach is capable of (1) producing a more detailed partial parse than an easy-first approach; (2) providing better containment of attachment ambiguity; (3) handling overlapping syntactic relations; and (4) achieving a higher accuracy than the easy-first approach. The automata of each network were developed by an empirical analysis of several sources and are presented here in detail.
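To make the cascade idea concrete, here is a minimal Python sketch, assuming a toy tag inventory and using regular expressions over POS-tag strings as stand-ins for the article's hand-built deterministic finite-state automata; the token pairs, patterns, and labels are invented for illustration, not the published networks.

```python
import re

# Hypothetical input: (token, POS) pairs, as produced by a POS tagger.
sentence = [("the", "DT"), ("dog", "NN"), ("barked", "VBD"),
            (",", ","), ("and", "CC"), ("the", "DT"),
            ("cat", "NN"), ("fled", "VBD")]

def tag_level(tokens, pattern, label, levels):
    """Run one 'automaton': add a structural tag to every token whose
    POS span matches the pattern, building up levels of tags."""
    tags = " ".join(pos for _, pos in tokens)
    for m in re.finditer(pattern, tags):
        # Map character offsets in the tag string back to token indices.
        start = tags[:m.start()].count(" ")
        end = start + m.group().count(" ") + 1
        for i in range(start, end):
            levels[i].append(label)

levels = [[] for _ in sentence]
# Larger-first order: larger relations (clauses) before smaller ones
# (phrases); comma and conjunction networks would precede these.
tag_level(sentence, r"\bDT NN VBD\b", "CLAUSE", levels)
tag_level(sentence, r"\bDT NN\b", "NP", levels)

for (tok, pos), lv in zip(sentence, levels):
    print(f"{tok}\t{pos}\t{lv}")
```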
Sentence Simplification for Text Processing
A thesis submitted in partial fulfilment of the requirement of the University of Wolverhampton for the degree of Doctor of Philosophy.

Propositional density and syntactic complexity are two features of sentences which affect the ability of humans and machines to process them effectively. In this thesis, I present a new approach to automatic sentence simplification which processes sentences containing compound clauses and complex noun phrases (NPs) and converts them into sequences of simple sentences which contain fewer of these constituents and have reduced per-sentence propositional density and syntactic complexity.
My overall approach is iterative and relies on both machine learning and handcrafted rules. It implements a small set of sentence transformation schemes, each of which takes one sentence containing compound clauses or complex NPs and converts it into one or two simplified sentences containing fewer of these constituents (Chapter 5). The iterative algorithm applies the schemes repeatedly, as sketched below, and is able to simplify sentences which contain arbitrary numbers of compound clauses and complex NPs. The transformation schemes rely on automatic detection of these constituents, which may take a variety of forms in input sentences. In the thesis, I present two new shallow syntactic analysis methods which facilitate the detection process.
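The following is a minimal Python sketch of the iterative idea under strong simplifying assumptions: detection is reduced to a single string pattern and one toy transformation scheme, standing in for the thesis's machine-learned detectors and full scheme set.

```python
# Toy sign of a compound clause; the thesis handles many constructions.
SIGN = ", and "

def find_compound_clause(sentence):
    """Toy detector: treat ', and ' as the sign of a compound clause."""
    idx = sentence.find(SIGN)
    return idx if idx >= 0 else None

def split_compound(sentence, idx):
    """Toy transformation scheme: one compound sentence -> two sentences."""
    left = sentence[:idx].rstrip(" ,.") + "."
    right = sentence[idx + len(SIGN):].lstrip()
    return [left, right[0].upper() + right[1:]]

def simplify(sentence):
    """Apply schemes repeatedly until no target constituent remains."""
    agenda, result = [sentence], []
    while agenda:
        s = agenda.pop()
        idx = find_compound_clause(s)
        if idx is None:
            result.append(s)                                 # already simple
        else:
            agenda.extend(reversed(split_compound(s, idx)))  # re-examine parts
    return result

print(simplify("The dog barked, and the cat fled, and the bird sang."))
# ['The dog barked.', 'The cat fled.', 'The bird sang.']
```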
The first of these identifies various explicit signs of syntactic complexity in input sentences and classifies them according to their specific syntactic linking and bounding functions. I present the annotated resources used to train and evaluate this sign tagger (Chapter 2) and the machine learning method used to implement it (Chapter 3). The second syntactic analysis method exploits the sign tagger and identifies the spans of compound clauses and complex NPs in input sentences. In Chapter 4 of the thesis, I describe the development and evaluation of a machine learning approach performing this task. This chapter also presents a new annotated dataset supporting this activity.
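A hypothetical sketch of sign tagging as token classification is given below, assuming scikit-learn is available; the feature set, label inventory, and training examples are invented for illustration and are far simpler than those described in Chapters 2 and 3.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def sign_features(tokens, i):
    """Contextual features for a potential sign (a comma or conjunction)."""
    return {
        "sign": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

# Invented training examples: (sentence tokens, sign index, label), where
# labels distinguish the sign's syntactic linking/bounding function.
train = [
    ("She sang and he danced".split(), 2, "CLAUSE_COORD"),
    ("He bought apples and pears".split(), 3, "PHRASE_COORD"),
    ("They left , and we stayed".split(), 3, "CLAUSE_COORD"),
    ("old books and new maps".split(), 2, "PHRASE_COORD"),
]

X = [sign_features(toks, i) for toks, i, _ in train]
y = [label for _, _, label in train]

tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
tagger.fit(X, y)

test = "I read and she wrote".split()
print(tagger.predict([sign_features(test, 2)]))
```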
In the thesis, I present two implementations of my approach to sentence simplification. One of these exploits handcrafted rule activation patterns to detect different parts of input sentences which are relevant to the simplification process. The other implementation uses my machine learning method to identify compound clauses and complex NPs for this purpose.
Intrinsic evaluation of the two implementations is presented in Chapter 6, together with a comparison of their performance with several baseline systems. The evaluation includes comparisons of system output with human-produced simplifications, automated estimations of the readability of system output, and surveys of human opinions on the grammaticality, accessibility, and meaning of automatically produced simplifications.
Chapter 7 presents extrinsic evaluation of the sentence simplification method exploiting handcrafted rule activation patterns. The extrinsic evaluation involves three NLP tasks: multi-document summarisation, semantic role labelling, and information extraction. Finally, in Chapter 8, conclusions are drawn and directions for future research considered.
Abstract syntax as interlingua: Scaling up the grammatical framework from controlled languages to robust pipelines
Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
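The interlingua idea itself can be illustrated outside GF's grammar formalism. The Python sketch below uses one abstract tree with per-language linearizers; the type names and lexicons are invented for illustration and are not GF code.

```python
from dataclasses import dataclass

@dataclass
class Pred:
    """Abstract syntax: a predication built from abstract function names."""
    subj: str   # e.g. "cat_N", a language-neutral identifier, not a word
    verb: str   # e.g. "sleep_V"

# Concrete syntaxes map abstract functions to words (and, in a fuller
# treatment, to word order, agreement, and inflection).
english = {"cat_N": "the cat", "sleep_V": "sleeps"}
french  = {"cat_N": "le chat", "sleep_V": "dort"}

def linearize(tree: Pred, lexicon: dict) -> str:
    """One linearizer per language; the tree itself never changes."""
    return f"{lexicon[tree.subj]} {lexicon[tree.verb]}"

tree = Pred("cat_N", "sleep_V")      # the interlingual representation
print(linearize(tree, english))      # the cat sleeps
print(linearize(tree, french))       # le chat dort
# Translation = parse one language to abstract syntax, linearize in another.
```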
Error analysis in automatic speech recognition and machine translation
Automatic speech recognition and machine translation are now well-known terms in the translation world. Systems that carry out these processes are increasingly taking over the work of humans, chiefly because of the speed at which they perform the tasks and their lower costs. However, the quality of these systems is debatable: they are not yet capable of delivering the same performance as human transcribers or translators. A lack of creativity, of the ability to interpret texts, and of a sense of language is often cited as the reason why the performance of machines is not yet at the level of human translation or transcription work. Despite this, there are companies that use these machines in their production pipelines. Unbabel, an online translation platform powered by artificial intelligence, is one of these companies. Through a combination of human translators and machines, Unbabel tries to provide its customers with translations of good quality. This internship report was written with the aim of gaining an overview of the performance of these systems and the errors they produce, and of building a picture of the error patterns both systems exhibit. The present work consists of an extensive analysis of errors produced by automatic speech recognition and machine translation systems after automatically transcribing and translating 10 English videos into Dutch. Different videos were deliberately chosen to see whether there were significant differences in the error patterns between videos. The data and results generated in this work aim to suggest possible ways of improving the quality of the services already mentioned.
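Although the report's analysis of error patterns is qualitative, quantitative ASR error analysis usually starts from word error rate (WER). The sketch below computes WER with word-level Levenshtein distance; it is a generic illustration, not code from the report.

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```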
Identifying nocuous ambiguity in natural language requirements
This dissertation is an investigation into how ambiguity should be classified for authors and readers of text, and how this process can be automated. Usually, authors and readers disambiguate ambiguity, either consciously or unconsciously. However, disambiguation is not always appropriate. For instance, a linguistic construction may be read differently by different people, with no consensus about which reading is the intended one. This is particularly dangerous if they do not realise that other readings are possible, as misunderstandings may then occur. This is particularly serious in the field of requirements engineering: if requirements are misunderstood, systems may be built incorrectly, and this can prove very costly.
Our research uses natural language processing techniques to address ambiguity in requirements. We develop a model of ambiguity, and a method of applying it, which represent a novel approach to the problem described here. Our model is based on the notion that human perception is the only valid criterion for judging ambiguity. If people perceive very differently how an ambiguity should be read, it will cause misunderstandings, and assigning a preferred reading to it is therefore unwise. In text, such ambiguities should be located and rewritten in a less ambiguous form; others need not be reformulated. We classify the former as nocuous and the latter as innocuous. We allow the dividing line between these two classifications to be adjustable. We term this the ambiguity threshold, and it represents a level of intolerance to ambiguity. A nocuous ambiguity can be an unacknowledged or an acknowledged ambiguity for a given set of readers. In the former case, they assign disparate readings to the ambiguity, but each is unaware that the others read it differently. In the latter case, they recognise that the ambiguity has more than one reading, but this fact may be unacknowledged by new readers.
We present an automated approach to determine whether ambiguities in text are nocuous or innocuous. We use heuristics to distinguish ambiguities for which there is a strong consensus about how they should be read; these are innocuous ambiguities. The remaining nocuous ambiguities can then be rewritten at a later stage. We find consensus opinions about ambiguities by surveying human perceptions of them. Our heuristics try to predict these perceptions automatically. They utilise various types of linguistic information: generic corpus data, morphology, and lexical subcategorisations are the most successful. We use coordination ambiguity as the test case for this research. This occurs where the scope of words such as 'and' and 'or' is unclear.
Our research contributes to both the requirements engineering and the natural language processing literatures. Ambiguity is known to be a serious problem in requirements engineering, but has rarely been dealt with effectively and thoroughly. Our approach is an appropriate solution, and our flexible ambiguity threshold is a particularly useful concept: for instance, high ambiguity intolerance can be implemented when writing requirements for safety-critical systems. Coordination ambiguities are widespread and known to cause misunderstandings, but have received comparatively little attention. Our heuristics show that linguistic data can be used successfully to predict preferred readings of very diverse coordinations. Used in combination, these heuristics demonstrate that nocuous ambiguity can be distinguished from innocuous ambiguity under certain conditions. Employing appropriate ambiguity thresholds, the approach achieves accuracy representing a 28% improvement on the baselines.
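The threshold-based decision at the heart of the model can be illustrated with a small Python sketch, assuming human judgements have already been collected as counts per reading; the example coordination and vote counts are invented.

```python
def classify(judgements: dict, threshold: float) -> str:
    """An ambiguity is innocuous if one reading attracts a consensus at or
    above the ambiguity threshold; otherwise it is nocuous and should be
    rewritten. The threshold encodes the level of intolerance to ambiguity."""
    total = sum(judgements.values())
    consensus = max(judgements.values()) / total
    return "innocuous" if consensus >= threshold else "nocuous"

# Coordination ambiguity: does "old" modify both conjuncts or only "books"?
votes = {"old [books and maps]": 9, "[old books] and maps": 3}
print(classify(votes, threshold=0.7))  # innocuous at a lenient threshold
print(classify(votes, threshold=0.9))  # nocuous for safety-critical text
```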
Proceedings
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT): http://omilia.uio.no/nealt
Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891