Automatic Accuracy Prediction for AMR Parsing
Abstract Meaning Representation (AMR) represents sentences as directed,
acyclic and rooted graphs, aiming to capture their meaning in a machine-readable
format. AMR parsing converts natural language sentences into such
graphs. However, evaluating a parser on new data by means of comparison to
manually created AMR graphs is very costly. We would also like to be able to
detect parses of questionable quality, or to prefer the results of alternative
systems by selecting those whose quality we can assess as good. We propose
AMR accuracy prediction as the task of predicting several metrics of
correctness for an automatically generated AMR parse in the absence of the
corresponding gold parse. We develop a neural end-to-end multi-output
regression model and perform three case studies: firstly, we evaluate the
model's capacity to predict AMR parse accuracies and test whether it can
reliably assign high scores to gold parses. Secondly, we perform parse
selection based on predicted parse accuracies of candidate parses from
alternative systems, with the aim of improving overall results. Finally, we
predict system ranks for submissions from two AMR shared tasks on the basis of
their predicted parse accuracy averages. All experiments are carried out across
two different domains and show that our method is effective.
Comment: accepted at *SEM 201
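The parse selection and system ranking case studies described above can be illustrated with a minimal sketch. It assumes the predicted accuracies are already available as plain numbers; the system names, the scores, and the helper functions `select_parses` and `rank_systems` are hypothetical and are not taken from the paper, and the neural regression model itself is not reproduced here.

```python
from statistics import mean


def select_parses(predicted):
    """Pick, for each sentence, the system with the highest predicted accuracy.

    predicted maps a system name to a list of predicted accuracies,
    one per sentence (all lists are assumed to have the same length).
    """
    systems = sorted(predicted)  # sorted names make tie-breaking deterministic
    n_sentences = len(predicted[systems[0]])
    return [max(systems, key=lambda s: predicted[s][i]) for i in range(n_sentences)]


def rank_systems(predicted):
    """Rank systems by their average predicted accuracy, best first."""
    return sorted(predicted, key=lambda s: mean(predicted[s]), reverse=True)


# Hypothetical predicted accuracies for three sentences and two parsers.
scores = {
    "parser_a": [0.71, 0.40, 0.66],
    "parser_b": [0.65, 0.58, 0.70],
}
selection = select_parses(scores)
ranking = rank_systems(scores)
```

Here `selection` names the preferred system per sentence, and `ranking` orders the systems by their averaged predictions, mirroring the second and third case studies at a toy scale.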
Viability of Sequence Labeling Encodings for Dependency Parsing
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
This thesis presents new methods for recasting dependency parsing as
a sequence labeling task, yielding a viable alternative to the traditional
transition- and graph-based approaches. It is shown that sequence labeling
parsers provide several advantages for dependency parsing, such
as: (i) a good trade-off between accuracy and parsing speed, (ii) genericity,
which enables running a parser in generic sequence labeling software,
and (iii) pluggability, which allows using full parse trees as features for
downstream tasks.
The backbone of dependency parsing as sequence labeling is the set of encodings
that serve as linearization methods for mapping dependency
trees into discrete labels, such that each token in a sentence is associated
with a label. We introduce three encoding families: (i)
head selection, (ii) bracketing-based and (iii) transition-based encodings,
which are differentiated by the way they represent a dependency
tree as a sequence of labels. We empirically examine the viability of
the encodings and provide an analysis of their facets.
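To make the idea of a linearization concrete, the sketch below shows one simple head-selection style encoding in which each token's label stores the relative offset to its head together with the dependency relation. This is an illustrative assumption: the abstract does not specify the label format used in the thesis, and the function names `encode` and `decode`, the `ROOT` marker, and the example sentence are hypothetical.

```python
def encode(heads, rels):
    """Map a dependency tree to one label per token.

    heads[i] is the 1-based index of the head of token i+1 (0 means root);
    rels[i] is that token's dependency relation.
    """
    labels = []
    for i, (head, rel) in enumerate(zip(heads, rels), start=1):
        # Relative offset to the head, written with an explicit sign.
        offset = "ROOT" if head == 0 else f"{head - i:+d}"
        labels.append(f"{offset}_{rel}")
    return labels


def decode(labels):
    """Recover head indices and relations from the label sequence."""
    heads, rels = [], []
    for i, label in enumerate(labels, start=1):
        offset, rel = label.split("_", 1)
        heads.append(0 if offset == "ROOT" else i + int(offset))
        rels.append(rel)
    return heads, rels


# "She reads books": "reads" is the root; the other two tokens attach to it.
heads = [2, 0, 2]
rels = ["nsubj", "root", "obj"]
labels = encode(heads, rels)
```

Because every token receives exactly one label, any generic sequence labeling model could in principle be trained on such labels, and `decode` turns its output back into a tree (assuming the predicted labels are well-formed).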
Furthermore, we explore the feasibility of leveraging external complementary
data in order to enhance parsing performance. Our sequence
labeling parser is endowed with two kinds of representations. First,
we exploit the complementary nature of dependency and constituency
parsing paradigms and enrich the parser with representations from both
syntactic abstractions. Secondly, we use human language processing
data to guide our parser with representations from eye movements.
Overall, the results show that recasting dependency parsing as sequence
labeling is a viable approach that is fast and accurate and provides
a practical alternative for integrating syntax in NLP tasks.
This work has been carried out thanks to funding from
the European Research Council (ERC), under the European Union’s
Horizon 2020 research and innovation programme (FASTPARSE, grant
agreement No 714150).
From Texts to Prerequisites. Identifying and Annotating Propaedeutic Relations in Educational Textual Resources
Prerequisite Relations (PRs) are dependency relations established between two distinct concepts expressing which piece(s) of information a student has to learn first in order to understand a certain target concept. Such relations are among the most fundamental in Education, playing a crucial role not only in the acquisition of new knowledge, but also in novel applications of Artificial Intelligence to distance learning and e-learning. Indeed, resources annotated with such information could be used to develop automatic systems able to acquire and organize the knowledge embodied in educational resources, possibly fostering educational applications personalized, e.g., to students' needs and prior knowledge.
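As a small illustration of how PR-annotated pairs could be organized, the sketch below stores each relation as a directed edge from a prerequisite to its target concept and derives one admissible learning order with a topological sort. The concept names and the helper `learning_order` are hypothetical examples, and this is not the PREL method described in the thesis.

```python
from graphlib import TopologicalSorter

# Hypothetical annotated pairs: (prerequisite concept, target concept).
pr_pairs = [
    ("variable", "loop"),
    ("loop", "recursion"),
    ("function", "recursion"),
]


def learning_order(pairs):
    """Return the concepts in an order where prerequisites come first."""
    graph = {}
    for prereq, target in pairs:
        # TopologicalSorter expects a mapping from node to its predecessors.
        graph.setdefault(target, set()).add(prereq)
        graph.setdefault(prereq, set())
    return list(TopologicalSorter(graph).static_order())


order = learning_order(pr_pairs)
```

Several orders can satisfy the same set of relations, so only the relative position of each prerequisite and its target is guaranteed; a cyclic set of pairs would raise `graphlib.CycleError`.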
The present thesis discusses the issues and challenges of identifying PRs in educational textual materials with the purpose of building a shared understanding of the relation among the research community. To this aim, we present a methodology for dealing with prerequisite relations as established in educational textual resources, which aims at providing a systematic approach for uncovering PRs in textual materials, both when manually annotating and when automatically extracting them. The fundamental principles of our methodology guided the development of a novel framework for PR identification which comprises three components, each tackling a different task: (i) an annotation protocol (PREAP), reporting the set of guidelines and recommendations for building PR-annotated resources; (ii) an annotation tool (PRET), supporting the creation of manually annotated datasets reflecting the principles of PREAP; (iii) an automatic PR learning method based on machine learning (PREL). The main novelty of our methodology and framework lies in the fact that we propose to uncover PRs from textual resources relying solely on the content of the instructional material: unlike other works, rather than creating de-contextualised PRs, we acknowledge the presence of a PR between two concepts only if it emerges from the way they are presented in the text. By doing so, we anchor relations to the text while modelling the knowledge structure entailed in the resource.
As an original contribution of this work, we explore whether the linguistic complexity of the text influences the task of manual identification of PRs. To this aim, we investigate the interplay between text and content in educational texts through a crowd-sourcing experiment on concept sequencing. Our methodology values the content of educational materials, as it incorporates the evidence acquired from this investigation, which suggests that PR recognition is highly influenced by the way in which concepts are introduced in the resource and by the complexity of the texts.
The thesis reports a case study dealing with every component of the PR framework, which produced a novel manually-labelled PR-annotated dataset.
XXXIII CICLO - DIGITAL HUMANITIES. TECNOLOGIE DIGITALI, ARTI, LINGUE, CULTURE E COMUNICAZIONE - Lingue, culture e tecnologie digitali
Alzetta, Chiar