Constituent Parsing as Sequence Labeling
We introduce a method to reduce constituent parsing to sequence labeling. For
each word w_t, the method generates a label that encodes: (1) the number of
ancestors in the tree that the words w_t and w_{t+1} have in common, and (2)
the nonterminal symbol at their lowest common ancestor. We first prove that the
proposed encoding function is injective for any tree without unary branches. In
practice, the approach is extended to all constituency trees by collapsing
unary branches. We then use the PTB and CTB treebanks as testbeds
and propose a set of fast baselines. We achieve 90.7% F-score on the PTB test
set, outperforming the Vinyals et al. (2015) sequence-to-sequence parser. In
addition, sacrificing some accuracy, our approach achieves the fastest
constituent parsing speeds reported to date on PTB by a wide margin.
Comment: EMNLP 2018 (Long Papers). Revised version with improved results after fixing evaluation bug.
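The encoding described in the abstract can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' implementation: the nested-list tree format, function names, and the `"NONE"` sentinel for the final word are our own choices. Each word's label pairs the count of ancestors it shares with the next word with the nonterminal at their lowest common ancestor.

```python
def leaf_paths(tree, path=()):
    """Yield (word, ancestor_nodes) for each leaf of a nested-list tree.

    Ancestors are the actual node objects from the root down, so two
    distinct nodes with the same label (e.g. sibling NNPs) stay distinct.
    """
    new_path = path + (tree,)
    for child in tree[1:]:
        if isinstance(child, str):        # child is a word (leaf)
            yield child, new_path
        else:
            yield from leaf_paths(child, new_path)

def encode(tree):
    """Encode a constituency tree as one (n_t, x_t) label per word:
    n_t = number of ancestors shared by w_t and w_{t+1},
    x_t = nonterminal at their lowest common ancestor."""
    words, paths = zip(*leaf_paths(tree))
    labels = []
    for t in range(len(words) - 1):
        a, b = paths[t], paths[t + 1]
        n = 0
        # Length of the shared ancestor prefix, compared by node identity.
        while n < min(len(a), len(b)) and a[n] is b[n]:
            n += 1
        labels.append((n, a[n - 1][0]))   # shared count, LCA nonterminal
    labels.append((0, "NONE"))            # sentinel label for the last word
    return list(words), labels

# "(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))" as nested lists:
tree = ["S", ["NP", ["DT", "The"], ["NN", "cat"]], ["VP", ["VBZ", "sleeps"]]]
words, labels = encode(tree)
# → labels == [(2, "NP"), (1, "S"), (0, "NONE")]
```

"The" and "cat" share two ancestors (S and NP) with NP as their lowest common ancestor, while "cat" and "sleeps" share only S, which matches the injectivity claim for trees without unary branches: each label is fully determined by the tree shape.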
Layer-Based Dependency Parsing
PACLIC 23 / City University of Hong Kong / 3-5 December 2009
A Linearization Framework for Dependency and Constituent Trees
[Abstract]: Parsing is a core natural language processing problem in which, given an input raw sentence, a
model automatically produces a structured output that represents its syntactic structure. The
most common formalisms in this field are constituent and dependency parsing. Although the
two formalisms differ, they also share limitations, in particular the limited speed of the models
that produce the desired representation, and the lack of a common representation that would
allow any end-to-end neural system to learn them. Transforming both parsing tasks into a
sequence labeling task solves both of these problems. Several tree linearizations have been
proposed in recent years; however, there is no common suite that facilitates their use under an
integrated framework. In this work, we will develop such a system. On the one hand, the system
will be able to: (i) encode syntactic trees according to the desired syntactic formalism and
linearization function, and (ii) decode linearized trees into their original representation. On the
other hand, (iii) we will also train several neural sequence labeling systems to perform parsing
from those labels, and we will compare the results.

[Resumen]: Syntactic parsing is a central task in natural language processing in which, given a
sentence, an output representing its syntactic structure is produced. The most popular formalisms
are constituency and dependency parsing. Although they are fundamentally different, they share
certain limitations, such as the slowness of the models used to predict them, or the lack of a
common representation that would allow them to be predicted with general-purpose neural
systems. Transforming both formalisms into a sequence labeling task solves both problems. In
recent years, different ways of linearizing syntactic trees have been proposed, but a unified piece
of software that can produce representations for both formalisms within a single system was still
lacking. In this work, such a system will be developed. On the one hand, it will allow: (i)
linearizing syntactic trees with the desired formalism and linearization function, and (ii)
decoding linearized trees back into their original format. On the other hand, several sequence
labeling models will also be trained, and the results obtained will be compared.

Undergraduate thesis (UDC.FIC). Computer Engineering. Academic year 2021/2022.