127 research outputs found
Universal Word Segmentation: Implementation and Interpretation
Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively correlated with the presence of word boundary markers and negatively correlated with the number of unique non-segmental terms. Based on this analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. Compared to previous work, it performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic, and Hebrew.
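To illustrate the general idea of segmentation as sequence tagging (a minimal sketch, not the paper's exact model), each character can be labeled with a BIES tag — Begin, Inside, End, or Single — so that predicting the tag sequence determines the word boundaries. The function names here are illustrative:

```python
def words_to_bies(words):
    """Encode a segmented sentence as character-level BIES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return tags

def bies_to_words(chars, tags):
    """Decode BIES tags over a character sequence back into words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends after this character
            words.append(current)
            current = ""
    if current:  # tolerate a malformed tag tail
        words.append(current)
    return words
```

For example, the segmented Chinese sentence `["我", "喜欢", "自然语言"]` encodes to `S B E B I I E`, and decoding those tags over the raw character string recovers the original words.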
Revisiting the challenges and surveys in text similarity matching and detection methods
The massive amount of information available on the internet has revolutionized the field of natural language processing, and estimating the similarity between texts remains an open research problem, even though various studies have proposed new methods over the years. This paper surveys and traces the primary studies in the field of text similarity. The aim is to give a broad overview of existing issues, applications, and methods in text similarity research. We identify four issues and several applications of text similarity matching, and classify current studies into intrinsic, extrinsic, and hybrid approaches. We then identify the methods and classify them into lexical, syntactic, semantic, structural, and hybrid similarity. Furthermore, we analyze and discuss method improvements, current limitations, and open challenges on this topic as directions for future research.
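As a concrete instance of the lexical-similarity family surveyed here, Jaccard similarity over token sets is one of the simplest measures (a sketch only; practical systems typically add normalization, character n-grams, or TF-IDF weighting):

```python
def jaccard_similarity(text_a, text_b):
    """Lexical similarity as token-set overlap: |A ∩ B| / |A ∪ B|."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a and not b:
        return 1.0  # define two empty texts as identical
    return len(a & b) / len(a | b)
```

For instance, `"the cat sat"` and `"the cat ran"` share two of four distinct tokens, giving a similarity of 0.5.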
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
The Conference on Computational Natural Language Learning (CoNLL) features a shared task in which participants train and test their learning systems on the same data sets. In 2017, one of the two tasks was devoted to learning dependency parsers for a large number of languages in a real-world setting, without any gold-standard annotation on the input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and the evaluation methodology, describe the data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
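The shared task's headline metric was the labeled attachment score (LAS): the proportion of tokens whose predicted head and dependency label both match the gold standard. A minimal version, ignoring the task's tokenization-alignment details, might look like:

```python
def attachment_scores(gold, pred):
    """Compute unlabeled (UAS) and labeled (LAS) attachment scores.

    gold, pred: equal-length lists of (head_index, deprel) tuples,
    one per token.
    """
    assert len(gold) == len(pred)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)        # head + label
    return uas, las
```

For example, if a parser gets all three heads right but mislabels one of three dependencies, it scores UAS 1.0 and LAS 2/3.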
Viability of Sequence Labeling Encodings for Dependency Parsing
Programa Oficial de Doutoramento en Computación (5009V01)
[Abstract]
This thesis presents new methods for recasting dependency parsing as a sequence labeling task, yielding a viable alternative to the traditional transition- and graph-based approaches. It is shown that sequence labeling parsers provide several advantages for dependency parsing, such as: (i) a good trade-off between accuracy and parsing speed, (ii) genericity, which enables running a parser in generic sequence labeling software, and (iii) pluggability, which allows using full parse trees as features in downstream tasks.

The backbone of dependency parsing as sequence labeling is the encodings, which serve as linearization methods for mapping dependency trees into discrete labels, such that each token in a sentence is associated with a label. We introduce three encoding families, comprising (i) head-selection, (ii) bracketing-based, and (iii) transition-based encodings, which are differentiated by the way they represent a dependency tree as a sequence of labels. We empirically examine the viability of the encodings and provide an analysis of their facets.

Furthermore, we explore the feasibility of leveraging external complementary data in order to enhance parsing performance. Our sequence labeling parser is endowed with two kinds of representations. First, we exploit the complementary nature of the dependency and constituency parsing paradigms and enrich the parser with representations from both syntactic abstractions. Second, we use human language processing data to guide our parser with representations from eye movements.

Overall, the results show that recasting dependency parsing as sequence labeling is a viable approach that is fast and accurate and provides
a practical alternative for integrating syntax in NLP tasks.
This work has been carried out thanks to the funding from
the European Research Council (ERC), under the European Union’s
Horizon 2020 research and innovation programme (FASTPARSE, grant
agreement No 714150)
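The head-selection family described in the abstract above can be illustrated with its simplest variant, where each token's label is the relative offset to its head (a sketch under that assumption; the thesis's actual encodings include refinements such as PoS-relative offsets):

```python
def encode_relative_heads(heads):
    """Map each token's head to a relative-offset label.

    heads: heads[i] is the head of token i+1, using 1-based token
    positions as in CoNLL-U, with 0 for the artificial root.
    """
    labels = []
    for i, h in enumerate(heads, start=1):
        labels.append("ROOT" if h == 0 else str(h - i))  # signed offset to head
    return labels

def decode_relative_heads(labels):
    """Invert the encoding back to head indices."""
    heads = []
    for i, lab in enumerate(labels, start=1):
        heads.append(0 if lab == "ROOT" else i + int(lab))
    return heads
```

For the head sequence `[2, 0, 2]` (tokens 1 and 3 both attach to token 2, the root), the labels are `["1", "ROOT", "-1"]`, and decoding the labels recovers the tree, which is what lets an off-the-shelf sequence labeler act as a parser.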
Towards a machine-learning architecture for lexical functional grammar parsing
Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and to find solutions which will generalize robustly across multiple languages.

The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes, and grammatical functions from treebanks, we can reduce the amount of manual specification and improve robustness, accuracy, and domain- and language-independence for LFG parsing systems. Function labels can often be mapped relatively straightforwardly to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve the acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing.

In a lexicalized grammatical formalism such as LFG, a large amount of syntactically relevant information comes from lexical entries. It is, therefore, important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text, and obtain competitive or improved results on a range of typologically diverse languages.
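One common way to turn lemmatization into a supervised classification problem, in the spirit of the "lemmatization classes" mentioned above (the exact scheme used in this work may differ), is to derive a suffix-edit script from each form–lemma pair and predict that script as a class label:

```python
def suffix_edit_class(form, lemma):
    """Derive a lemmatization class: (chars to strip, suffix to append)."""
    # Length of the longest common prefix of form and lemma.
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])

def apply_edit_class(form, edit):
    """Apply a (strip, append) class to a possibly unseen word form."""
    strip, append = edit
    return (form[:-strip] if strip else form) + append
```

For example, the pair ("studies", "study") yields the class `(3, "y")`, and applying that same class to the unseen form "flies" produces "fly" — which is what makes a classifier over such classes generalize across the lexicon.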
- …