25 research outputs found
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performances of four systems on datasets covering 16 languages,
two of these systems being feature-based (MEMMs and CRFs) and two of them being
neural-based (bi-LSTMs). We show that, on average, all four approaches perform
similarly and reach state-of-the-art results. Yet better performances are
obtained with our feature-based models on lexically richer datasets (e.g. for
morphologically rich languages), whereas neural-based results are higher on
datasets with less lexical variability (e.g. for English). These conclusions
hold in particular for the MEMM models relying on our system MElt, which
benefited from newly designed features. This shows that, under certain
conditions, feature-based approaches enriched with morphosyntactic lexicons are
competitive with respect to neural methods
Syntactic Parsing versus MWEs: What can fMRI signal tell us
Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing
In this paper, we investigate various strategies for performing both syntactic dependency parsing and contiguous multiword expression (MWE) recognition, testing them on the dependency version of the French Treebank \cite{abeille:04}, as instantiated in the SPMRL Shared Task \cite{spmrl:st:2013}. Our work focuses on using an alternative representation of syntactically regular MWEs, which captures their internal syntactic structure. We obtain a system with performance comparable to that of previous work on this dataset, but which predicts both syntactic dependencies and the internal structure of MWEs. This can be useful for capturing the various degrees of semantic compositionality of MWEs.
A case study in tagging case in German: an assessment of statistical approaches
In this study, we assess the performance of purely statistical approaches using supervised machine learning for predicting case in German (nominative, accusative, dative, genitive, n/a). We experiment with two different treebanks containing morphological annotations: TIGER and TUEBA. An evaluation with 10-fold cross-validation serves as the basis for systematic comparisons of the optimal parametrizations of different approaches. We test taggers based on Hidden Markov Models (HMM), Decision Trees, and Conditional Random Fields (CRF). The CRF approach based on our hand-crafted feature model achieves an accuracy of about 94%. This outperforms all other approaches and results in an improvement of 11% compared to a baseline HMM trigram tagger and an improvement of 2% compared to a state-of-the-art tagger for rich morphological tagsets. Moreover, we investigate the effect of additional (morphological) categories (gender, number, person, part of speech) in the internal tagset used for the training. Rich internal tagsets improve results for all tested approaches
Multiword expressions at length and in depth
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.
Ensemble Morphosyntactic Analyser for Classical Arabic
Classical Arabic (CA) is an influential language for Muslim lives around the
world. It is the language of two sources of Islamic laws: the Quran and the Sunnah,
the collection of traditions and sayings attributed to the prophet Mohammed.
However, classical Arabic in general, and the Sunnah, in particular, is underexplored and under-resourced in the field of computational linguistics. This study examines the possible directions for adapting existing tools, specifically morphological analysers, designed for modern standard Arabic (MSA) to classical Arabic.
Morphological analysers for CA are limited, as are the data for evaluating them. In this study, we adapt existing analysers and create a validation dataset from
the Sunnah books. Inspired by the advances in deep learning and the promising
results of ensemble methods, we developed a systematic method for transferring
morphological analysis that is capable of handling different labelling systems and
various sequence lengths.
In this study, we handpicked the four best open-access MSA morphological analysers. Data generated from these analysers are evaluated before and after adaptation through the existing Quranic Corpus and the Sunnah Arabic Corpus. The findings are as follows: first, it is feasible to analyse under-resourced languages using existing resources for a comparable language, given a small but sufficient set of annotated text. Second, analysers typically generate different errors, and this can be exploited. Third, an explicit alignment of sequences and mapping of labels is not necessary to achieve comparable accuracies, given a sufficiently large training dataset.
Adapting existing tools is easier than creating tools from scratch. The resulting quality depends on the size of the training data and on the number and quality of the input taggers. The pipeline architecture performs less well than the end-to-end neural network architecture because of error propagation and limitations on the output format. A valuable tool and data for annotating classical Arabic are made freely available.
Extended papers from the MWE 2017 workshop
The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide.
This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expressions modelling and processing, and will be a point of reference for future work.
Automatic identification and translation of multiword expressions
A thesis submitted in partial fulfilment of the requirements of the
University of Wolverhampton for the degree of Doctor of Philosophy.
Multiword Expressions (MWEs) belong to a class of phraseological phenomena
that is ubiquitous in the study of language. They are heterogeneous
lexical items consisting of more than one word and feature lexical, syntactic,
semantic and pragmatic idiosyncrasies. Scholarly research on MWEs benefits
both natural language processing (NLP) applications and end users.
This thesis involves designing new methodologies to identify and translate
MWEs. In order to deal with MWE identification, we first develop datasets
of annotated verb-noun MWEs in context. We then propose a method which
employs word embeddings to disambiguate between literal and idiomatic usages
of the verb-noun expressions. The existence of expression types with various
idiomatic and literal distributions leads us to re-examine their modelling and
evaluation. We propose a type-aware train and test splitting approach to
prevent models from overfitting and avoid misleading evaluation results.
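
One plausible reading of a type-aware split is that all occurrences of a given expression type go to either the training or the test portion, never both. The sketch below illustrates that reading only; the abstract does not describe the actual procedure, and the function name, the triple layout and the `test_fraction` parameter are assumptions made for this example.

```python
import random
from collections import defaultdict


def type_aware_split(instances, test_fraction=0.2, seed=0):
    """Split (expression_type, sentence, label) triples so that no
    expression type occurs in both the train and the test portion.

    Hypothetical helper: the grouping key is the first element of
    each triple.
    """
    # Group all occurrences by their expression type.
    by_type = defaultdict(list)
    for inst in instances:
        by_type[inst[0]].append(inst)

    # Shuffle the types (not the occurrences) reproducibly.
    types = sorted(by_type)
    random.Random(seed).shuffle(types)

    # Reserve a fraction of the types, at least one, for testing.
    n_test = max(1, round(len(types) * test_fraction))
    test = [inst for t in types[:n_test] for inst in by_type[t]]
    train = [inst for t in types[n_test:] for inst in by_type[t]]
    return train, test
```

Because whole types are held out, test accuracy under such a split reflects behaviour on expressions the model has never seen, rather than memorised type-level biases.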
Identification of MWEs in context can be modelled with sequence tagging
methodologies. To this end, we devise a new neural network architecture,
which is a combination of convolutional neural networks and long short-term
memory networks with an optional conditional random field layer on top. We
conduct extensive evaluations on several languages, demonstrating better
performance than state-of-the-art systems. Experiments show that the generalisation power of the model in predicting unseen MWEs is significantly better than that of previous systems.
In order to find translations for verb-noun MWEs, we propose a bilingual
distributional similarity approach derived from a word embedding model that
supports arbitrary contexts. The technique is devised to extract translation
equivalents from comparable corpora which are an alternative resource to
costly parallel corpora. We finally conduct a series of experiments to investigate
the effects of size and quality of comparable corpora on automatic
extraction of translation equivalents
A Hybrid Framework for Text Analysis
2015 - 2016
In Computational Linguistics there is an essential dichotomy between linguists
and computer scientists. The first, with a strong knowledge of language
structures, lack engineering skills. The second, conversely, are expert in
computing and mathematics but attach little value to the basic mechanisms and
structures of language. Moreover, this discrepancy has increased, especially
in recent decades, due to the growth of computational resources and the
gradual computerization of the world; Machine Learning technologies for
solving Artificial Intelligence problems, which allow machines, for example,
to learn from manually generated examples, have been used more and more often
in Computational Linguistics in order to overcome the obstacle represented by
language structures and their formal representation.
The dichotomy has given rise to two main approaches to Computational
Linguistics, which respectively prefer:
- rule-based methods, which try to imitate the way in which humans use and
  understand language, reproducing the syntactic structures on which the
  understanding process is based and building lexical resources such as
  electronic dictionaries, taxonomies or ontologies;
- statistics-based methods, which, conversely, treat language as a group of
  elements, quantifying words in a mathematical way and trying to extract
  information without identifying syntactic structures or, in some
  algorithms, trying to give the machine the ability to learn these
  structures.
One of the main problems is the lack of communication between these two
different approaches, due to the substantial differences that characterize
them: on the one hand, there is a strong focus on how language works and on
its characteristics, with a tendency towards analytical and manual work; on
the other hand, the engineering perspective sees language as an obstacle and
regards algorithms as the fastest way to overcome it.
However, the lack of communication is not only an incompatibility: following
Harris, the best way to approach natural language may result from taking the
best of both.
At the moment, there is a large number of open-source tools that perform
text analysis and Natural Language Processing. A great part of these tools are
based on statistical models and consist of separate modules which can be
combined to create a pipeline for processing text. Many of these resources
consist of code packages which have no GUI (Graphical User Interface) and are
impossible to use for users without programming skills. Furthermore, the vast
majority of these open-source tools support only English and, when Italian is
included, their performance decreases significantly. On the other hand,
open-source tools for Italian are very few.
In this work we want to fill this gap by presenting a new hybrid framework
for the analysis of Italian texts. It is not intended as a commercial tool;
rather, it was built to help linguists and other scholars perform rapid text
analysis and produce linguistic data. The framework, which performs both
statistical and rule-based analysis, is called LG-Starship.
The idea is to build modular software that includes, in the beginning, the
basic algorithms to perform different kinds of analysis. The modules perform
the following tasks:
- Preprocessing Module: a module with which it is possible to load a
  text, normalize it or delete stop-words. As output, the module presents
  the list of tokens and letters which compose the text, with their
  respective occurrence counts, and the processed text.
- Mr. Ling Module: a module with which POS tagging and lemmatization
  are performed. The module also returns the table of lemmas with their
  occurrence counts and the table with the quantification of
  grammatical tags.
- Statistic Module: a module with which it is possible to calculate Term
  Frequency and TF-IDF of tokens or lemmas, extract bi-gram and tri-gram
  units and export the results as tables.
- Semantic Module: a module which uses the Hyperspace Analogue to Language
  algorithm to calculate semantic similarity between words. The module
  returns word-by-word similarity matrices which can be exported
  and analyzed.
- Syntactic Module: a module which analyzes the syntactic structure of a
  selected sentence and tags the verbs and their arguments with semantic
  labels.
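
The kind of computation attributed to the Statistic Module can be sketched in a few lines of Python. This is an illustration only, not LG-Starship code: the function names are hypothetical, and the inverse document frequency is taken as log(N / df), which is one common variant among several.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each token in one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: c / total for tok, c in counts.items()}


def tf_idf(documents):
    """TF-IDF score per token for each document in a list of token lists.

    Assumption: idf(t) = log(N / df(t)), where N is the number of
    documents and df(t) the number of documents containing t.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document
    scores = []
    for doc in documents:
        tf = term_frequency(doc)
        scores.append(
            {tok: tf[tok] * math.log(n_docs / doc_freq[tok]) for tok in tf}
        )
    return scores


def ngrams(tokens, n):
    """Contiguous n-grams (n=2 for bi-grams, n=3 for tri-grams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

With this definition a token that occurs in every document receives a score of zero, which is why stop-word removal in the preprocessing step mainly affects raw frequency tables rather than TF-IDF rankings.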
The objective of the Framework is to build an all-in-one platform for NLP
which allows any kind of user to perform basic and advanced text analysis.
In order to make the Framework accessible to users who do not have specific
computer science and programming skills, the modules have been provided with
an intuitive GUI.
The framework can be considered hybrid in a double sense: as explained
above, it uses both statistical and rule-based methods, relying on standard
statistical algorithms or techniques and, at the same time, on
Lexicon-Grammar syntactic theory. In addition, it has been written in both
Java and Python. The LG-Starship Framework has a simple Graphical User
Interface but will also be released as separate modules which may be
included in any NLP pipeline independently.
There are many resources of this kind, but the large majority work only for
English. There are very few free resources for Italian, and this work tries
to meet this need by proposing a tool which can be used both by linguists
or other scientists interested in language and text analysis who have no
knowledge of programming languages, and by computer scientists, who can use
the free modules in their own code or in combination with different NLP
algorithms.
The Framework starts from a text or corpus written directly by the user or
loaded from an external resource. The LG-Starship Framework workflow is
described in the flowchart shown in fig. 1.
The pipeline shows that the Pre-Processing Module is applied to the original
imported or generated text in order to produce a clean and normalized
preprocessed text. This module includes a function for text splitting, a
stop-word list and a tokenization method. The Statistic Module or the
Mr. Ling Module can then be applied to the preprocessed text. The first,
which includes basic statistical algorithms such as Term Frequency, TF-IDF
and n-gram extraction, produces as output databases of lexical and numerical
data which can be used to produce charts or perform further external
analysis. The second is divided into two main tasks: a POS tagger, based on
the Averaged Perceptron Tagger [?] and trained on the Paisà Corpus
[Lyding et al., 2014], performs Part-Of-Speech tagging and produces an
annotated text; a lemmatization method, which relies on a set of electronic
dictionaries developed at the University of Salerno [Elia, 1995, Elia et
al., 2010], takes as input the POS-tagged text and produces a new lemmatized
version of the original text with information about syntactic and semantic
properties.
This lemmatized text, which can also be processed with the Statistic Module,
serves as input for two deeper levels of text analysis carried out by
the Syntactic Module and the Semantic Module.
The first relies on Lexicon-Grammar theory [Gross, 1971, 1975] and uses a
database of predicate structures in development at the Department of
Political, Social and Communication Science. Its objective is to produce a
dependency graph of the sentences that compose the text.
The Semantic Module uses the Hyperspace Analogue to Language distributional
semantics algorithm [Lund and Burgess, 1996] trained on the Paisà
Corpus to produce a semantic network of the words of the text.
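
As a rough illustration of the Hyperspace Analogue to Language idea, the sketch below slides a fixed window over a token list and, for each target word, adds a weight for every word that precedes it inside the window, with closer words weighted more heavily. The window size, the function names and the use of cosine similarity are assumptions made for this example; the abstract does not state which settings or similarity measure LG-Starship uses.

```python
import math
from collections import defaultdict


def hal_matrix(tokens, window=5):
    """Directional co-occurrence weights in the style of HAL.

    cooc[w][c] accumulates a weight each time c appears before w
    within `window` positions; adjacent words get weight `window`,
    the farthest word in the window gets weight 1.
    """
    cooc = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            cooc[word][tokens[j]] += window - dist + 1
    return cooc


def hal_vector(cooc, word, vocab):
    """Concatenate the 'preceded by' and 'followed by' weights of a word."""
    before = [cooc.get(word, {}).get(c, 0.0) for c in vocab]
    after = [cooc.get(c, {}).get(word, 0.0) for c in vocab]
    return before + after


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Computing `cosine` over the `hal_vector` of every pair of vocabulary words yields a word-by-word similarity matrix of the kind the module is said to export; thresholding that matrix is one simple way to obtain a network of related words.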
This workflow has been applied in two different experiments involving
two user-generated corpora.
The first experiment is a statistical study of the language of rap music
in Italy through the analysis of a large corpus of rap song lyrics
downloaded from online databases of user-generated lyrics.
The second experiment is a Feature-Based Sentiment Analysis project performed
on user product reviews. For this project we integrated a large domain
database of linguistic resources for Sentiment Analysis, developed in past
years by the Department of Political, Social and Communication Science of
the University of Salerno, which consists of polarized dictionaries of verbs,
adjectives, adverbs and nouns.
These two experiments underline how the linguistic framework can be applied
to different levels of analysis and can produce both qualitative and
quantitative data.
As for the results obtained, the Framework, which is only at a beta
version, achieves fair results both in terms of processing time and in
terms of precision. Nevertheless, the work is far from complete. More
algorithms will be added to the Statistic Module and the Syntactic Module
will be completed. The GUI will be improved and made more attractive and
modern and, in addition, an open-source online version of the modules will
be published. [edited by author]
XV n.s
Understanding the structure and meaning of Finnish texts: From corpus creation to deep language modelling
Natural Language Processing (NLP) is a cross-disciplinary field combining elements of computer science, artificial intelligence, and linguistics, with the objective of developing means for computational analysis, understanding or generation of human language. The primary aim of this thesis is to advance natural language processing in Finnish by providing more resources and investigating the most effective machine learning based practices for their use. The thesis focuses on NLP topics related to understanding the structure and meaning of written language, mainly concentrating on structural analysis (syntactic parsing) as well as exploring the semantic equivalence of statements that vary in their surface realization (paraphrase modelling). While the new resources presented in the thesis are developed for Finnish, most of the methodological contributions are language-agnostic, and the accompanying papers demonstrate the application and evaluation of these methods across multiple languages.
The first set of contributions of this thesis revolves around the development of a state-of-the-art Finnish dependency parsing pipeline. Firstly, the necessary Finnish training data was converted to the Universal Dependencies scheme, integrating Finnish into this important treebank collection and establishing the foundations for Finnish UD parsing. Secondly, a novel word lemmatization method based on deep neural networks is introduced and assessed across a diverse set of over 50 languages. And finally, the overall dependency parsing pipeline is evaluated on a large number of languages, securing top ranks in two competitive shared tasks focused on multilingual dependency parsing. The overall outcome of this line of research is a parsing pipeline reaching state-of-the-art accuracy in Finnish dependency parsing, the parsing numbers obtained with the latest pre-trained language models approaching (at least near) human-level performance.
The achievement of large language models in the area of dependency parsing, as well as in many other structured prediction tasks, raises the hope that large pre-trained language models genuinely comprehend language, rather than merely relying on simple surface cues. However, datasets designed to measure semantic comprehension in Finnish have been non-existent, or very scarce at best. To address this limitation, and to reflect the general change of emphasis in the field towards tasks more semantic in nature, the second part of the thesis shifts its focus to language understanding through an exploration of paraphrase modelling. The second contribution of the thesis is the creation of a novel, large-scale, manually annotated corpus of Finnish paraphrases. A unique aspect of this corpus is that its examples have been manually extracted from two related text documents, with the objective of obtaining non-trivial paraphrase pairs valuable for training and evaluating various language understanding models on paraphrasing. We show that manual paraphrase extraction can yield a corpus featuring pairs that are both notably longer and less lexically overlapping than those produced through automated candidate selection, the current prevailing practice in paraphrase corpus construction. Another distinctive feature of the corpus is that the paraphrases are identified and distributed within their document context, allowing for richer modelling and novel tasks to be defined.