External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performance of four systems on datasets covering 16 languages,
two of them feature-based (MEMMs and CRFs) and two neural (bi-LSTMs). We show
that, on average, all four approaches perform similarly and reach
state-of-the-art results. Yet better performance is obtained with our
feature-based models on lexically richer datasets (e.g. for morphologically
rich languages), whereas neural results are higher on datasets with less
lexical variability (e.g. for English). These conclusions hold in particular
for the MEMM models relying on our system MElt, which benefited from newly
designed features. This shows that, under certain conditions, feature-based
approaches enriched with morphosyntactic lexicons are competitive with neural
methods.
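As a rough illustration of how external lexical information can be injected into a feature-based tagger, the sketch below adds binary lexicon features to a CRF built with the sklearn-crfsuite library. The toy lexicon and feature names are assumptions for illustration only, not the MElt feature set described in the paper.

```python
# Minimal sketch: lexicon-derived features for a CRF POS tagger
# (hypothetical feature set, not the paper's actual MElt features).
import sklearn_crfsuite

# Toy morphosyntactic lexicon: word -> set of possible tags.
# A real lexicon for a morphologically rich language would be far larger.
LEXICON = {
    "chat": {"NOUN"},
    "mange": {"VERB", "NOUN"},
    "le": {"DET", "PRON"},
}

def word_features(sent, i):
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "suffix3": word[-3:],
        "is_upper": word.isupper(),
    }
    # External lexical information: one binary feature per lexicon tag.
    for tag in LEXICON.get(word.lower(), ()):
        feats[f"lex:{tag}"] = True
    return feats

def sent_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Usage: X is a list of feature-dict sequences, y the gold tag sequences.
X_train = [sent_features(["le", "chat", "mange"])]
y_train = [["DET", "NOUN", "VERB"]]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```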
A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging
In this paper, we propose a new approach to construct a system of
transformation rules for the Part-of-Speech (POS) tagging task. Our approach is
based on an incremental knowledge acquisition method where rules are stored in
an exception structure and new rules are only added to correct the errors of
existing rules; thus allowing systematic control of the interaction between the
rules. Experimental results on 13 languages show that our approach is fast in
terms of training time and tagging speed. Furthermore, our approach obtains
very competitive accuracy in comparison to state-of-the-art POS and
morphological taggers.Comment: Version 1: 13 pages. Version 2: Submitted to AI Communications - the
European Journal on Artificial Intelligence. Version 3: Resubmitted after
major revisions. Version 4: Resubmitted after minor revisions. Version 5: to
appear in AI Communications (accepted for publication on 3/12/2015
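The exception structure described above can be pictured as a binary tree in which every rule has an "exception" branch (tried when the rule fires) and an "else" branch (tried when it does not). The sketch below is a minimal, hypothetical rendering of that idea; the class and method names are ours, not the authors' implementation.

```python
# Sketch of a Ripple Down Rules (RDR) exception structure for tagging.
# Node layout and method names are illustrative, not the paper's code.
class RDRNode:
    def __init__(self, condition, tag):
        self.condition = condition      # predicate over a tagging context
        self.tag = tag                  # tag to assign if condition holds
        self.except_child = None        # tried when this rule fires
        self.else_child = None          # tried when condition is False

    def classify(self, context):
        if self.condition(context):
            # Exceptions take priority over this node's own conclusion.
            if self.except_child:
                result = self.except_child.classify(context)
                if result is not None:
                    return result
            return self.tag
        return self.else_child.classify(context) if self.else_child else None

    def add_exception(self, condition, tag):
        # New rules only correct errors of existing ones, so they are
        # appended under the node that produced the wrong tag.
        node = RDRNode(condition, tag)
        if self.except_child is None:
            self.except_child = node
        else:
            cur = self.except_child
            while cur.else_child:
                cur = cur.else_child
            cur.else_child = node
        return node

# Usage: a default rule tags every word NOUN; an exception fixes "-ly" adverbs.
root = RDRNode(lambda ctx: True, "NOUN")
root.add_exception(lambda ctx: ctx["word"].endswith("ly"), "ADV")
print(root.classify({"word": "quickly"}))  # ADV
print(root.classify({"word": "table"}))    # NOUN
```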
Methods for Amharic part-of-speech tagging
The paper describes a set of experiments involving the application of three
state-of-the-art part-of-speech taggers to Ethiopian Amharic, using three
different tagsets. The taggers showed worse performance than previously
reported results for English, in particular having problems with unknown
words. The best results were obtained using a Maximum Entropy approach, while
HMM-based and SVM-based taggers got comparable results.
Discovery of sensitive data with natural language processing
The process of protecting sensitive data is growing steadily in importance,
especially as a result of the directives and laws imposed by the European
Union. The effort to create automatic systems is continuous, but in most cases
the processes behind them are still manual or semi-automatic. In this work, we
have developed a component that can extract and classify sensitive data from
unstructured text in European Portuguese. The objective was to create a system
that allows organizations to understand their data and comply with legal and
security requirements. We studied a hybrid approach to the problem of Named
Entity Recognition for the Portuguese language. This approach combines several
techniques, such as rule-based/lexicon-based models, machine learning
algorithms and neural networks. The rule-based and lexicon-based approaches
were used only for a set of specific classes. For the remaining classes of
entities, the SpaCy and Stanford NLP tools were tested, two statistical models
(Conditional Random Fields and Random Forest) were implemented and, finally, a
Bidirectional LSTM approach was tested. The best results were achieved with
the Stanford NER model (86.41%) from the Stanford NLP tool. Regarding the
statistical models, Conditional Random Fields obtained the best results, with
an F1-score of 65.50%. The Bi-LSTM approach achieved an F1-score of 83.01%.
The corpora used for training and testing were the HAREM Golden Collection,
the SIGARRA News Corpus and the DataSense NER Corpus.
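To make the hybrid architecture concrete, the sketch below combines a rule-based layer for one specific class with a statistical NER model, using spaCy v3's EntityRuler placed before the statistical "ner" component so that rule matches take precedence. The model name, class label and pattern are placeholder assumptions, not the resources used in this work.

```python
# Sketch of the hybrid idea: rule-based patterns for a few specific
# classes, a statistical NER model for everything else. The label and
# pattern below are placeholders, not this work's actual resources.
import spacy

nlp = spacy.load("pt_core_news_sm")  # any spaCy model with an "ner" pipe

# Rule-based layer: an EntityRuler pattern for a class that is easier to
# capture with rules (here, a toy nine-digit NIF-like identifier).
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "NIF", "pattern": [{"TEXT": {"REGEX": r"^\d{9}$"}}]},
])

doc = nlp("O João Silva, NIF 123456789, vive no Porto.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Rule hits (NIF) and statistical hits (e.g. persons, locations) appear
# side by side; placing the ruler before "ner" gives its spans priority.
```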
Model Transfer for Tagging Low-resource Languages using a Bilingual Dictionary
Cross-lingual model transfer is a compelling and popular method for
predicting annotations in a low-resource language, whereby parallel corpora
provide a bridge to a high-resource language and its associated annotated
corpora. However, parallel data is not readily available for many languages,
limiting the applicability of these approaches. We address these drawbacks in
our framework which takes advantage of cross-lingual word embeddings trained
solely on a high coverage bilingual dictionary. We propose a novel neural
network model for joint training from both sources of data based on
cross-lingual word embeddings, and show substantial empirical improvements over
baseline techniques. We also propose several active learning heuristics, which
result in improvements over competitive benchmark methods.
Comment: 5 pages with 2 pages of references. Accepted to appear in ACL 201
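The paper trains cross-lingual word embeddings from a bilingual dictionary; one standard way to obtain such embeddings is an orthogonal (Procrustes) mapping fitted on dictionary pairs, sketched below with synthetic data. This is a generic illustration of the technique, not necessarily the paper's exact formulation.

```python
# Generic illustration: learning a cross-lingual embedding mapping from
# a bilingual dictionary via orthogonal Procrustes (standard technique,
# not necessarily the paper's exact model).
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 50, 200

# Stand-ins for pretrained monolingual embeddings of dictionary pairs:
# X[i] is the source-language vector, Y[i] its dictionary translation.
X = rng.normal(size=(n_pairs, d))
true_rotation = np.linalg.qr(rng.normal(size=(d, d)))[0]
Y = X @ true_rotation + 0.01 * rng.normal(size=(n_pairs, d))

# Procrustes solution: W = U V^T where U S V^T = SVD(X^T Y), which
# minimizes ||XW - Y||_F over orthogonal matrices W.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

# Mapped source vectors now live in the target space; a tagger trained
# on target-language data can then be applied to mapped source words.
print("alignment error:", np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))
```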