External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performance of four systems on datasets covering 16 languages,
two of these systems being feature-based (MEMMs and CRFs) and two of them being
neural-based (bi-LSTMs). We show that, on average, all four approaches perform
similarly and reach state-of-the-art results. Yet better performance is
obtained with our feature-based models on lexically richer datasets (e.g. for
morphologically rich languages), whereas neural-based results are higher on
datasets with less lexical variability (e.g. for English). These conclusions
hold in particular for the MEMM models relying on our system MElt, which
benefited from newly designed features. This shows that, under certain
conditions, feature-based approaches enriched with morphosyntactic lexicons are
competitive with neural methods.
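The idea of enriching a feature-based tagger with a morphosyntactic lexicon can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the toy `LEXICON`, the function name `token_features`, and the feature names are hypothetical and are not the feature set of MElt or of any system compared in the abstract.

```python
from typing import Dict, List, Set

# Toy morphosyntactic lexicon (hypothetical entries): word form -> possible tags.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "walks": {"NOUN", "VERB"},
    "dog": {"NOUN"},
}


def lexicon_tags(word: str, lexicon: Dict[str, Set[str]]) -> str:
    """Return the sorted tags the lexicon lists for a word, or "UNK" if absent."""
    tags = lexicon.get(word.lower())
    return "|".join(sorted(tags)) if tags else "UNK"


def token_features(tokens: List[str], i: int,
                   lexicon: Dict[str, Set[str]]) -> Dict[str, str]:
    """Build a feature dictionary for token i, as a MEMM or CRF might consume."""
    word = tokens[i]
    return {
        # Surface features that do not need any external resource.
        "word": word.lower(),
        "suffix3": word.lower()[-3:],
        "is_capitalized": str(word[:1].isupper()),
        # Lexicon features: ambiguity class of the current and the next word.
        "lex_tags": lexicon_tags(word, lexicon),
        "next_lex_tags": (lexicon_tags(tokens[i + 1], lexicon)
                          if i + 1 < len(tokens) else "EOS"),
    }


features = token_features(["The", "dog", "walks"], 2, LEXICON)
print(features["lex_tags"])  # NOUN|VERB
```

The lexicon feature exposes the set of tags a word form may take, which gives the model a useful signal for word forms it never saw in the training data.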
Discovery of sensitive data with natural language processing
The protection of sensitive data is becoming increasingly important,
especially as a result of the directives and laws imposed by the European Union. The effort
to create automatic systems is continuous, but in most cases the processes behind them are
still manual or semi-automatic. In this work, we have developed a component that can extract
and classify sensitive data from unstructured text in European Portuguese. The
objective was to create a system that allows organizations to understand their data and meet
legal compliance and security requirements. We studied a hybrid approach to the problem of Named
Entity Recognition for the Portuguese language. This approach combines several techniques,
such as rule-based and lexicon-based models, machine learning algorithms and neural networks. The
rule-based and lexicon-based approaches were used only for a set of specific classes. For the remaining
classes of entities, the SpaCy and Stanford NLP tools were tested, two statistical models –
Conditional Random Fields and Random Forest – were implemented and, finally, a Bidirectional
LSTM approach was tested. The best results were achieved with the Stanford NER model
(86.41%), from the Stanford NLP tool. Among the statistical models, Conditional
Random Fields obtained the best results, with an F1-score of 65.50%. With
the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and
testing were HAREM Golden Collection, SIGARRA News Corpus and DataSense NER Corpus.
Robust input representations for low-resource information extraction
Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods for various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
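One common way to build a joint representation from several embedding methods is an attention-weighted combination of projected vectors. The sketch below illustrates that general idea only; it is not the architecture proposed in the thesis, and the names `meta_embed`, `projections`, and `query`, as well as the randomly initialised weights, are assumptions made for the example (in a real model these parameters would be learned).

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def meta_embed(embeddings: List[Vector], projections: List[Matrix],
               query: Vector) -> Tuple[Vector, Vector]:
    """Combine embeddings of different sizes into one joint vector.

    Each embedding is projected to a common dimension, scored against a
    query vector, and the projected vectors are averaged with softmax weights.
    """
    projected = [matvec(p, e) for p, e in zip(projections, embeddings)]
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in projected]
    weights = softmax(scores)
    joint = [sum(w * vec[k] for w, vec in zip(weights, projected))
             for k in range(len(query))]
    return joint, weights


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
common_dim = 4
emb_a = [rng.uniform(-1, 1) for _ in range(3)]  # e.g. a 3-d word embedding
emb_b = [rng.uniform(-1, 1) for _ in range(5)]  # e.g. a 5-d character embedding
projections = [random_matrix(common_dim, 3, rng), random_matrix(common_dim, 5, rng)]
query = [rng.uniform(-1, 1) for _ in range(common_dim)]

joint, weights = meta_embed([emb_a, emb_b], projections, query)
```

Projecting to a common dimension first lets embeddings of different sizes be mixed, and the attention weights let the combination favour whichever source is more informative for a given token.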
Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization
Translation alignment is an essential task in Digital Humanities and Natural
Language Processing, and it aims to link words/phrases in the source
text with their translation equivalents in the translation. In addition to
its importance in teaching and learning historical languages, translation
alignment builds bridges between ancient and modern languages through
which various linguistic annotations can be transferred. This thesis focuses
on word-level translation alignment applied to historical languages in general
and Ancient Greek and Latin in particular. As the title indicates, the thesis
addresses four interdisciplinary aspects of translation alignment.
The starting point was developing Ugarit, an interactive annotation tool
to perform manual alignment aiming to gather training data to train an
automatic alignment model. This effort resulted in more than 190k accurate
translation pairs that I later used for supervised training. Ugarit has been
used by many researchers and scholars, and also in the classroom at several
institutions for teaching and learning ancient languages, which resulted
in a large, diverse crowd-sourced aligned parallel corpus allowing us to
conduct experiments and qualitative analysis to detect recurring patterns in
annotators’ alignment practice and the generated translation pairs.
Further, I employed the recent advances in NLP and language modeling to
develop an automatic alignment model for historical low-resourced languages,
experimenting with various training objectives and proposing a training
strategy for historical languages that combines supervised and unsupervised
training with mono- and multilingual texts. Then, I integrated this alignment
model into other development workflows to project cross-lingual annotations
and induce bilingual dictionaries from parallel corpora.
Evaluation is essential to assess the quality of any model. To ensure best practice, I reviewed the current evaluation procedure, identified
its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold
standard datasets and support quantitative and qualitative evaluation of
translation alignment models. Besides, I designed and implemented visual
analytics tools and reading environments for parallel texts and proposed
various visualization approaches to support different alignment-related tasks
employing the latest advances in information visualization and best practice.
Overall, this thesis presents a comprehensive study that includes manual and
automatic alignment techniques, evaluation methods and visual analytics
tools that aim to advance the field of translation alignment for historical
languages.
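The two downstream uses of word alignment mentioned in this abstract, projecting annotations across languages and inducing a bilingual dictionary, can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is given as a list of (source index, target index) pairs, and the function names `project_tags` and `induce_dictionary` and the Latin-English example are hypothetical, not taken from the thesis or from Ugarit.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Alignment = List[Tuple[int, int]]  # (source token index, target token index)


def project_tags(source_tags: List[str], alignment: Alignment,
                 target_len: int, default: str = "_") -> List[str]:
    """Copy each source tag onto the target token it is aligned with.

    Unaligned target tokens keep the default tag; if several source tokens
    align to one target token, the first link wins.
    """
    projected = [default] * target_len
    for src_idx, tgt_idx in alignment:
        if projected[tgt_idx] == default:
            projected[tgt_idx] = source_tags[src_idx]
    return projected


def induce_dictionary(
    corpus: List[Tuple[List[str], List[str], Alignment]]
) -> Dict[str, str]:
    """Map each source word to the target word it is most often aligned with."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for source, target, alignment in corpus:
        for src_idx, tgt_idx in alignment:
            counts[source[src_idx]][target[tgt_idx]] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


source = ["puella", "rosam", "amat"]
target = ["the", "girl", "loves", "the", "rose"]
alignment = [(0, 1), (1, 4), (2, 2)]

print(project_tags(["NOUN", "NOUN", "VERB"], alignment, len(target)))
# ['_', 'NOUN', 'VERB', '_', 'NOUN']
print(induce_dictionary([(source, target, alignment)]))
# {'puella': 'girl', 'rosam': 'rose', 'amat': 'loves'}
```

Target tokens with no aligned source token (here the English articles) stay unlabelled, which is the usual gap a projection step has to handle separately.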
Supervised learning for the detection of negation and of its scope in French and Brazilian Portuguese biomedical corpora
Automatic detection of negated content is often a prerequisite in information extraction systems in various domains. In the biomedical domain especially, this task is important because negation plays an important role. In this work, two main contributions are proposed. First, we work with languages which have been poorly addressed up to now: Brazilian Portuguese and French. We therefore developed new corpora for these two languages, which have been manually annotated to mark up the negation cues and their scope. Second, we propose automatic methods based on supervised machine learning approaches for the automatic detection of negation cues and of their scopes. The methods prove robust in both languages (Brazilian Portuguese and French) and in cross-domain (general and biomedical language) contexts. The approach is also validated on English data from the state of the art: it yields very good results and outperforms other existing approaches. In addition, the application is accessible and usable online. We assume that, through these contributions (new annotated corpora, an application accessible online, and cross-domain robustness), the reproducibility of the results and the robustness of NLP applications will be improved.