Automatic generation of named entity taggers leveraging parallel corpora
The lack of hand curated data is a major impediment to developing statistical semantic
processors for many of the world's languages. A major issue of semantic processors in Natural
Language Processing (NLP) is that they require manually annotated data to perform
accurately. Our work aims to address this issue by leveraging existing annotations and
semantic processors from multiple source languages by projecting their annotations via
statistical word alignments traditionally used in Machine Translation. Taking the Named
Entity Recognition (NER) task as a use case of semantic processing, this work presents
a method to automatically induce Named Entity taggers using parallel data, without any
manual intervention. Our method leverages existing semantic processors and annotations
to overcome the lack of annotation data for a given language. The intuition is to transfer
or project semantic annotations, from multiple sources to a target language, by statistical
word alignment methods applied to parallel texts (Och and Ney, 2000; Liang et al., 2006).
The projected annotations can then be used to automatically generate semantic processors
for the target language. In this way we would be able to provide NLP processors without
training data for the target language. The experiments are focused on 4 languages:
German, English, Spanish and Italian, and our empirical evaluation results show that our
method obtains competitive results when compared with models trained on gold-standard
out-of-domain data. This shows that our projection algorithm effectively transports NER
annotations across languages via parallel data, thus providing a fully automatic method to
obtain NER taggers for as many languages as are aligned via parallel corpora.
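To make the projection idea in the abstract above concrete, here is a minimal sketch of copying entity labels from a source sentence to a target sentence through word alignments. It is an illustration only, not the authors' algorithm: the function name `project_labels`, the plain (non-BIO) label scheme, the single-source setting, and the hand-written alignment pairs are all assumptions. In practice the alignments would come from a statistical word aligner run over parallel text.

```python
from typing import List, Tuple


def project_labels(
    source_labels: List[str],
    alignments: List[Tuple[int, int]],
    target_len: int,
) -> List[str]:
    """Copy each labeled source token's label to the target tokens aligned with it.

    alignments holds (source_index, target_index) pairs. Target tokens with
    no aligned entity token keep the outside label "O".
    """
    target_labels = ["O"] * target_len
    for src_idx, tgt_idx in alignments:
        label = source_labels[src_idx]
        if label != "O":
            target_labels[tgt_idx] = label
    return target_labels


# Hypothetical example: English "the European Commission" aligned to
# Spanish "la Comisión Europea" (note the reordered entity tokens).
source_labels = ["O", "ORG", "ORG"]
alignments = [(0, 0), (1, 2), (2, 1)]
projected = project_labels(source_labels, alignments, target_len=3)
print(projected)
```

The projected target labels could then serve as automatically generated training data for a target-language tagger, which is the use the abstract describes.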
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Natural language processing (NLP) applications such as named entity
recognition (NER) for low-resource corpora do not benefit from recent advances
in the development of large language models (LLMs) where there is still a need
for larger annotated datasets. This research article introduces a methodology
for generating translated versions of annotated datasets through crosslingual
annotation projection. Leveraging a language agnostic BERT-based approach, it
is an efficient solution to enlarge low-resource corpora with little human
effort, using only already available open data resources. Quantitative
and qualitative evaluations are often lacking when it comes to evaluating the
quality and effectiveness of semi-automatic data generation strategies. The
evaluation of our crosslingual annotation projection approach showed both
effectiveness and high accuracy in the resulting dataset. As a practical
application of this methodology, we present the creation of French Annotated
Resource with Semantic Information for Medical Entities Detection (FRASIMED),
an annotated corpus comprising 2'051 synthetic clinical cases in French. The
corpus is now available for researchers and practitioners to develop and refine
French natural language processing (NLP) applications in the clinical field
(https://zenodo.org/record/8355629), making it the largest open annotated
corpus with linked medical concepts in French.
A Cross-Lingual Similarity Measure for Detecting Biomedical Term Translations
Bilingual dictionaries for technical terms such as biomedical terms are an important resource for machine translation systems as well as for humans who would like to understand a concept described in a foreign language. Often a biomedical term is first proposed in English and later it is manually translated to other languages. Despite the fact that there are large monolingual lexicons of biomedical terms, only a fraction of those term lexicons are translated to other languages. Manually compiling large-scale bilingual dictionaries for technical domains is a challenging task because it is difficult to find a sufficiently large number of bilingual experts. We propose a cross-lingual similarity measure for detecting most similar translation candidates for a biomedical term specified in one language (source) from another language (target). Specifically, a biomedical term in a language is represented using two types of features: (a) intrinsic features that consist of character n-grams extracted from the term under consideration, and (b) extrinsic features that consist of unigrams and bigrams extracted from the contextual windows surrounding the term under consideration. We propose a cross-lingual similarity measure using each of those feature types. First, to reduce the dimensionality of the feature space in each language, we propose prototype vector projection (PVP), a non-negative lower-dimensional vector projection method. Second, we propose a method to learn a mapping between the feature spaces in the source and target language using partial least squares regression (PLSR). The proposed method requires only a small number of training instances to learn a cross-lingual similarity measure. The proposed PVP method outperforms popular dimensionality reduction methods such as the singular value decomposition (SVD) and non-negative matrix factorization (NMF) in a nearest neighbor prediction task.
Moreover, our experimental results covering several language pairs such as English–French, English–Spanish, English–Greek, and English–Japanese show that the proposed method outperforms several other feature projection methods in biomedical term translation prediction tasks.
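The intrinsic features described in the abstract above (character n-grams of a term) can be illustrated with a short sketch. This shows only feature extraction plus a plain cosine comparison; it does not implement the paper's PVP or PLSR steps, and the names `char_ngrams` and `cosine`, the boundary marker `#`, and the n-gram range of 2 to 3 are assumptions chosen for the example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term: str, n_min: int = 2, n_max: int = 3) -> Counter:
    """Count character n-grams of a term, with '#' marking the term boundaries."""
    padded = f"#{term.lower()}#"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical term pairs: a cognate shares n-grams, an unrelated term does not.
related = cosine(char_ngrams("hepatitis"), char_ngrams("hépatite"))
unrelated = cosine(char_ngrams("hepatitis"), char_ngrams("fracture"))
print(related > unrelated)
```

Direct n-gram overlap of this kind only helps for language pairs that share a script and cognates, which is presumably one reason the abstract proposes learning a mapping between the source and target feature spaces instead.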
Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon
This paper proposes to advance the current state of the art of automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from the Web 2.0 and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses an important problem which currently affects the Computational Linguistics area, interoperability, by making use of the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, in order to check the usefulness of the constructed resource, we apply it to a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system's accuracy by 28.1%. Compared to previous approaches to build NE repositories, the current proposal represents a step forward in terms of automation, language independence, amount of NEs acquired and richness of the information represented.
Event Extraction: A Survey
Extracting the reported events from text is one of the key research themes in
natural language processing. This process includes several tasks such as event
detection, argument extraction, and role labeling. As one of the most important
topics in natural language processing and natural language understanding, the
applications of event extraction span a wide range of domains such as
newswire, biomedical domain, history and humanity, and cyber security. This
report presents a comprehensive survey for event detection from textual
documents. In this report, we provide the task definition, the evaluation
method, as well as the benchmark datasets and a taxonomy of methodologies for
event extraction. We also present our vision of future research direction in
event detection. Comment: 20 page
T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks
In the absence of readily available labeled data for a given sequence
labeling task and language, annotation projection has been proposed as one of
the possible strategies to automatically generate annotated data. Annotation
projection has often been formulated as the task of transporting, on parallel
corpora, the labels pertaining to a given span in the source language into its
corresponding span in the target language. In this paper we present
T-Projection, a novel approach for annotation projection that leverages large
pretrained text-to-text language models and state-of-the-art machine
translation technology. T-Projection decomposes the label projection task into
two subtasks: (i) A candidate generation step, in which a set of projection
candidates using a multilingual T5 model is generated and, (ii) a candidate
selection step, in which the generated candidates are ranked based on
translation probabilities. We conducted experiments on intrinsic and extrinsic
tasks in 5 Indo-European and 8 low-resource African languages. We demonstrate
that T-projection outperforms previous annotation projection methods by a wide
margin. We believe that T-Projection can help to automatically alleviate the
lack of high-quality training data for sequence labeling tasks. Code and data
are publicly available. Comment: Findings of the EMNLP 202
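The candidate selection step described in the abstract above, in which generated candidates are ranked by translation probability, can be sketched as follows. This is a toy illustration under stated assumptions, not the T-Projection implementation: `rank_candidates`, `toy_prob`, and the probability table `TOY_PROBS` are hypothetical stand-ins. In the described system the candidates would be generated by a multilingual T5 model and scored with machine translation technology.

```python
from typing import Callable, Dict, List, Tuple


def rank_candidates(
    source_span: str,
    candidates: List[str],
    translation_prob: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score every target candidate against the source span, best first."""
    scored = [(cand, translation_prob(source_span, cand)) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical translation probabilities, standing in for a real MT scorer.
TOY_PROBS: Dict[Tuple[str, str], float] = {
    ("European Commission", "Comisión Europea"): 0.9,
    ("European Commission", "la Comisión"): 0.3,
    ("European Commission", "Europea"): 0.2,
}


def toy_prob(source: str, target: str) -> float:
    """Look up a toy probability; unseen pairs score 0.0."""
    return TOY_PROBS.get((source, target), 0.0)


ranked = rank_candidates(
    "European Commission",
    ["Europea", "la Comisión", "Comisión Europea"],
    toy_prob,
)
best = ranked[0][0]
print(best)
```

Passing the scorer in as a function keeps the ranking logic independent of whichever translation model supplies the probabilities.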