11 research outputs found
FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection
Natural language processing (NLP) applications such as named entity
recognition (NER) for low-resource corpora do not fully benefit from recent
advances in large language models (LLMs): larger annotated datasets are
still needed. This research article introduces a methodology for generating
translated versions of annotated datasets through crosslingual annotation
projection. Leveraging a language-agnostic BERT-based approach, it offers an
efficient way to enlarge low-resource corpora with little human effort,
using only openly available data resources. Quantitative and qualitative
evaluations are often lacking when it comes to assessing the quality and
effectiveness of semi-automatic data generation strategies. The evaluation
of our crosslingual annotation projection approach showed both its
effectiveness and the high accuracy of the resulting dataset. As a practical
application of this methodology, we present the creation of the French
Annotated Resource with Semantic Information for Medical Entities Detection
(FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in
French. The corpus is now available for researchers and practitioners to
develop and refine French NLP applications in the clinical field
(https://zenodo.org/record/8355629), making it the largest open annotated
corpus with linked medical concepts in French.
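The projection step at the heart of this methodology can be sketched in a few
lines. This is a hypothetical illustration, not the paper's implementation:
the token-level alignment is assumed to come from a multilingual BERT-based
aligner, and the span format and function name are invented for the example.

```python
def project_annotations(src_spans, alignment):
    """Project labeled source spans through a token alignment.

    src_spans: list of (start, end, label) over source token indices
               (end exclusive).
    alignment: list of (src_idx, tgt_idx) token-alignment pairs.
    Returns (start, end, label) spans over target tokens, taking the
    minimal contiguous target span covering the aligned tokens.
    """
    projected = []
    for start, end, label in src_spans:
        tgt_indices = [t for s, t in alignment if start <= s < end]
        if not tgt_indices:
            continue  # span has no aligned target tokens; drop it
        projected.append((min(tgt_indices), max(tgt_indices) + 1, label))
    return projected


# English source: "The patient has type 2 diabetes"   (tokens 0..5)
# French target:  "Le patient a un diabète de type 2" (tokens 0..7)
src_spans = [(3, 6, "DISORDER")]  # "type 2 diabetes"
alignment = [(0, 0), (1, 1), (2, 2), (3, 6), (4, 7), (5, 4)]
print(project_annotations(src_spans, alignment))
# -> [(4, 8, 'DISORDER')], i.e. "diabète de type 2"
```

Taking the minimal covering span on the target side is a common heuristic for
non-monotonic alignments; an unaligned span is simply dropped rather than
guessed.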
Semi-supervised SRL system with Bayesian inference
We propose a new approach to semi-supervised training of Semantic Role Labeling (SRL) models with a very small amount of initial labeled data. The proposed approach combines supervised and unsupervised training in a novel way: the supervised classifier is forced to over-generate potential semantic candidates, and unsupervised inference then chooses the best ones. Hence, the supervised classifier can be trained on a very small corpus and with coarse-grained features, because its precision does not need to be high: its role is mainly to constrain Bayesian inference to explore only a limited part of the full search space. This approach is evaluated on French and English. In both cases, it achieves very good performance, outperforming a strong supervised baseline when only a small number of annotated sentences is available, even without using any previously trained syntactic parser.
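The division of labour described here, a permissive scorer followed by a
constrained selection step, can be illustrated schematically. The paper uses
Bayesian inference for the selection; the greedy selector below is only a
stand-in for it, and every name and score in the example is invented.

```python
def select_roles(candidates):
    """candidates: {arg_id: [(role, score), ...]} from a deliberately
    permissive (over-generating) supervised scorer.

    Greedily assign the highest-scoring (argument, role) pairs, skipping
    roles already used in the frame, i.e. enforcing the role-uniqueness
    constraint that the real inference step would impose."""
    pairs = [(score, arg, role)
             for arg, options in candidates.items()
             for role, score in options]
    assignment, used_roles = {}, set()
    for score, arg, role in sorted(pairs, reverse=True):
        if arg in assignment or role in used_roles:
            continue
        assignment[arg] = role
        used_roles.add(role)
    return assignment


cands = {
    "a1": [("ARG0", 0.6), ("ARG1", 0.5)],
    "a2": [("ARG0", 0.7), ("ARG1", 0.4)],
}
print(select_roles(cands))
# -> {'a2': 'ARG0', 'a1': 'ARG1'}: a2 wins ARG0, so a1 falls back to ARG1
```

Note that the scorer's precision hardly matters here: it only has to put the
right role somewhere in each candidate list, and the constraint does the rest.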
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
Scaling up Automatic Cross-Lingual Semantic Role Annotation
Broad-coverage semantic annotations for training statistical learners are only available for a handful of languages. Previous approaches to cross-lingual transfer of semantic annotations have addressed this problem with encouraging results on a small scale. In this paper, we scale up previous efforts by using an automatic approach to semantic annotation that does not rely on a semantic ontology for the target language. Moreover, we improve the quality of the transferred semantic annotations by using a joint syntactic-semantic parser that learns the correlations between syntax and semantics of the target language and smooths out the errors from automatic transfer. We reach labelled F-measures for predicates and arguments only 4 and 9 percentage points, respectively, below the upper bound from manual annotations.
Unsupervised induction of semantic roles
In recent years, a considerable amount of work has been devoted to the task of automatic
frame-semantic analysis. Given the relative maturity of syntactic parsing technology,
which is an important prerequisite, frame-semantic analysis represents a realistic
next step towards broad-coverage natural language understanding and has been
shown to benefit a range of natural language processing applications such as information
extraction and question answering.
Due to the complexity which arises from variations in syntactic realization, data-driven
models based on supervised learning have become the method of choice for this task.
However, the reliance on large amounts of semantically labeled data which is costly
to produce for every language, genre and domain, presents a major barrier to the
widespread application of the supervised approach.
This thesis therefore develops unsupervised machine learning methods, which
automatically induce frame-semantic representations without making use of
semantically labeled data. If successful, unsupervised methods would render
manual data annotation unnecessary and therefore greatly benefit the
applicability of automatic frame-semantic analysis.
We focus on the problem of semantic role induction, in which all the argument instances
occurring together with a specific predicate in a corpus are grouped into clusters
according to their semantic role. Our hypothesis is that semantic roles can be induced
without human supervision from a corpus of syntactically parsed sentences, by
leveraging the syntactic relations conveyed through parse trees together with
lexical-semantic information.
We argue that semantic role induction can be guided by three linguistic principles. The
first is the well-known constraint that semantic roles are unique within a particular
frame. The second is that the arguments occurring in a specific syntactic
position within a specific linking (a particular mapping from semantic roles
to syntactic positions) all bear the same semantic role. The third principle is that
the (asymptotic) distribution over argument heads is the same for two clusters which
represent the same semantic role.

We consider two approaches to semantic role induction based on two fundamentally
different perspectives on the problem. Firstly, we develop feature-based probabilistic
latent structure models which capture the statistical relationships that hold between the
semantic role and other features of an argument instance. Secondly, we conceptualize
role induction as the problem of partitioning a graph whose vertices represent argument
instances and whose edges express similarities between these instances. The graph
thus represents all the argument instances for a particular predicate occurring in the
corpus. The similarities with respect to different features are represented on different
edge layers and accordingly we develop algorithms for partitioning such multi-layer
graphs.
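The multi-layer graph view can be made concrete with a toy sketch: argument
instances are vertices, each feature contributes its own layer of similarity
edges, and partitioning runs on the combined graph. Merging layers by a
weighted sum and taking connected components above a threshold is only a
stand-in for the thesis's actual partitioning algorithms, and the layer
weights and similarities below are invented.

```python
def partition(n, layers, weights, threshold):
    """n: number of argument instances (vertices 0..n-1).
    layers: list of {(u, v): similarity} edge dicts, one per feature layer
            (e.g. syntactic position, argument head lemma).
    weights/threshold: layer combination weights and edge-keeping cutoff.
    Returns role clusters as a sorted list of vertex lists."""
    combined = {}
    for layer, w in zip(layers, weights):
        for edge, sim in layer.items():
            combined[edge] = combined.get(edge, 0.0) + w * sim
    # Union-find over the edges that survive the threshold.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for (u, v), sim in combined.items():
        if sim >= threshold:
            parent[find(u)] = find(v)
    clusters = {}
    for v in range(n):
        clusters.setdefault(find(v), []).append(v)
    return sorted(clusters.values())


syntax_layer = {(0, 1): 1.0, (2, 3): 1.0}   # same syntactic position
lexical_layer = {(1, 2): 0.2, (2, 3): 0.9}  # similar argument heads
print(partition(4, [syntax_layer, lexical_layer], [0.5, 0.5], 0.5))
# -> [[0, 1], [2, 3]]: two induced role clusters for this predicate
```

The point of the layered representation is visible even at this scale: the
weak lexical edge between instances 1 and 2 is not enough on its own, while
the agreement of both layers on (2, 3) keeps that pair together.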
We empirically validate our models and the principles they are based on and show that
our graph partitioning models have several advantages over the feature-based models.
In a series of experiments on both English and German the graph partitioning models
outperform the feature-based models and yield significantly better scores over a strong
baseline which directly identifies semantic roles with syntactic positions.
In sum, we demonstrate that relatively high-quality shallow semantic
representations can be induced without human supervision, and we foreground a
promising direction of future research aimed at overcoming the problem of
acquiring large amounts of lexical-semantic knowledge.