131 research outputs found

    Evaluating Pretrained Transformer-based Models on the Task of Fine-Grained Named Entity Recognition

    Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task and has remained an active research field. In recent years, transformer models, and more specifically the BERT model developed at Google, revolutionised the field of NLP. While the performance of transformer-based approaches such as BERT has been studied for NER, there has not yet been a study for the fine-grained Named Entity Recognition (FG-NER) task. In this paper, we compare three transformer-based models (BERT, RoBERTa, and XLNet) to two non-transformer-based models (CRF and BiLSTM-CNN-CRF). Furthermore, we apply each model to a multitude of distinct domains. We find that transformer-based models incrementally outperform the studied non-transformer-based models in most domains with respect to the F1 score. Furthermore, we find that the choice of domain significantly influences performance, regardless of the respective data size or the model chosen.
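    A minimal sketch of how such a transformer-based NER model can be fine-tuned, using the Hugging Face Transformers library; the fine-grained label set, example sentence, and hyperparameters are placeholders, not the paper's actual setup or data.

```python
# Minimal sketch (not the authors' setup): fine-tuning a BERT token classifier
# for fine-grained NER with Hugging Face Transformers. Labels and data below
# are hypothetical placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-PERSON", "I-PERSON", "B-COMPANY", "I-COMPANY"]  # hypothetical fine-grained tags
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

tokens = ["Angela", "Merkel", "visited", "Siemens", "."]
word_labels = [1, 2, 0, 3, 0]  # indices into `labels`

# Tokenize with word alignment so each sub-word token inherits its word's label.
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
aligned = [
    -100 if wid is None else word_labels[wid]  # -100 = ignored by the loss
    for wid in enc.word_ids(batch_index=0)
]

outputs = model(**enc, labels=torch.tensor([aligned]))
outputs.loss.backward()  # gradient for one training step; F1 evaluation omitted
```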

    Enhancing Text-to-SQL Translation for Financial System Design

    Text-to-SQL, the task of translating natural language questions into SQL queries, is part of various business processes. Its automation, an emerging challenge, will empower software practitioners to seamlessly interact with relational databases using natural language, thereby bridging the gap between business needs and software capabilities. In this paper, we consider Large Language Models (LLMs), which have achieved state-of-the-art results on various NLP tasks. Specifically, we benchmark Text-to-SQL performance, the evaluation methodologies, as well as input optimization (e.g., prompting). In light of our empirical observations, we propose two novel metrics designed to adequately measure the similarity between SQL queries. Overall, we share with the community various findings, notably on how to select the right LLM for Text-to-SQL tasks. We further demonstrate that a tree-based edit distance constitutes a reliable metric for assessing the similarity between generated SQL queries and the oracle queries when benchmarking Text-to-SQL approaches. This metric is important because it relieves researchers from computationally expensive experiments, such as executing the generated queries, as done in prior work. Our work implements financial-domain use cases and therefore contributes to the advancement of Text-to-SQL systems and their practical adoption in this domain.
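    As an illustration of the tree-based edit distance mentioned above, the following hedged sketch parses two queries into syntax trees with sqlglot and compares them with the Zhang-Shasha algorithm from the zss package; the choice of libraries, node labels, and example queries are assumptions, not the paper's implementation.

```python
# Hedged sketch of a tree-based edit distance between SQL queries: parse each
# query into a syntax tree with sqlglot, then compare the trees with the
# Zhang-Shasha algorithm from the `zss` package. Illustrative only.
import sqlglot
import zss

def children(node):
    # Flatten sqlglot's argument dict into a list of child expression nodes.
    out = []
    for value in node.args.values():
        items = value if isinstance(value, list) else [value]
        out.extend(v for v in items if isinstance(v, sqlglot.exp.Expression))
    return out

def label(node):
    # Use the node type plus any identifier/literal text as the node label.
    return f"{node.key}:{node.name}" if node.name else node.key

q1 = sqlglot.parse_one("SELECT name FROM clients WHERE balance > 1000")
q2 = sqlglot.parse_one("SELECT name FROM clients WHERE balance >= 1000")

distance = zss.simple_distance(q1, q2, get_children=children, get_label=label)
print(distance)  # small distance: the trees differ only in the comparison operator
```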

    Highly phosphorescent perfect green emitting iridium(III) complex for application in OLEDs

    A novel iridium complex, [bis-(2-phenylpyridine)(2-carboxy-4-dimethylaminopyridine)iridium(III)] (N984), was synthesized and characterized using spectroscopic and electrochemical methods. A solution-processable OLED device incorporating the N984 complex displays an electroluminescence spectrum with a narrow bandwidth of 70 nm at half of its maximum intensity, with colour coordinates of x = 0.322, y = 0.529, which are very close to those suggested by the PAL standard for a green emitter.

    Evaluating the Impact of Text De-Identification on Downstream NLP Tasks

    Data anonymisation is often required to comply with regulations when transferring information across departments or entities. However, the risk is that this procedure distorts the data and jeopardises the models built on it. Intuitively, training an NLP model on anonymised data may lower the performance of the resulting model compared to a model trained on non-anonymised data. In this paper, we investigate the impact of de-identification on the performance of nine downstream NLP tasks. We focus on the anonymisation and pseudonymisation of personal names and compare six different anonymisation strategies for two state-of-the-art pre-trained models. Based on these experiments, we formulate recommendations on how de-identification should be performed to guarantee accurate NLP models. Our results reveal that de-identification does have a negative impact on the performance of NLP models, but this impact is relatively low. We also find that using pseudonymisation techniques involving random names leads to better performance across most tasks.
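    A minimal sketch of the random-name pseudonymisation strategy evaluated above: every detected personal name is consistently mapped to a substitute drawn from a name pool. The name pool, the example text, and the assumption that name spans have already been detected (by an NER model in practice) are illustrative, not the paper's pipeline.

```python
# Minimal sketch of pseudonymisation with random names: each detected personal
# name is consistently replaced by a substitute drawn from a name pool.
# Detection of the name spans (an NER model in practice) is assumed done.
import random

NAME_POOL = ["Alex Morgan", "Jamie Lee", "Robin Patel", "Sam Keller"]  # illustrative pool

def pseudonymise(text, detected_names, seed=0):
    rng = random.Random(seed)
    mapping = {}
    for name in detected_names:
        if name not in mapping:
            mapping[name] = rng.choice(NAME_POOL)  # same person -> same pseudonym
    for original, substitute in mapping.items():
        text = text.replace(original, substitute)
    return text

doc = "John Smith approved the transfer requested by Marie Curie."
print(pseudonymise(doc, ["John Smith", "Marie Curie"]))
```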

    Universality of clone dynamics during tissue development.

    The emergence of complex organs is driven by the coordinated proliferation, migration and differentiation of precursor cells. The fate behaviour of these cells is reflected in the time evolution of their progeny, termed clones, which serve as a key experimental observable. In adult tissues, where cell dynamics is constrained by the condition of homeostasis, clonal tracing studies based on transgenic animal models have advanced our understanding of cell fate behaviour and its dysregulation in disease (1, 2). But what can be learned from clonal dynamics in development, where the spatial cohesiveness of clones is impaired by tissue deformations during tissue growth? Drawing on the results of clonal tracing studies, we show that, despite the complexity of organ development, clonal dynamics may converge to a critical state characterized by universal scaling behaviour of clone sizes. By mapping clonal dynamics onto a generalization of the classical theory of aerosols, we elucidate the origin and range of scaling behaviours and show how the identification of universal scaling dependences may allow lineage-specific information to be distilled from experiments. Our study shows the emergence of core concepts of statistical physics in an unexpected context, identifying cellular systems as a laboratory to study non-equilibrium statistical physics.
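    As a hedged illustration (not taken from the paper itself) of what universal scaling of clone sizes typically denotes in this literature, one can write the clone-size distribution as depending on clone size only through its ratio to the mean clone size:

```latex
% Hedged sketch, assuming the standard scaling ansatz used in clonal-dynamics
% studies: the probability P_n(t) of observing a clone of size n at time t
% collapses onto a single scaling function F once sizes are rescaled by the
% mean clone size \langle n(t) \rangle.
P_n(t) \simeq \frac{1}{\langle n(t)\rangle}\, F\!\left(\frac{n}{\langle n(t)\rangle}\right)
```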

    flowLearn: Fast and precise identification and quality checking of cell populations in flow cytometry

    Lux M, Brinkman RR, Chauve C, et al. flowLearn: Fast and precise identification and quality checking of cell populations in flow cytometry. Bioinformatics. 2018;34(13):2245-2253. Motivation: Identification of cell populations in flow cytometry is a critical part of the analysis and lays the groundwork for many applications and research discovery. The current paradigm of manual analysis is time consuming and subjective. A common goal of users is to replace manual analysis with automated methods that replicate their results. Supervised tools provide the best performance in such a use case; however, they require fine parameterization to obtain the best results. Hence, there is a strong need for methods that are fast to set up, accurate and interpretable. Results: flowLearn is a semi-supervised approach for the quality-checked identification of cell populations. Using a very small number of manually gated samples, it is able to predict gates on other samples with high accuracy and speed through density alignments. On two state-of-the-art data sets, our tool achieves median F1-measures exceeding 0.99 for 31%, and 0.90 for 80%, of all analyzed populations. Furthermore, users can directly interpret and adjust automated gates on new sample files to iteratively improve the initial training.
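    A simplified sketch of density-based gate transfer in the spirit of the approach described above (not flowLearn's actual algorithm): a one-dimensional gate threshold set on a manually gated reference sample is shifted so that the main density peaks of the reference and target samples line up.

```python
# Simplified illustration of density-based gate transfer (not flowLearn's
# actual algorithm): shift a 1-D gate threshold from a manually gated
# reference sample so that the main density peaks of the two samples align.
import numpy as np
from scipy.stats import gaussian_kde

def transfer_gate(reference, target, reference_threshold):
    grid = np.linspace(
        min(reference.min(), target.min()),
        max(reference.max(), target.max()),
        512,
    )
    ref_peak = grid[np.argmax(gaussian_kde(reference)(grid))]
    tgt_peak = grid[np.argmax(gaussian_kde(target)(grid))]
    return reference_threshold + (tgt_peak - ref_peak)  # shift gate by peak offset

rng = np.random.default_rng(0)
reference = rng.normal(2.0, 0.5, 5000)  # e.g. fluorescence intensity, reference sample
target = rng.normal(2.6, 0.5, 5000)     # same population, shifted in a new sample
print(transfer_gate(reference, target, reference_threshold=3.0))  # roughly 3.6
```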

    BlastR—fast and accurate database searches for non-coding RNAs

    We present and validate BlastR, a method for efficiently and accurately searching non-coding RNAs. Our approach relies on the comparison of di-nucleotides using BlosumR, a new log-odd substitution matrix. In order to use BlosumR for comparison, we recode RNA sequences into protein-like sequences. We then show that BlosumR can be used along with the BlastP algorithm in order to search non-coding RNA sequences. Using Rfam as a gold standard, we benchmark this approach and show BlastR to be more sensitive than BlastN. We also show that BlastR is both faster and more sensitive than BlastP used with a single-nucleotide log-odd substitution matrix. BlastR, when used in combination with WU-BlastP, is about 5% more accurate than WU-BlastN and about 50 times slower. The approach shown here is equally effective when combined with the NCBI-Blast package. The software is open-source freeware available from www.tcoffee.org/blastr.html.
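    A hedged sketch of the recoding idea described above: each of the 16 possible dinucleotides is mapped to a distinct amino-acid-like letter so that the recoded sequence can be searched with protein tools such as BlastP; the particular letter assignment and the use of overlapping dinucleotides are illustrative assumptions, not the published BlosumR construction.

```python
# Hedged sketch of dinucleotide recoding: map each of the 16 dinucleotides to
# a distinct amino-acid-like letter so the result can be searched with protein
# tools. The letter assignment and overlapping windows are assumptions, not
# the published BlosumR mapping.
from itertools import product

AMINO_LETTERS = "ACDEFGHIKLMNPQRSTVWY"  # 20 letters available; only 16 are needed
DINUC_TO_LETTER = {
    "".join(pair): AMINO_LETTERS[i]
    for i, pair in enumerate(product("ACGU", repeat=2))
}

def recode_rna(rna):
    rna = rna.upper().replace("T", "U")
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    return "".join(DINUC_TO_LETTER[rna[i:i + 2]] for i in range(len(rna) - 1))

print(recode_rna("GCAUGC"))  # protein-like string, one letter per dinucleotide
```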