131 research outputs found

    Evaluating Pretrained Transformer-based Models on the Task of Fine-Grained Named Entity Recognition

    Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task and has remained an active research field. In recent years, transformer models, and more specifically the BERT model developed at Google, revolutionised the field of NLP. While the performance of transformer-based approaches such as BERT has been studied for NER, there has not yet been a study for the fine-grained Named Entity Recognition (FG-NER) task. In this paper, we compare three transformer-based models (BERT, RoBERTa, and XLNet) to two non-transformer-based models (CRF and BiLSTM-CNN-CRF). Furthermore, we apply each model to a multitude of distinct domains. We find that transformer-based models incrementally outperform the studied non-transformer-based models in most domains with respect to the F1 score. Furthermore, we find that the choice of domain significantly influences performance, regardless of the respective data size or the model chosen.
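    A minimal sketch of how such a transformer-based NER model can be fine-tuned, using the Hugging Face Transformers library; the fine-grained label set, example sentence, and hyperparameters are placeholders, not the paper's actual setup or data.

```python
# Minimal sketch (not the authors' setup): fine-tuning a BERT token classifier
# for fine-grained NER with Hugging Face Transformers. Labels and data below
# are hypothetical placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

labels = ["O", "B-PERSON", "I-PERSON", "B-COMPANY", "I-COMPANY"]  # hypothetical fine-grained tags
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels)
)

tokens = ["Angela", "Merkel", "visited", "Siemens", "."]
word_labels = [1, 2, 0, 3, 0]  # indices into `labels`

# Tokenize with word alignment so each sub-word token inherits its word's label.
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
aligned = [
    -100 if wid is None else word_labels[wid]  # -100 = ignored by the loss
    for wid in enc.word_ids(batch_index=0)
]

outputs = model(**enc, labels=torch.tensor([aligned]))
outputs.loss.backward()  # gradient for one training step; F1 evaluation omitted
```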

    Enhancing Text-to-SQL Translation for Financial System Design

    Text-to-SQL, the task of translating natural language questions into SQL queries, is part of various business processes. Its automation, an emerging challenge, will empower software practitioners to seamlessly interact with relational databases using natural language, thereby bridging the gap between business needs and software capabilities. In this paper, we consider Large Language Models (LLMs), which have achieved state-of-the-art results on various NLP tasks. Specifically, we benchmark Text-to-SQL performance, the evaluation methodologies, as well as input optimization (e.g., prompting). In light of our empirical observations, we propose two novel metrics designed to adequately measure the similarity between SQL queries. Overall, we share with the community various findings, notably on how to select the right LLM for Text-to-SQL tasks. We further demonstrate that a tree-based edit distance constitutes a reliable metric for assessing the similarity between generated SQL queries and the oracle queries when benchmarking Text-to-SQL approaches. This metric is important because it relieves researchers from computationally expensive experiments, such as executing the generated queries, as done in prior work. Our work implements financial-domain use cases and therefore contributes to the advancement of Text-to-SQL systems and their practical adoption in this domain.
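    As an illustration of the tree-based edit distance mentioned above, the following hedged sketch parses two queries into syntax trees with sqlglot and compares them with the Zhang-Shasha algorithm from the zss package; the choice of libraries, node labels, and example queries are assumptions, not the paper's implementation.

```python
# Hedged sketch of a tree-based edit distance between SQL queries: parse each
# query into a syntax tree with sqlglot, then compare the trees with the
# Zhang-Shasha algorithm from the `zss` package. Illustrative only.
import sqlglot
import zss

def children(node):
    # Flatten sqlglot's argument dict into a list of child expression nodes.
    out = []
    for value in node.args.values():
        items = value if isinstance(value, list) else [value]
        out.extend(v for v in items if isinstance(v, sqlglot.exp.Expression))
    return out

def label(node):
    # Use the node type plus any identifier/literal text as the node label.
    return f"{node.key}:{node.name}" if node.name else node.key

q1 = sqlglot.parse_one("SELECT name FROM clients WHERE balance > 1000")
q2 = sqlglot.parse_one("SELECT name FROM clients WHERE balance >= 1000")

distance = zss.simple_distance(q1, q2, get_children=children, get_label=label)
print(distance)  # small distance: the trees differ only in the comparison operator
```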

    Highly phosphorescent perfect green emitting iridium(III) complex for application in OLEDs

    A novel iridium complex, [bis-(2-phenylpyridine)(2-carboxy-4-dimethylaminopyridine)iridium(III)] (N984), was synthesized and characterized using spectroscopic and electrochemical methods. A solution-processable OLED device incorporating the N984 complex displays an electroluminescence spectrum with a narrow bandwidth of 70 nm at half of its maximum intensity, with colour coordinates of x = 0.322, y = 0.529, which are very close to those suggested by the PAL standard for a green emitter.

    Evaluating the Impact of Text De-Identification on Downstream NLP Tasks

    Data anonymisation is often required to comply with regulations when transferring information across departments or entities. However, the risk is that this procedure distorts the data and jeopardises the models built on it. Intuitively, training an NLP model on anonymised data may lower the performance of the resulting model compared to a model trained on non-anonymised data. In this paper, we investigate the impact of de-identification on the performance of nine downstream NLP tasks. We focus on the anonymisation and pseudonymisation of personal names and compare six different anonymisation strategies for two state-of-the-art pre-trained models. Based on these experiments, we formulate recommendations on how de-identification should be performed to guarantee accurate NLP models. Our results reveal that de-identification does have a negative impact on the performance of NLP models, but this impact is relatively low. We also find that using pseudonymisation techniques involving random names leads to better performance across most tasks.
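    A minimal sketch of the random-name pseudonymisation strategy evaluated above: every detected personal name is consistently mapped to a substitute drawn from a name pool. The name pool, the example text, and the assumption that name spans have already been detected (by an NER model in practice) are illustrative, not the paper's pipeline.

```python
# Minimal sketch of pseudonymisation with random names: each detected personal
# name is consistently replaced by a substitute drawn from a name pool.
# Detection of the name spans (an NER model in practice) is assumed done.
import random

NAME_POOL = ["Alex Morgan", "Jamie Lee", "Robin Patel", "Sam Keller"]  # illustrative pool

def pseudonymise(text, detected_names, seed=0):
    rng = random.Random(seed)
    mapping = {}
    for name in detected_names:
        if name not in mapping:
            mapping[name] = rng.choice(NAME_POOL)  # same person -> same pseudonym
    for original, substitute in mapping.items():
        text = text.replace(original, substitute)
    return text

doc = "John Smith approved the transfer requested by Marie Curie."
print(pseudonymise(doc, ["John Smith", "Marie Curie"]))
```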

    Universality of clone dynamics during tissue development.

    The emergence of complex organs is driven by the coordinated proliferation, migration and differentiation of precursor cells. The fate behaviour of these cells is reflected in the time evolution of their progeny, termed clones, which serve as a key experimental observable. In adult tissues, where cell dynamics is constrained by the condition of homeostasis, clonal tracing studies based on transgenic animal models have advanced our understanding of cell fate behaviour and its dysregulation in disease (1, 2). But what can be learned from clonal dynamics in development, where the spatial cohesiveness of clones is impaired by tissue deformations during tissue growth? Drawing on the results of clonal tracing studies, we show that, despite the complexity of organ development, clonal dynamics may converge to a critical state characterized by universal scaling behaviour of clone sizes. By mapping clonal dynamics onto a generalization of the classical theory of aerosols, we elucidate the origin and range of scaling behaviours and show how the identification of universal scaling dependences may allow lineage-specific information to be distilled from experiments. Our study shows the emergence of core concepts of statistical physics in an unexpected context, identifying cellular systems as a laboratory to study non-equilibrium statistical physics.
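    As a hedged illustration (not taken from the paper itself) of what universal scaling of clone sizes typically denotes in this literature, one can write the clone-size distribution as depending on clone size only through its ratio to the mean clone size:

```latex
% Hedged sketch, assuming the standard scaling ansatz used in clonal-dynamics
% studies: the probability P_n(t) of observing a clone of size n at time t
% collapses onto a single scaling function F once sizes are rescaled by the
% mean clone size \langle n(t) \rangle.
P_n(t) \simeq \frac{1}{\langle n(t)\rangle}\, F\!\left(\frac{n}{\langle n(t)\rangle}\right)
```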

    flowLearn: Fast and precise identification and quality checking of cell populations in flow cytometry

    Lux M, Brinkman RR, Chauve C, et al. flowLearn: Fast and precise identification and quality checking of cell populations in flow cytometry. Bioinformatics. 2018;34(13):2245-2253. Motivation: Identification of cell populations in flow cytometry is a critical part of the analysis and lays the groundwork for many applications and research discovery. The current paradigm of manual analysis is time consuming and subjective. A common goal of users is to replace manual analysis with automated methods that replicate their results. Supervised tools provide the best performance in such a use case; however, they require fine parameterization to obtain the best results. Hence, there is a strong need for methods that are fast to set up, accurate and interpretable. Results: flowLearn is a semi-supervised approach for the quality-checked identification of cell populations. Using a very small number of manually gated samples, it is able to predict gates on other samples with high accuracy and speed through density alignments. On two state-of-the-art data sets, our tool achieves median F1-measures exceeding 0.99 for 31%, and 0.90 for 80%, of all analyzed populations. Furthermore, users can directly interpret and adjust automated gates on new sample files to iteratively improve the initial training.
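    A simplified sketch of density-based gate transfer in the spirit of the approach described above (not flowLearn's actual algorithm): a one-dimensional gate threshold set on a manually gated reference sample is shifted so that the main density peaks of the reference and target samples line up.

```python
# Simplified illustration of density-based gate transfer (not flowLearn's
# actual algorithm): shift a 1-D gate threshold from a manually gated
# reference sample so that the main density peaks of the two samples align.
import numpy as np
from scipy.stats import gaussian_kde

def transfer_gate(reference, target, reference_threshold):
    grid = np.linspace(
        min(reference.min(), target.min()),
        max(reference.max(), target.max()),
        512,
    )
    ref_peak = grid[np.argmax(gaussian_kde(reference)(grid))]
    tgt_peak = grid[np.argmax(gaussian_kde(target)(grid))]
    return reference_threshold + (tgt_peak - ref_peak)  # shift gate by peak offset

rng = np.random.default_rng(0)
reference = rng.normal(2.0, 0.5, 5000)  # e.g. fluorescence intensity, reference sample
target = rng.normal(2.6, 0.5, 5000)     # same population, shifted in a new sample
print(transfer_gate(reference, target, reference_threshold=3.0))  # roughly 3.6
```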

    BlastR—fast and accurate database searches for non-coding RNAs

    We present and validate BlastR, a method for efficiently and accurately searching non-coding RNAs. Our approach relies on the comparison of di-nucleotides using BlosumR, a new log-odd substitution matrix. In order to use BlosumR for comparison, we recode RNA sequences into protein-like sequences. We then show that BlosumR can be used along with the BlastP algorithm in order to search non-coding RNA sequences. Using Rfam as a gold standard, we benchmark this approach and show BlastR to be more sensitive than BlastN. We also show that BlastR is both faster and more sensitive than BlastP used with a single-nucleotide log-odd substitution matrix. BlastR, when used in combination with WU-BlastP, is about 5% more accurate than WU-BlastN and about 50 times slower. The approach shown here is equally effective when combined with the NCBI-Blast package. The software is open-source freeware available from www.tcoffee.org/blastr.html.
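    A hedged sketch of the recoding idea described above: each of the 16 possible dinucleotides is mapped to a distinct amino-acid-like letter so that the recoded sequence can be searched with protein tools such as BlastP; the particular letter assignment and the use of overlapping dinucleotides are illustrative assumptions, not the published BlosumR construction.

```python
# Hedged sketch of dinucleotide recoding: map each of the 16 dinucleotides to
# a distinct amino-acid-like letter so the result can be searched with protein
# tools. The letter assignment and overlapping windows are assumptions, not
# the published BlosumR mapping.
from itertools import product

AMINO_LETTERS = "ACDEFGHIKLMNPQRSTVWY"  # 20 letters available; only 16 are needed
DINUC_TO_LETTER = {
    "".join(pair): AMINO_LETTERS[i]
    for i, pair in enumerate(product("ACGU", repeat=2))
}

def recode_rna(rna):
    rna = rna.upper().replace("T", "U")
    # Overlapping dinucleotides: positions (0,1), (1,2), ...
    return "".join(DINUC_TO_LETTER[rna[i:i + 2]] for i in range(len(rna) - 1))

print(recode_rna("GCAUGC"))  # protein-like string, one letter per dinucleotide
```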