Proposal for a new model: towards greater transparency and independence of sovereign rating agencies
Following the subprime and Greek debt crises, bond markets have been completely upended. Today, a large part of the world is over-indebted, leaving the fiscal policies of the countries concerned little room for manoeuvre. It is by borrowing so heavily that these countries have become dependent on the supply-and-demand mechanism that sets the interest rates on their borrowings. The sovereign credit risk rated by the agencies therefore plays a major role in this process, since a risky country will pay higher interest than a healthy economy. This work proposes an analytical approach to the rating agency system. After introducing the framework in which the agencies operate, we argue that numerous inconsistencies persist in this model. First, the regulations in place today make the agencies an essential actor of the current system, with considerable power over the financing rates of sovereign states. Second, the current market situation concentrates more than 95% of the market among only three agencies, whose primary objective is to make a profit. Finally, we argue that the present organization leaves too much room for potential conflicts of interest within the rating processes, and that the agencies' communication is generally very opaque, when it is not approximate or even vague. Starting from the principle that the interest rates paid by sovereign entities are a matter of public interest and that rating agencies have an impact on them, this work puts forward four recommendations aimed at making credit ratings safer, more objective, and less volatile. We first recommend establishing a monopoly whose sole actor would be an objective, neutral, and independent rating agency. We then stress the need to make all sovereign rating activity fully transparent. Finally, we believe it would be judicious to also take into account substitute credit-risk indicators and to adapt the current rating scale to make ratings "smoother" via a numeric scale with continuous increments.
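To picture the proposed numeric scale with continuous increments, here is a minimal Python sketch; the letter-grade anchor values and the probability-to-score mapping are illustrative assumptions, not taken from the thesis.

    # Illustrative only: anchor values and the mapping are assumptions.
    LETTER_ANCHORS = {
        "AAA": 100, "AA": 90, "A": 80, "BBB": 70, "BB": 60,
        "B": 50, "CCC": 40, "CC": 30, "C": 20, "D": 0,
    }

    def continuous_rating(estimated_default_probability: float) -> float:
        """Map an estimated default probability in [0, 1] to a smooth 0-100
        score, avoiding the cliff effects of discrete letter-grade downgrades."""
        if not 0.0 <= estimated_default_probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        return 100.0 * (1.0 - estimated_default_probability)

    print(continuous_rating(0.03))  # 97.0, i.e. between the AAA and AA anchors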
Named entity recognition in chemical patents using ensemble of contextual language models
Chemical patent documents describe a broad range of applications holding key reaction and compound information, such as chemical structure, reaction formulas, and molecular properties. These informational entities must first be identified in text passages before they can be used in downstream tasks. Text mining provides means to extract relevant information from chemical patents through information extraction techniques. As part of the Information Extraction task of the Cheminformatics Elsevier Melbourne University challenge, in this work we study the effectiveness of contextualized language models to extract reaction information in chemical patents. We assess transformer architectures trained on generic and specialised corpora to propose a new ensemble model. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%. The results show that ensembles of contextualized language models can provide an effective method to extract information from chemical patents.
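As a rough illustration of the majority ensemble idea (a sketch under assumptions, not the challenge submission itself), token-level voting over BIO tag sequences produced by several fine-tuned models could look like this in Python:

    from collections import Counter

    def majority_vote(tag_sequences):
        """Token-level majority vote over equal-length BIO tag sequences,
        one sequence per model; ties favour the first model's prediction."""
        n_tokens = len(tag_sequences[0])
        assert all(len(seq) == n_tokens for seq in tag_sequences)
        voted = []
        for i in range(n_tokens):
            counts = Counter(seq[i] for seq in tag_sequences)
            best = max(counts.items(),
                       key=lambda t: (t[1], t[0] == tag_sequences[0][i]))[0]
            voted.append(best)
        return voted

    # Hypothetical example: three models tagging the span "sodium chloride".
    preds = [["B-CHEM", "I-CHEM"], ["B-CHEM", "O"], ["B-CHEM", "I-CHEM"]]
    print(majority_vote(preds))  # ['B-CHEM', 'I-CHEM']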
Multilingual RECIST classification of radiology reports using supervised learning.
OBJECTIVES
This study explores Artificial Intelligence and Natural Language Processing techniques to support the automatic assignment of the four Response Evaluation Criteria in Solid Tumors (RECIST) scales based on radiology reports. We also aim to evaluate how the languages and institutional specificities of Swiss teaching hospitals are likely to affect classification quality in French and German.
METHODS
In our approach, 7 machine learning methods were evaluated to establish a strong baseline. Then, robust models were built, fine-tuned according to the language (French and German), and compared with expert annotations.
RESULTS
The best strategies yield average F1-scores of 90% and 86%, respectively, for the 2-class (Progressive/Non-progressive) and the 4-class (Progressive Disease, Stable Disease, Partial Response, Complete Response) RECIST classification tasks.
CONCLUSIONS
These results are competitive with manual labeling, as measured by the Matthews correlation coefficient and Cohen's kappa (79% and 76%). On this basis, we confirm the capacity of specific models to generalize to new, unseen data, and we assess the impact of using Pre-trained Language Models (PLMs) on the accuracy of the classifiers.
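As the individual baseline methods are not detailed in the abstract, the following is only a minimal sketch of one plausible baseline, assuming TF-IDF features, a linear classifier, and the agreement metrics mentioned above; load_reports() is a hypothetical helper.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score, f1_score, matthews_corrcoef
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Hypothetical helper returning report texts and 4-class RECIST labels
    # (PD, SD, PR, CR); not part of the study's actual tooling.
    reports, labels = load_reports()

    X_train, X_test, y_train, y_test = train_test_split(
        reports, labels, test_size=0.2, stratify=labels, random_state=42)

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("macro F1:", f1_score(y_test, y_pred, average="macro"))
    print("MCC:     ", matthews_corrcoef(y_test, y_pred))
    print("kappa:   ", cohen_kappa_score(y_test, y_pred))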
Algorithmic methods to explore the automation of the appraisal of structured and unstructured digital data
This paper describes interdisciplinary and innovative research conducted in Switzerland, at the Geneva School of Business Administration HES-SO, and supported by the State Archives of Neuchâtel (Office des archives de l'État de Neuchâtel, OAEN). The problem to be addressed is one of the most classical ones: how to extract and discriminate relevant data in a huge amount of diversified and complex data record formats and contents. The goal of this study is to provide a framework and a proof of concept for software that helps in taking defensible decisions on the retention and disposal of records and data proposed to the OAEN. For this purpose, the authors designed the study along two axes: an archival axis, proposing archival metrics for the appraisal of structured and unstructured data, and a data mining axis, proposing algorithmic methods as complementary and/or additional metrics for the appraisal process.
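One way to picture how archival metrics and data-mining signals could be combined in the appraisal process is sketched below; the metric names, weights, and threshold are hypothetical, not taken from the study.

    from dataclasses import dataclass

    @dataclass
    class RecordFeatures:
        legal_value: float        # archival axis: regulatory retention needs
        evidential_value: float   # archival axis: documents a decision/right
        duplication: float        # data-mining axis: near-duplicate ratio
        topic_salience: float     # data-mining axis: salience via clustering

    def retention_score(f: RecordFeatures) -> float:
        """Weighted appraisal score in [0, 1]; weights are illustrative."""
        return (0.35 * f.legal_value + 0.35 * f.evidential_value
                + 0.20 * f.topic_salience + 0.10 * (1.0 - f.duplication))

    keep = retention_score(RecordFeatures(0.9, 0.7, 0.2, 0.5)) >= 0.5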
SIB text mining at TREC 2020 deep learning track
This second campaign of the TREC Deep Learning Track was an opportunity for us to experiment with deep neural language model reranking techniques in a realistic use case. This year's tasks were the same as in the previous edition: (1) building a reranking system and (2) building an end-to-end retrieval system. Both tasks could be completed on both a document and a passage collection. In this paper, we describe how we coupled Anserini's information retrieval toolkit with a BERT-based classifier to build a state-of-the-art end-to-end retrieval system. Our only submission, which is based on a RoBERTa-large pretrained model, achieves for (1) an ndcg@10 of 0.6558 and 0.6295 for passages and documents, respectively, and for (2) an ndcg@10 of 0.6614 and 0.6404 for passages and documents, respectively.
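The second-stage reranking step can be sketched as follows, with the first-stage candidates assumed to come from an Anserini BM25 run and a small public MS MARCO cross-encoder standing in for the RoBERTa-large reranker; this is an illustration of the pattern, not the submitted system.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    def rerank(query, candidates, top_k=10,
               model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
        """Rerank (doc_id, text) candidates from a first-stage BM25 run
        using a transformer cross-encoder relevance score."""
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name)
        model.eval()
        texts = [text for _, text in candidates]
        enc = tokenizer([query] * len(texts), texts, padding=True,
                        truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            scores = model(**enc).logits.squeeze(-1)  # one logit per pair
        order = scores.argsort(descending=True)[:top_k].tolist()
        return [(candidates[i][0], float(scores[i])) for i in order]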
Ensemble of deep masked language models for effective named entity recognition in health and life science corpora
The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and their ensembles perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
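The entity-property analysis can be approximated with a small sketch, assuming gold and predicted entities are represented as sets of (doc_id, start, end, type) spans; this mirrors the kind of breakdown described above rather than the authors' exact code.

    from collections import defaultdict

    def f1_by_entity_length(gold, predicted):
        """Entity-level F1 grouped by span length (end - start), given sets
        of (doc_id, start, end, type) tuples for gold and predicted spans."""
        tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
        for span in predicted:
            length = span[2] - span[1]
            (tp if span in gold else fp)[length] += 1
        for span in gold - predicted:
            fn[span[2] - span[1]] += 1
        report = {}
        for length in sorted(set(tp) | set(fp) | set(fn)):
            p = tp[length] / (tp[length] + fp[length]) if tp[length] + fp[length] else 0.0
            r = tp[length] / (tp[length] + fn[length]) if tp[length] + fn[length] else 0.0
            report[length] = 2 * p * r / (p + r) if p + r else 0.0
        return report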
Classification of hierarchical text using geometric deep learning
We consider the hierarchical representation of documents as graphs and use geometric deep learning to classify them into different categories. While graph neural networks can efficiently handle the variable structure of hierarchical documents using permutation-invariant message passing operations, we show that we can gain extra performance improvements using our proposed selective graph pooling operation, which exploits the fact that some parts of the hierarchy are invariable across different documents. We applied our model to classify clinical trial (CT) protocols into completed and terminated categories. We use bag-of-words-based as well as pre-trained transformer-based embeddings to featurize the graph nodes, achieving F1-scores of approximately 0.85 on a publicly available large-scale CT registry of around 360K protocols. We further demonstrate how the selective pooling can add insights into the CT termination status prediction. We make the source code and dataset splits accessible.
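The selective pooling idea can be sketched as follows (an interpretation of the description above, not the authors' exact operator), assuming each node carries the id of the hierarchy section it belongs to:

    import torch

    def selective_graph_pooling(node_embeddings, section_ids, selected_sections):
        """Mean-pool message-passing outputs separately for a fixed set of
        hierarchy sections that recur across documents, then concatenate
        the per-section vectors into one fixed-size graph representation.

        node_embeddings:   (num_nodes, dim) float tensor
        section_ids:       (num_nodes,) long tensor
        selected_sections: ids of sections shared across all documents
        """
        pooled = []
        for s in selected_sections:
            mask = section_ids == s
            if mask.any():
                pooled.append(node_embeddings[mask].mean(dim=0))
            else:  # section absent from this document: zero placeholder
                pooled.append(torch.zeros(node_embeddings.size(1)))
        return torch.cat(pooled)  # fed to the classification head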
BiTeM at WNUT 2020 shared task-1: named entity recognition over wet lab protocols using an ensemble of contextual language models
Recent improvements in machine-reading technologies have attracted much attention to automation problems and their possibilities. In this context, WNUT 2020 introduced a Named Entity Recognition (NER) task based on wet laboratory procedures. In this paper, we present a 3-step method based on deep neural language models that reported the best overall exact-match F1-score (77.99%) of the competition. By fine-tuning 10 different pretrained language models 10 times each, this work shows the advantage of having more models in an ensemble based on a majority voting strategy. On top of that, having 100 different models allowed us to analyse ensemble combinations, demonstrating the impact of having multiple pretrained models versus fine-tuning one pretrained model multiple times.
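The combination analysis can be pictured with a self-contained toy sketch; the random tags below stand in for real run outputs, and token accuracy is used as a crude proxy for the task's exact-match F1.

    import random
    from collections import Counter

    # Toy stand-ins for the 100 runs (10 pretrained models x 10 fine-tunings).
    random.seed(0)
    TAGS = ["O", "B-Action", "I-Action"]
    runs = {(m, s): [random.choice(TAGS) for _ in range(50)]
            for m in range(10) for s in range(10)}
    gold = [random.choice(TAGS) for _ in range(50)]

    def vote(sequences):
        """Per-token majority vote across the member runs."""
        return [Counter(col).most_common(1)[0][0] for col in zip(*sequences)]

    def accuracy(pred):  # crude proxy for the shared task's exact-match F1
        return sum(p == g for p, g in zip(pred, gold)) / len(gold)

    # Two contrasting 10-member compositions of the 100 available runs:
    diverse = accuracy(vote([runs[(m, 0)] for m in range(10)]))   # 10 models
    repeated = accuracy(vote([runs[(0, s)] for s in range(10)]))  # 10 seeds
    print(f"model diversity: {diverse:.2f}  seed diversity: {repeated:.2f}")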
UPCLASS: a deep learning-based classifier for UniProtKB entry publications
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for each protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machines by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant share of the publications, and help curators decide whether a publication is relevant for further curation for a protein accession.
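The two-part document idea can be sketched as a two-branch network; the PyTorch sketch below uses illustrative layer sizes not taken from the paper, encoding the evidence-bearing text and the remaining text with separate convolutional branches before merging them.

    import torch
    import torch.nn as nn

    class TwoBranchCNN(nn.Module):
        """Evidence and non-evidence parts flow through separate Conv1d
        branches; their pooled features are concatenated for classification.
        Logits can feed BCEWithLogitsLoss for multi-label category sets."""

        def __init__(self, vocab_size, embed_dim=128, n_filters=64, n_classes=10):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.conv_evidence = nn.Conv1d(embed_dim, n_filters, kernel_size=3)
            self.conv_context = nn.Conv1d(embed_dim, n_filters, kernel_size=3)
            self.classifier = nn.Linear(2 * n_filters, n_classes)

        def encode(self, tokens, conv):
            x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq)
            return torch.relu(conv(x)).max(dim=2).values  # global max pool

        def forward(self, evidence_tokens, context_tokens):
            merged = torch.cat([self.encode(evidence_tokens, self.conv_evidence),
                                self.encode(context_tokens, self.conv_context)],
                               dim=1)
            return self.classifier(merged)  # per-category logits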
SIB Text Mining at TREC 2019 Deep Learning Track: Working Note
The TREC 2019 Deep Learning task aims at studying information retrieval in a large training data regime. It includes two tasks: the document ranking task (1) and the passage ranking task (2). Both tasks had full ranking (a) and reranking (b) subtasks. The SIB Text Mining group participated in the full document ranking subtask (1a). In order to retrieve pertinent documents in the 3.2 million-document corpus, our strategy was two-fold. First, we used a BM25 model to retrieve a subset of documents relevant to a query; we also tried to improve recall by using query expansion. The second step consisted of reranking the retrieved subset using an original model called query2doc. This model, designed to predict whether a query-document pair is a good candidate to be ranked in position #1, was trained using the training dataset provided for the task. Our baseline, which is basically a BM25 ranking, performed best and achieved a MAP of 0.2892. Results of the query2doc run clearly indicate that the query2doc model could not learn any meaningful relationship. To explain such a failure, we hypothesize that using documents returned by our baseline model as negative items confused our model. As future steps, it will be interesting to take into account features such as the document's BM25 score, as well as the number of times a document's URL is mentioned in the corpus, and to use them along with learning-to-rank algorithms.
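The hypothesized failure mode can be made concrete with a sketch of how such query2doc training pairs might be built (an interpretation of the description above, not the authors' code):

    import random

    def build_query2doc_pairs(qrels, bm25_runs, n_negatives=4):
        """Label 1 for the judged relevant document of each query and 0 for
        documents sampled from the baseline BM25 run.

        qrels:     {query_id: relevant_doc_id}
        bm25_runs: {query_id: ranked list of doc_ids from the BM25 baseline}
        """
        pairs = []
        for qid, rel_doc in qrels.items():
            pairs.append((qid, rel_doc, 1))
            # The pitfall noted above: BM25-retrieved documents are often
            # topically relevant, so labeling them as negatives injects
            # label noise and can keep the model from learning a signal.
            candidates = [d for d in bm25_runs[qid] if d != rel_doc]
            for doc in random.sample(candidates, min(n_negatives, len(candidates))):
                pairs.append((qid, doc, 0))
        return pairs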