Harvesting Entities from the Web Using Unique Identifiers -- IBEX
In this paper we study the prevalence of unique entity identifiers on the
Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs
(for documents), email addresses, and others. We show how these identifiers can
be harvested systematically from Web pages, and how they can be associated with
human-readable names for the entities at large scale.
Starting with a simple extraction of identifiers and names from Web pages, we
show how we can use the properties of unique identifiers to filter out noise
and clean up the extraction result on the entire corpus. The end result is a
database of millions of uniquely identified entities of different types, with
an accuracy of 73--96% and a very high coverage compared to existing knowledge
bases. We use this database to compute novel statistics on the presence of
products, people, and other entities on the Web.
Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A.
Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting
Entities from the Web Using Unique Identifiers. WebDB workshop, 201
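The abstract above mentions using the properties of unique identifiers to filter out noise. One such property, for ISBNs and GTINs, is the trailing check digit. The following is a minimal sketch of a check-digit filter, not the IBEX pipeline itself: the abstract does not state which properties the authors use, and the function names (has_valid_gtin13_check_digit, has_valid_isbn10_check_digit, filter_candidates) are hypothetical.

```python
import re


def _strip_separators(code: str) -> str:
    """Remove spaces and hyphens that commonly appear inside printed identifiers."""
    return re.sub(r"[\s-]", "", code)


def has_valid_gtin13_check_digit(code: str) -> bool:
    """Check a 13-digit GTIN/EAN (which includes ISBN-13) against its check digit."""
    digits = _strip_separators(code)
    if not re.fullmatch(r"\d{13}", digits):
        return False
    # Weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == int(digits[12])


def has_valid_isbn10_check_digit(code: str) -> bool:
    """Check a 10-character ISBN, where the last character may be 'X' (value 10)."""
    chars = _strip_separators(code).upper()
    if not re.fullmatch(r"\d{9}[\dX]", chars):
        return False
    # Weights run 10, 9, ..., 1; the weighted sum must be divisible by 11.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(chars))
    return total % 11 == 0


def filter_candidates(candidates):
    """Keep only candidate strings that pass one of the two checks above."""
    return [
        c for c in candidates
        if has_valid_gtin13_check_digit(c) or has_valid_isbn10_check_digit(c)
    ]
```

A check of this kind only discards strings that cannot be valid identifiers. It does not show that a surviving string refers to a real entity, so it would be at most one step of a larger cleaning procedure.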
Information Extraction from Text for Improving Research on Small Molecules and Histone Modifications
The growing number of publications, in particular in the life sciences, requires efficient methods for automated information extraction and semantic information retrieval. Recognizing and identifying the information-carrying units in text (concept denominations and named entities) relevant to a given domain is a fundamental step. This thesis focuses on the recognition of chemical entities and of a new biological named entity type, histone modifications, both of which are important in drug discovery. Because the emergence of new research fields and the discovery and generation of novel entities go along with the coinage of new terms, continually adapting named entity recognition approaches to new domains is an important step for information extraction. Two methodologies were investigated for this purpose: the state-of-the-art machine learning method Conditional Random Fields (CRF), and an approximate string search method based on dictionaries. Recognition methods that rely on dictionaries depend strongly on the availability and quality of entity terminology collections. For chemical entities, the terminology is distributed over more than 7 publicly available data sources. Merging entries and their accompanying terminology from selected resources enables the generation of a new dictionary of chemical named entities. Combined with automatic processing of the terminology (dictionary curation), the recognition performance reached an F1 measure of 0.54, an improvement of 29 % over the raw dictionary. The highest recall, 0.79, was achieved for the class of TRIVIAL-names. The recognition and identification of chemical named entities is a prerequisite for extracting related pharmacologically relevant information from the literature.
Therefore, lexico-syntactic patterns were defined that support the automated extraction of hypernymic phrases comprising pharmacological function terminology related to chemical compounds. It was shown that 29-50 % of the automatically extracted terms can be proposed for novel functional annotation of chemical entities provided by the reference database DrugBank. Furthermore, they are a basis for building concept hierarchies and ontologies or for extending existing ones. Subsequently, the pharmacological function and biological activity concepts obtained from text were included in a novel descriptor for chemical compounds. Its successful application to the prediction of the pharmacological function of molecules and to the extension of chemical classification schemes, such as the Anatomical Therapeutic Chemical (ATC) classification, is demonstrated. In contrast to chemical entities, no comprehensive terminology resource has been available for histone modifications. Thus, histone modification concept terminology was first recognized in text via CRFs, with an F1 measure of 0.86. Subsequently, linguistic variants of the extracted histone modification terms were mapped to standard representations, which were organized into a newly assembled histone modification hierarchy. The mapping was accomplished by a newly developed term mapping approach described in the thesis. The combination of term recognition and term variant resolution forms a new procedure for assembling novel terminology collections. It supports the generation of a term list that is applicable in dictionary-based methods. For the recognition of histone modifications in text, it could be shown that the dictionary-based named entity recognition method is superior to the machine learning approach used. In conclusion, the present thesis provides techniques that enable an enhanced utilization of textual data, thereby supporting research in epigenomics and drug discovery
Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT
We hypothesize that large language models (LLMs) based on the transformer
architecture can enable automated detection of clinical phenotype terms,
including terms not documented in the HPO. In this study, we developed two
types of models: PhenoBCBERT, a BERT-based model, utilizing Bio+Clinical BERT
as its pre-trained model, and PhenoGPT, a GPT-based model that can be
initialized from diverse GPT models, including open-source versions such as
GPT-J, Falcon, and LLaMA, as well as closed-source versions such as GPT-3 and
GPT-3.5. We compared our methods with PhenoTagger, a recently developed HPO
recognition tool that combines rule-based and deep learning methods. We found
that our methods can extract more phenotype concepts, including novel ones not
characterized by HPO. We also performed case studies on biomedical literature
to illustrate how new phenotype information can be recognized and extracted. We
compared current BERT-based versus GPT-based models for phenotype tagging, in
multiple aspects including model architecture, memory usage, speed, accuracy,
and privacy protection. We also discussed the addition of a negation step and
an HPO normalization layer to the transformer models for improved HPO term
tagging. In conclusion, PhenoBCBERT and PhenoGPT enable the automated discovery
of phenotype terms from clinical notes and biomedical literature, facilitating
automated downstream tasks to derive new biological insights on human diseases
Linking social media, medical literature, and clinical notes using deep learning.
Researchers analyze data, information, and knowledge through many sources, formats, and methods. The dominant data formats are text and images. In the healthcare industry, professionals generate a large quantity of unstructured data. The complexity of this data and the lack of computational power cause delays in analysis. However, with emerging deep learning algorithms and access to computational resources such as graphics processing units (GPUs) and tensor processing units (TPUs), processing text and images is becoming more accessible. Deep learning algorithms achieve remarkable results in natural language processing (NLP) and computer vision. In this study, we focus on NLP in the healthcare industry and collect data not only from electronic medical records (EMRs) but also from medical literature and social media. We propose a framework for linking social media, medical literature, and EMR clinical notes using deep learning algorithms. Connecting data sources requires defining a link between them, and our key is finding concepts in the medical text. The National Library of Medicine (NLM) provides the Unified Medical Language System (UMLS), and we use this system as the foundation of our own system. We recognize social media’s dynamic nature and apply supervised and semi-supervised methodologies to generate concepts. Named entity recognition (NER) allows efficient extraction of information, or entities, from medical literature, and we extend the model to process the EMRs’ clinical notes via transfer learning. The results include an integrated, end-to-end, web-based system solution that unifies social media, literature, and clinical notes, and improves access to medical knowledge for the public and experts
Relation Extraction in underexplored biomedical domains: A diversity-optimised sampling and synthetic data generation approach
The sparsity of labelled data is an obstacle to the development of Relation
Extraction models and the completion of databases in various biomedical areas.
While being of high interest in drug-discovery, the natural-products
literature, reporting the identification of potential bioactive compounds from
organisms, is a concrete example of such an overlooked topic. To mark the start
of this new task, we created the first curated evaluation dataset and extracted
literature items from the LOTUS database to build training sets. To this end,
we developed a new sampler inspired by diversity metrics in ecology, named
Greedy Maximum Entropy sampler, or GME-sampler
(https://github.com/idiap/gme-sampler). The strategic optimization of both
balance and diversity of the selected items in the evaluation set is important
given the resource-intensive nature of manual curation. After quantifying the
noise in the training set, in the form of discrepancies between the input
abstracts text and the expected output labels, we explored different strategies
accordingly. Framing the task as an end-to-end Relation Extraction, we
evaluated the performance of standard fine-tuning as a generative task and
few-shot learning with open Large Language Models (LLaMA 7B-65B). In addition
to their evaluation in few-shot settings, we explore the potential of open
Large Language Models (Vicuna-13B) as synthetic data generators and propose a
new workflow for this purpose. All evaluated models exhibited substantial
improvements when fine-tuned on synthetic abstracts rather than the original
noisy data. We provide our best performing (f1-score=59.0) BioGPT-Large model
for end-to-end RE of natural-products relationships along with all the
generated synthetic data and the evaluation dataset. See more details at
https://github.com/idiap/abroad-re
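The abstract above describes a sampler that greedily optimizes balance and diversity, inspired by diversity metrics in ecology. The following is a minimal sketch of one way such a greedy selection could look, assuming Shannon entropy over category labels as the diversity measure. It is not the GME-sampler implementation, which is available at the repository linked in the abstract and may differ. The function names and the input layout (a mapping from item id to a list of category labels) are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy (natural log) of the distribution given by raw counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def greedy_max_entropy_sample(items: dict, k: int) -> list:
    """Pick up to k item ids, each time adding the item that yields the
    highest entropy of the accumulated label counts.

    items maps an item id to the list of category labels attached to it
    (an assumed layout, for illustration only). Ties are broken by the
    sorted order of the item ids.
    """
    selected = []
    counts = Counter()
    remaining = dict(items)
    while remaining and len(selected) < k:
        best_id = max(
            sorted(remaining),
            key=lambda item_id: shannon_entropy(counts + Counter(remaining[item_id])),
        )
        counts.update(remaining.pop(best_id))
        selected.append(best_id)
    return selected
```

With four items labelled "plant", "plant", "fungus", and "bacterium", a selection of three skips the second "plant" item, because adding a new category raises the entropy more than repeating an existing one. Each step rescans all remaining items, so the cost grows quadratically with the pool size.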
A Semi-Supervised Information Extraction Framework for Large Redundant Corpora
The vast majority of text freely available on the Internet is not available in a form that computers can understand. There have been numerous approaches to automatically extract information from human-readable sources. The most successful attempts rely on vast training sets of data. Others have succeeded in extracting restricted subsets of the available information. These approaches have limited use and require domain knowledge to be coded into the application. The current thesis proposes a novel framework for Information Extraction. From large sets of documents, the system develops statistical models of the data the user wishes to query, which generally avoid the limitations and complexity of most Information Extraction systems. The framework uses a semi-supervised approach to minimize human input. It also eliminates the need for external Named Entity Recognition systems by relying on freely available databases. The final result is a query-answering system which extracts information from large corpora with a high degree of accuracy
De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks
Unstructured textual data are at the heart of health systems: liaison letters
between doctors, operating reports, coding of procedures according to the
ICD-10 standard, etc. The details included in these documents make it possible
to get to know the patient better, to manage him or her better, to better study
the pathologies, to accurately remunerate the associated medical acts... All
this seems to be (at least partially) within reach of today's artificial
intelligence techniques. However, for obvious reasons of privacy protection,
the designers of these AIs do not have the legal right to access these
documents as long as they contain identifying data. De-identifying these
documents, i.e. detecting and deleting all identifying information present in
them, is a legally necessary step for sharing this data between two
complementary worlds. Over the last decade, several proposals have been made to
de-identify documents, mainly in English. While the detection scores are often
high, the substitution methods are often not very robust to attack. In French,
very few methods are based on arbitrary detection and/or substitution rules. In
this paper, we propose a new comprehensive de-identification method dedicated
to French-language medical documents. Both the approach for the detection of
identifying elements (based on deep learning) and their substitution (based on
differential privacy) are based on the most proven existing approaches. The
result is an approach that effectively protects the privacy of the patients at
the heart of these medical documents. The whole approach has been evaluated on
a French language medical dataset of a French public hospital and the results
are very encouraging