9 research outputs found
Discovering context-specific relationships from biological literature by using multi-level context terms
<p>Abstract</p> <p>Background</p> <p>Swanson's ABC model is a powerful way to infer hidden relationships buried in the biological literature. However, the model cannot infer relations that depend on context information. In addition, it generates a very large number of candidates from biological text, and it is a semi-automatic, labor-intensive technique requiring a human expert's manual input. To tackle these problems, we incorporate context terms to infer relations between AB interactions and BC interactions.</p> <p>Methods</p> <p>We propose 3 steps to discover meaningful hidden relationships between drugs and diseases: 1) multi-level (gene, drug, disease, symptom) entity recognition, 2) interaction extraction (drug-gene, gene-disease) from literature, and 3) context-vector-based similarity score calculation. Subsequently, we evaluate our hypothesis on a dataset of 77,711 PubMed abstracts related to Alzheimer's disease. The PharmGKB and CTD databases are used as gold standards. Evaluation is conducted in 2 ways: first, by comparing the precision of the proposed method with that of the previous method, and second, by analysing the top 10 ranked results to examine whether highly ranked interactions are truly meaningful.</p> <p>Results</p> <p>The results indicate that context-based relation inference achieved better precision than the previous ABC model approach. The literature analysis also shows that interactions inferred by the context-based approach are more meaningful than those inferred by the previous ABC model.</p> <p>Conclusions</p> <p>We propose a novel interaction inference technique that incorporates context term vectors into the ABC model to discover meaningful hidden relationships. By utilizing multi-level context terms, our model shows better performance than the previous ABC model.</p>
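The context-vector similarity step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the bag-of-words vectors, cosine scoring, and the example sentences are all assumptions for demonstration purposes.

```python
from collections import Counter
import math

def context_vector(sentences):
    """Bag-of-words context vector built from the sentences mentioning an interaction."""
    vec = Counter()
    for s in sentences:
        vec.update(s.lower().split())
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Score an inferred A-C link by comparing the contexts of its A-B and B-C
# interactions; a higher score suggests the two interactions share a context.
ab_ctx = context_vector(["donepezil inhibits ACHE in alzheimer patients"])
bc_ctx = context_vector(["ACHE activity is elevated in alzheimer disease"])
score = cosine(ab_ctx, bc_ctx)
```

Candidate A-C links would then be ranked by this score, so that links whose intermediate interactions occur in similar contexts rise to the top.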
Processing SPARQL queries with regular expressions in RDF databases
Background: As the Resource Description Framework (RDF) data model is widely used for modeling and sharing many online bioinformatics resources such as Uniprot (dev.isb-sib.ch/projects/uniprot-rdf) or Bio2RDF (bio2rdf.org), SPARQL - the W3C-recommended query language for RDF databases - has become an important language for querying bioinformatics knowledge bases. Moreover, given the diversity of users' requests for extracting information from RDF data, as well as users' incomplete knowledge of the exact value of each fact in the RDF databases, it is desirable to support SPARQL queries with regular expression patterns over RDF data. To the best of our knowledge, there is currently no work that efficiently supports regular expression processing in SPARQL over RDF databases. Most existing techniques for processing regular expressions are designed for querying a text corpus, or only support matching over the paths in an RDF graph.
Results: In this paper, we propose a novel framework for supporting regular expression processing in SPARQL queries. Our contributions can be summarized as follows. 1) We propose an efficient framework for processing SPARQL queries with regular expression patterns in RDF databases. 2) We propose a cost model to integrate the proposed framework into existing query optimizers. 3) We build a prototype of the proposed framework in C++ and conduct extensive experiments demonstrating the efficiency and effectiveness of our technique.
Conclusions: Experiments with a full-blown RDF engine show that our framework outperforms the existing ones by up to two orders of magnitude in processing SPARQL queries with regular expression patterns.
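To make the query pattern concrete, the sketch below evaluates a SPARQL-style `FILTER regex` naively over an in-memory list of triples. The triple data and helper function are hypothetical; this naive scan is exactly the baseline that an engine with indexes and a cost model, like the one the paper proposes, must beat.

```python
import re

# Toy triple store: (subject, predicate, object) tuples (hypothetical data).
triples = [
    ("ex:p1", "ex:name", "insulin receptor"),
    ("ex:p2", "ex:name", "hemoglobin"),
    ("ex:p3", "ex:name", "insulin-like growth factor"),
]

def filter_regex(triples, predicate, pattern):
    """Naive evaluation of a SPARQL-style FILTER regex over object values:
    scan every triple and keep those whose object matches the pattern."""
    rx = re.compile(pattern)
    return [(s, o) for s, p, o in triples if p == predicate and rx.search(o)]

# Equivalent in spirit to:
#   SELECT ?s ?name WHERE { ?s ex:name ?name . FILTER regex(?name, "^insulin") }
matches = filter_regex(triples, "ex:name", "^insulin")
```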
On the Use of Parsing for Named Entity Recognition
[Abstract] Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task.
Funding: This work has been funded by MINECO, AEI and FEDER of UE through the ANSWER-ASAP project (TIN2017-85160-C2-1-R), and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as a Research Center of the Galician University System, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER) with 80% (Galicia ERDF 2014-20 Operational Programme), and the remaining 20% from the Secretaría Xeral de Universidades (Ref. ED431G 2019/01). Carlos Gómez-Rodríguez has also received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (FASTPARSE, Grant No. 714150).
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
Funding: A.V. and M.K. acknowledge funding from the European Community's Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by the Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia) and FEDER (European Union), and by the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of the UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
The Detection of Contradictory Claims in Biomedical Abstracts
Research claims in the biomedical domain are not always consistent, and may even be contradictory. This thesis explores contradictions between research claims in order to determine whether or not it is possible to develop a solution to automate the detection of such phenomena. Such a solution will help decision-makers, including researchers, to alleviate the effects of contradictory claims on their decisions.
This study develops two methodologies to construct corpora of contradictions. The first methodology utilises systematic reviews to construct a manually annotated corpus of contradictions. The second methodology constructs a corpus of contradictions without relying on human annotation, and is proposed to overcome the limitations of the manual annotation approach.
Moreover, this thesis proposes a pipeline to detect contradictions in abstracts. The pipeline takes a question and a list of research abstracts which may contain answers to it. The output of the pipeline is a list of sentences extracted from the abstracts which answer the question, where each sentence is annotated with an assertion value with respect to the question. Claims which feature opposing assertion values are considered potentially contradictory.
The research demonstrates that automating the detection of contradictory claims in research abstracts is a feasible problem.
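The final step of the pipeline described above, pairing claims with opposing assertion values, can be sketched directly. The claims, assertion labels, and helper function below are illustrative assumptions, not the thesis's actual data or code.

```python
# Claims answering the same question, each annotated with an assertion value.
claims = [
    ("Vitamin E reduces cardiovascular risk.", "positive"),
    ("Vitamin E supplementation showed no effect on cardiovascular risk.", "negative"),
    ("Vitamin E may lower cardiovascular risk.", "positive"),
]

def contradictory_pairs(claims):
    """Pair up claims whose assertion values oppose each other; such pairs are
    flagged as potentially contradictory."""
    return [(a, b) for i, (a, va) in enumerate(claims)
                   for b, vb in claims[i + 1:]
                   if {va, vb} == {"positive", "negative"}]

pairs = contradictory_pairs(claims)
```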
In Search of a Common Thread: Enhancing the LBD Workflow with a view to its Widespread Applicability
Literature-Based Discovery (LBD) research focuses on discovering implicit knowledge linkages in existing scientific literature to provide impetus to innovation and research productivity. Despite significant advancements in LBD research, previous studies contain several open problems and shortcomings that are hindering its progress. The overarching goal of this thesis is to address these issues, not only to enhance the discovery component of LBD, but also to shed light on new directions that can further strengthen the existing understanding of the LBD workflow. In accordance with this goal, the thesis aims to enhance the LBD workflow with a view to ensuring its widespread applicability.
The goal of widespread applicability is twofold. Firstly, it relates to the adaptability of the proposed solutions to a diverse range of problem settings. These problem settings are not necessarily application areas that are closely related to the LBD context, but could include a wide range of problems beyond the typical scope of LBD, which has traditionally been applied to scientific literature. Adapting the LBD workflow to problems outside the typical scope of LBD is a worthwhile goal, since the intrinsic objective of LBD research, which is discovering novel linkages in text corpora, is valid across a vast range of problem settings.
Secondly, the idea of widespread applicability also denotes the capability of the proposed solutions to be executed in new environments. These 'new environments' are various academic disciplines (i.e., cross-domain knowledge discovery) and publication languages (i.e., cross-lingual knowledge discovery). The application of LBD models to new environments is timely, since the massive growth of the scientific literature has engendered huge challenges for academics, irrespective of their domain.
This thesis is divided into five main research objectives that address the following topics: literature synthesis, the input component, the discovery component, reusability, and portability. The objective of the literature synthesis is to address the gaps in existing LBD reviews by conducting the first systematic literature review. The input component section aims to provide generalised insights on the suitability of various input types in the LBD workflow, focusing on their role and potential impact on the information retrieval cycle of LBD.
The discovery component section aims to intermingle two research directions that have been under-investigated in the LBD literature, 'modern word embedding techniques' and the 'temporal dimension', by proposing diachronic semantic inferences. Their potential positive influence in knowledge discovery is verified through both direct and indirect uses. The reusability section aims to present a new, distinct viewpoint on these LBD models by verifying their reusability in a timely application area using a methodical reuse plan. The last section, portability, proposes an interdisciplinary LBD framework that can be applied to new environments. While highly cost-efficient and easily pluggable, this framework also gives rise to a new perspective on knowledge discovery through its generalisable capabilities.
Succinctly, this thesis presents novel and distinct viewpoints to accomplish five main research objectives, enhancing the existing understanding of the LBD workflow. The thesis offers new insights which future LBD research could further explore and expand to create more efficient, widely applicable LBD models to enable broader community benefits.
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 202
Automatic Population of Structured Reports from Narrative Pathology Reports
There are a number of advantages to the use of structured pathology reports: they can ensure the accuracy and completeness of pathology reporting, and they make it easier for referring doctors to glean pertinent information. The goal of this thesis is to extract pertinent information from free-text pathology reports, automatically populate structured reports for cancer diseases, and identify the commonalities and differences in processing principles needed to obtain maximum accuracy. Three pathology corpora were annotated with entities and the relationships between them in this study: a melanoma corpus, a colorectal cancer corpus and a lymphoma corpus. A supervised machine-learning-based approach, utilising conditional random field learners, was developed to recognise medical entities in the corpora. Through feature engineering, the best feature configurations were attained, which boosted the F-scores significantly, by 4.2% to 6.8%, on the training sets. Without proper negation and uncertainty detection, the quality of the structured reports would be diminished; negation and uncertainty detection modules were built to handle this problem, and obtained overall F-scores ranging from 76.6% to 91.0% on the test sets. A relation extraction system was presented to extract four relations from the lymphoma corpus. The system achieved very good performance on the training set, with a 100% F-score obtained by the rule-based module and a 97.2% F-score attained by the support vector machine classifier. Rule-based approaches were used to generate the structured outputs and populate predefined templates; the rule-based system attained over 97% F-scores on the training sets. Finally, a pipeline system was implemented assembling all the components described above. It achieved promising results in end-to-end evaluations, with 86.5%, 84.2% and 78.9% F-scores on the melanoma, colorectal cancer and lymphoma test sets respectively.
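A rule-based negation and uncertainty module of the kind the thesis describes can be sketched as cue matching over the sentence containing an extracted finding. The cue lists, example sentence, and function below are hypothetical; the abstract does not give the thesis's actual rules.

```python
# Hypothetical trigger lists; real systems use curated, much longer cue sets.
NEGATION_CUES = ["no evidence of", "negative for", "absence of"]
UNCERTAINTY_CUES = ["suspicious for", "cannot exclude", "possibly"]

def annotate_finding(sentence, finding):
    """Label an extracted finding as negated, uncertain, or affirmed,
    based on cue phrases in its sentence."""
    s = sentence.lower()
    if any(cue in s for cue in NEGATION_CUES):
        return (finding, "negated")
    if any(cue in s for cue in UNCERTAINTY_CUES):
        return (finding, "uncertain")
    return (finding, "affirmed")

label = annotate_finding("No evidence of lymphovascular invasion.", "lymphovascular invasion")
```

Without this step, a structured report would record "lymphovascular invasion" as present, which is why the thesis treats negation and uncertainty detection as essential to report quality.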
Robust Entity Linking in Heterogeneous Domains
Entity Linking is the task of mapping terms in arbitrary documents to entities in a knowledge base by identifying the correct semantic meaning. It is applied in the extraction of structured data in RDF (Resource Description Framework) from textual documents, but equally in facilitating artificial intelligence applications such as Semantic Search, Reasoning, and Question Answering. Most existing Entity Linking systems were optimized for specific domains (e.g., general domain, biomedical domain), knowledge base types (e.g., DBpedia, Wikipedia), or document structures (e.g., tables) and types (e.g., news articles, tweets). This led to very specialized systems that lack robustness and are only applicable to very specific tasks. In this regard, this work focuses on the research and development of an Entity Linking system that is robust across domains, knowledge base types, and document structures and types.
To create a robust Entity Linking system, we first analyze the following three crucial components of an Entity Linking algorithm in terms of robustness criteria: (i) the underlying knowledge base, (ii) the entity relatedness measure, and (iii) the textual context matching technique. Based on the analyzed components, our scientific contributions are three-fold. First, we show that a federated approach leveraging knowledge from various knowledge base types can significantly improve robustness in Entity Linking systems. Second, we propose a new state-of-the-art, robust entity relatedness measure for topical coherence computation based on semantic entity embeddings. Third, we present the neural-network-based approach Doc2Vec as a textual context matching technique for robust Entity Linking.
Based on our previous findings and outcomes, our main contribution in this work is DoSeR (Disambiguation of Semantic Resources). DoSeR is a robust, knowledge-base-agnostic Entity Linking framework that extracts relevant entity information from multiple knowledge bases in a fully automatic way. The integrated algorithm represents a collective, graph-based approach that utilizes semantic entity and document embeddings for entity relatedness and textual context matching computation. Our evaluation shows that DoSeR achieves state-of-the-art results over a wide range of different document structures (e.g., tables), document types (e.g., news documents) and domains (e.g., general domain, biomedical domain). In this context, DoSeR outperforms all other publicly available Entity Linking algorithms on most data sets.
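The embedding-based context matching described above reduces, at its core, to ranking candidate entities by similarity between a mention's context embedding and each candidate's embedding. The sketch below illustrates this with toy vectors; the mention, candidate names, and embedding values are invented for demonstration, whereas DoSeR itself uses learned semantic entity and document embeddings in a collective, graph-based algorithm.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy embedding of the text surrounding an ambiguous mention "Jaguar".
mention_context = [0.9, 0.1, 0.0]

# Toy embeddings for the candidate entities the mention could link to.
candidates = {
    "Jaguar_(animal)": [0.8, 0.2, 0.1],
    "Jaguar_Cars":     [0.1, 0.9, 0.3],
}

# Link the mention to the candidate whose embedding best matches its context.
best = max(candidates, key=lambda e: cosine(mention_context, candidates[e]))
```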
Bioinformatics at the service of genomics
This thesis addresses several areas in which bioinformatics techniques are applied to solve problems arising from the management, analysis, storage and querying of large volumes of genomic data. The main challenges this thesis has sought to address are the following:
- Processing the most basic output of high-throughput genotyping technologies, in order to obtain, quickly and simply, a series of basic parameters and statistics characterising an experiment, independently of the chosen technology.
- Facilitating the publication and querying of low- and medium-scale genotyping results, for both SNPs and STRs, as well as their interaction with publicly accessible variability repositories.
- Studying the feasibility of locally managing an in-house human variability repository based on the available resources, in terms of both external information and internal infrastructure.
- Transferring the knowledge obtained: contributing existing tools or ad hoc solutions to the problems that genetic research may present in the field of computational biology.