Cohort Identification Using Semantic Web Technologies: Ontologies and Triplestores as Engines for Complex Computable Phenotyping
Electronic health record (EHR)-based computable phenotypes are algorithms used to identify individuals or populations with clinical conditions or events of interest within a clinical data repository. Due to a lack of EHR data standardization, computable phenotypes can be semantically ambiguous and difficult to share across institutions. In this research, I propose a new computable phenotyping methodological framework based on semantic web technologies, specifically ontologies, the Resource Description Framework (RDF) data format, triplestores, and Web Ontology Language (OWL) reasoning. My hypothesis is that storing and analyzing clinical data using these technologies can begin to address the critical issues of semantic ambiguity and lack of interoperability in the context of computable phenotyping. To test this hypothesis, I compared the performance of two variants of two computable phenotypes (for depression and rheumatoid arthritis, respectively). The first variant of each phenotype used a list of ICD-10-CM codes to define the condition; the second variant used ontology concepts from SNOMED and the Human Phenotype Ontology (HPO). After executing each variant of each phenotype against a clinical data repository, I compared the patients matched in each case to see where the different variants overlapped and diverged. Both the ontologies and the clinical data were stored in an RDF triplestore to allow me to assess the interoperability advantages of the RDF format for clinical data. All tested methods successfully identified cohorts in the data store, with differing rates of overlap and divergence between variants. Depending on the phenotyping use case, SNOMED and HPO’s ability to more broadly define many conditions due to complex relationships between their concepts may be seen as an advantage or a disadvantage. 
I also found that RDF triplestores do indeed provide interoperability advantages, despite being far less commonly used in clinical data applications than relational databases. Although these methods and technologies are not "one-size-fits-all," the experimental results are encouraging enough for them to (1) be put into practice in combination with existing phenotyping methods or (2) be used on their own for particularly well-suited use cases.

Doctor of Philosophy
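The contrast between the two phenotype variants can be sketched in a few lines of Python. Everything below is hypothetical for illustration, not the author's code: the triples, subclass axioms, diagnosis codes, and patient identifiers are invented, and a simple transitive closure stands in for the OWL reasoning step.

```python
# Toy sketch: a code-list phenotype vs. an ontology-based phenotype over
# clinical data stored as (subject, predicate, object) triples.
# All concepts, codes, and patient IDs are hypothetical.

TRIPLES = {
    # Hypothetical subclass axioms: both concepts are kinds of "Depression".
    ("MajorDepressiveDisorder", "subClassOf", "Depression"),
    ("DepressiveEpisode", "subClassOf", "Depression"),
    # Hypothetical patient diagnoses, coded two different ways.
    ("patient1", "hasDiagnosisCode", "F33.1"),
    ("patient1", "hasDiagnosisConcept", "MajorDepressiveDisorder"),
    ("patient2", "hasDiagnosisConcept", "DepressiveEpisode"),
}

def subclasses(root):
    """Transitive closure of subClassOf, mimicking a reasoner's subsumption."""
    found, frontier = {root}, {root}
    while frontier:
        frontier = {s for (s, p, o) in TRIPLES
                    if p == "subClassOf" and o in frontier} - found
        found |= frontier
    return found

# Variant 1: an explicit ICD-10-CM code list (hypothetical selection).
code_list = {"F33.1", "F32.9"}
cohort_codes = {s for (s, p, o) in TRIPLES
                if p == "hasDiagnosisCode" and o in code_list}

# Variant 2: any concept subsumed by the ontology class "Depression".
depression_concepts = subclasses("Depression")
cohort_ontology = {s for (s, p, o) in TRIPLES
                   if p == "hasDiagnosisConcept" and o in depression_concepts}

print(sorted(cohort_codes))
print(sorted(cohort_ontology))
# Patients matched only by the broader, ontology-driven definition:
print(sorted(cohort_ontology - cohort_codes))
```

The divergence between the two cohorts (here, a patient whose concept is subsumed by the ontology class but whose code is absent from the list) is exactly the overlap/divergence comparison the abstract describes, in miniature.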
Ontop: answering SPARQL queries over relational databases
We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations; a virtual approach to OBDA that avoids materializing triples, implemented through the query rewriting technique; extensive optimizations exploiting all elements of the OBDA architecture; compliance with all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies); and support for all major relational databases.
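The core idea of virtual OBDA can be illustrated with a deliberately tiny sketch: a query over an ontology class is rewritten into SQL over the mapped relational table, so no RDF triples are ever materialized. The table, column, and class names below are invented, and the dictionary mapping is a stand-in for Ontop's actual R2RML machinery, not its API.

```python
# Toy illustration of virtual OBDA query rewriting (hypothetical schema).
# A real system like Ontop uses R2RML mappings and far richer rewriting.

MAPPINGS = {
    # ontology class -> (table, id column, SQL filter)
    "Patient":   ("persons", "person_id", "role = 'patient'"),
    "Clinician": ("persons", "person_id", "role = 'clinician'"),
}

def rewrite_class_query(cls):
    """Rewrite the SPARQL pattern `?x a :<cls>` into a SQL query string."""
    table, id_col, where = MAPPINGS[cls]
    return f"SELECT {id_col} FROM {table} WHERE {where}"

sql = rewrite_class_query("Patient")
print(sql)  # SELECT person_id FROM persons WHERE role = 'patient'
```

The point of the sketch is that the "triples" exist only virtually: membership in the ontology class is computed on demand by the underlying database, which is what lets OBDA systems piggyback on decades of relational optimization.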
Data Quality Assessment and Anomaly Detection Via Map / Reduce and Linked Data: A Case Study in the Medical Domain
Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds the potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors, as further complex domain knowledge is required. To combat this, a semantic-web-based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed that they either cannot scale, or scale inefficiently, to the vast volumes of medical data. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop MapReduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL joining strategies, intelligent data caching and a super-query to enable the completion of eight distinct SPARQL lookups, comprising over eighty distinct joins, in only two MapReduce iterations. Results are presented comparing both Jena and a previous Hadoop implementation, demonstrating the superior performance of the new methodology. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.
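The parallelism the abstract exploits comes from the fact that a SPARQL join is, at heart, a key-based join: the map phase keys each matching triple by the shared variable, and the reduce phase combines bindings that meet on that key. The following single-process sketch imitates that shuffle-and-join shape with hypothetical data and predicate names; it is an illustration of the technique, not the paper's Hadoop implementation.

```python
# Map/reduce-style join of two triple patterns, roughly
#   { ?p :hasReading ?r .  ?r :value ?v }
# keyed on the shared variable ?r. Data and predicates are hypothetical.

from collections import defaultdict

triples = [
    ("patientA", "hasReading", "r1"),
    ("patientB", "hasReading", "r2"),
    ("r1", "value", "98.6"),
    ("r2", "value", "240.0"),
]

# Map phase: emit (join key ?r, tagged binding) for each matching triple.
shuffled = defaultdict(lambda: {"left": [], "right": []})
for s, p, o in triples:
    if p == "hasReading":      # pattern 1 binds ?p and ?r
        shuffled[o]["left"].append(s)
    elif p == "value":         # pattern 2 binds ?r and ?v
        shuffled[s]["right"].append(o)

# Reduce phase: within each key, pair every left binding with every right one.
results = sorted((p, r, v)
                 for r, groups in shuffled.items()
                 for p in groups["left"]
                 for v in groups["right"])
print(results)
```

Because each join key can be reduced independently, this structure scales out across workers, which is why batching many joins into few MapReduce iterations (as the super-query does) pays off.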
Data integration services for biomedical applications (Serviços de integração de dados para aplicações biomédicas)
Doctorate in Informatics (MAP-i)

In the last decades, the field of biomedical science has fostered unprecedented scientific advances. Research is stimulated by the constant evolution of information technology, delivering novel and diverse bioinformatics tools. Nevertheless, the proliferation of new and disconnected solutions has resulted in massive amounts of resources spread over heterogeneous and distributed platforms. Distinct data types and formats are generated and stored in miscellaneous repositories, posing data interoperability challenges and delaying discoveries. Data sharing and integrated access to these resources are key features for successful knowledge extraction.
In this context, this thesis makes contributions towards accelerating the semantic integration, linkage and reuse of biomedical resources. The first contribution addresses the connection of distributed and heterogeneous registries. The proposed methodology creates a holistic view over the different registries, supporting semantic data representation, integrated access and querying. The second contribution addresses the integration of heterogeneous information across scientific research, aiming to enable adequate data-sharing services. The third contribution presents a modular architecture to support the extraction and integration of textual information, enabling the full exploitation of curated data. The last contribution lies in providing a platform to accelerate the deployment of enhanced semantic information systems. All the proposed solutions were deployed and validated in the scope of rare diseases.
Contributions for the exploitation of Semantic Technologies in Industry 4.0
This research work promotes the use of semantic technologies in the Industry 4.0 setting through three contributions focused on topics in smart manufacturing: enriched component descriptions, data visualization and analysis, and the implementation of Industry 4.0 in SMEs. The first contribution is an ontology called ExtruOnt, which contains semantic descriptions of a type of manufacturing machine (the extruder). This ontology describes the components, their spatial connections, their characteristics, their three-dimensional representations and, finally, the sensors used to capture the data. The second contribution is a visual query system that uses the ExtruOnt ontology and a 2D representation of the extruder to let domain experts visualize and extract knowledge about the manufacturing process quickly and easily. The third contribution is a methodology for implementing Industry 4.0 in SMEs, oriented to the customer life cycle and powered by the use of semantic technologies and 3D rendering technologies. The contributions have been developed, applied and validated in a real manufacturing scenario.
Integrating institutional repositories into the Semantic Web
The Web has changed the face of scientific communication, and the Semantic Web promises new ways of adding value to research material by making it more accessible to automatic discovery, linking, and analysis. Institutional repositories contain a wealth of information which could benefit from the application of this technology. In this thesis I describe the problems inherent in the informality of traditional repository metadata, and propose a data model based on the Semantic Web which will support more efficient use of this data, with the aim of streamlining scientific communication and promoting efficient use of institutional research output.
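The kind of data model the thesis argues for can be suggested with a small sketch: repository metadata expressed as subject/predicate/object triples with Dublin Core-like predicates, so that records link to people rather than embedding free-text strings. The record, names, and identifiers below are invented for illustration.

```python
# Hypothetical repository record as RDF-style triples. Linking dc:creator to a
# person resource (instead of a bare string) is what enables cross-record
# discovery and analysis.

metadata = [
    ("eprint:1234", "dc:title",   "A Study of Something"),
    ("eprint:1234", "dc:creator", "person:jsmith"),
    ("person:jsmith", "foaf:name", "J. Smith"),
]

def objects(subject, predicate):
    """All objects for a (subject, predicate) pair: a one-pattern query."""
    return [o for s, p, o in metadata if s == subject and p == predicate]

# Follow the link from the record to its creator, then to the creator's name.
creator = objects("eprint:1234", "dc:creator")[0]
print(objects(creator, "foaf:name"))  # ['J. Smith']
```

With informal metadata, "J. Smith" on one record and "Jane Smith" on another cannot be reconciled automatically; a shared person resource makes the link explicit and machine-followable.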