1,460 research outputs found

    Semantic annotation of clinical questionnaires to support personalized medicine

    Master's thesis, Bioinformatics and Computational Biology, 2022, Universidade de Lisboa, Faculdade de Ciências. We live in a global era of constant technological evolution, and medicine is one of the fields that has benefited: technology plays an increasingly important role for both physicians and patients. Better tools for physicians' work are creating the conditions for patients to receive better follow-up, a better understanding of their clinical condition, and real-time updates on it. The healthcare sector produces innovations almost daily that improve the patient experience and the way physicians can exploit the information in the data for faster and more effective validation. This sector generates an ever larger volume of data, including medical reports, inertial sensor records, recordings of consultations, images, videos, and medical assessments, among them the questionnaires and clinical scales that promise patients better monitoring of their health. However, the volume, distribution, and heterogeneity of these data make them hard to process and analyse. Integrating such data is a challenge because they come from many sources and show considerable semantic heterogeneity. Semantic integration of biomedical data produces a biomedical semantic network that relates concepts across sources, which facilitates the translation of scientific discoveries and supports more complex analyses and conclusions. For this, achieving semantic interoperability of the data is crucial.
This is a very important step that allows different clinical datasets to interact within the same information system or across different systems. This integration lets analysis tools and data interfaces work on an integrated, holistic view of the data, which ultimately allows clinicians to follow their patients in a more detailed and personalised way. This dissertation was developed at LASIGE in collaboration with Campus Neurológico Sénior (CNS) and is part of a larger project that explores providing more and better data to both clinicians and patients. The project is built on a web application, DataPark, whose platform lets the user browse clinical areas, including nutrition, physiotherapy, occupational therapy, speech therapy and neuropsychology, each of which holds test batteries with various questionnaires and clinical assessment scales. This kind of clinical assessment greatly eases the physician's work because it can be administered remotely: the patient answers at a distance, and the answers are stored in DataPark, allowing the physician to track the patient's status over time on a given scale. However, the way DataPark was built restricts the physician to a questionnaire-oriented view: a physician who wants to see the patient as a whole finds the information scattered across the different questionnaires and has to go through them one by one. This dissertation addresses this challenge by building an algorithm that decomposes all the questions of the different questionnaires and enables their semantic integration, with the aim of giving the physician a holistic view oriented by clinical concept.
The entire DataPark database was extracted and served as the data source for this work; much of the original data is in Portuguese and has to be translated automatically. Starting from an initial high-level analysis of the questionnaires in the database, we began building a semantic model that could describe the data in the questionnaires and scales. All clinical concepts that could be identified were collected manually from a subset of 15 questionnaires: the 5 most answered for Parkinson's disease, the 5 most answered for stroke, and the 5 most answered that are not associated with a single specific pathology. The model was improved and evolved together with a team of 12 physicians and therapists from CNS over 7 meetings, during which a validation workshop was held that gave the model a high degree of reliability. In parallel, two studies were carried out. (i) The first assessed which ontology or ontologies give the highest coverage of the data in the subset of 15 questionnaires. It concluded that the set of ontologies in which we had the most confidence consists of LOINC, NCIT, SNOMED and OCHV, and this set was used from then on. (ii) The second assessed which machine translation tool (Google Translator or Microsoft Translator) is more reliable. To this end, we took 3 questionnaires that are stored in the database in Portuguese but whose original version is in English, translated them from Portuguese to English with both tools, and evaluated which tool performed better.
Microsoft Translator performed slightly better and was therefore chosen as the machine translation tool for our algorithm. With these two studies complete, we have the dataset standardised in a single language and the set of ontologies chosen for semantic annotation. To understand this phase of the work, note that ontologies are powerful computational tools consisting of a set of concepts or terms that name and define the entities in a given domain of interest; in biomedicine they are called biomedical ontologies. Biomedical ontologies are very useful for sharing, retrieving and extracting information in biomedicine and play a crucial role in semantic interoperability, which is precisely our final goal. We therefore semantically annotated the questions of the subset of 15 questionnaires. Semantic annotation is a process that formally associates a textual target with a concept or term, and can thereby bridge different documents or target texts that address the same concept. In other words, a semantic annotation associates a term from a given ontology with a concept present in the target text. If the target texts are questions from several questionnaires, it is natural to find questions from different diagnostic areas connected by shared ontology terms. Once annotation is complete, the semantic model is integrated with the developed algorithm, the set of ontologies and the patient data. In this way we know when a given patient has answered several questions that address the same concept; these questions are semantically linked because they have the same concept mapped.
In terms of overall performance, both the translation and the annotation processes performed acceptably: translation reached 78% accuracy, 76% recall and an F-measure of 0.77, and 87% of the annotations were correct. Overall, the main objective was achieved: a holistic, integrated view built from the semantic model and the DataPark data (questionnaires and patients).
Healthcare is a multi-domain area, with professionals from different areas often collaborating to provide patients with the best possible care. Neurological and neurodegenerative diseases are especially so, with multiple areas, including neurology, psychology, nursing, physical therapy, speech therapy and others, coming together to support these patients. The DataPark application allows healthcare providers to store, manage and analyse information about patients with neurological disorders from different perspectives, including evaluation scales and questionnaires. However, the application does not provide a holistic view of the patient status because the information is split across different domains and clinical scales. This work proposes a methodology for the semantic integration of this data. It developed the data scaffolding to afford a holistic view of the patient status that is concept-oriented rather than scale or test battery oriented. A semantic model was developed in collaboration with healthcare providers from different areas and was subsequently aligned with existing biomedical ontologies. The questionnaire and scale data were semantically annotated to this semantic model, with a translation step when the original data were in Portuguese. The process was applied to a subset of 15 scales, with a manual evaluation of each step. The semantic model includes 204 concepts and 436 links to external ontologies. Translation achieved an accuracy of 78%, whereas the semantic annotation achieved 87%.
The final integrated dataset covers 443 patients. Finally, applying the semantic annotation process to the whole dataset creates the conditions for semantic integration, which consists of crossing all questions from the different questionnaires and linking those that share the same annotation. This work allows healthcare providers to assess patients in a more global fashion, integrating data collected from different scales and test batteries that evaluate the same or similar parameters.
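The integration step described in this abstract, linking questions from different questionnaires that carry the same annotation, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the questionnaire names, question identifiers and `TERM:` identifiers are hypothetical placeholders rather than real LOINC, NCIT, SNOMED or OCHV codes, and the function name is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical annotated questions: (questionnaire, question id, ontology term ids).
# The identifiers below are placeholders, not real ontology codes.
ANNOTATED_QUESTIONS = [
    ("scale_A", "A1", {"TERM:tremor", "TERM:hand"}),
    ("scale_B", "B3", {"TERM:tremor"}),
    ("scale_B", "B4", {"TERM:sleep"}),
    ("scale_C", "C2", {"TERM:sleep", "TERM:fatigue"}),
]


def link_questions_by_concept(annotated):
    """Group questions by shared term, keeping terms that span several questionnaires."""
    by_term = defaultdict(list)
    for questionnaire, question_id, terms in annotated:
        for term in terms:
            by_term[term].append((questionnaire, question_id))
    # A term only links questionnaires if it occurs in more than one of them.
    return {
        term: sorted(questions)
        for term, questions in by_term.items()
        if len({questionnaire for questionnaire, _ in questions}) > 1
    }


links = link_questions_by_concept(ANNOTATED_QUESTIONS)
for term in sorted(links):
    print(term, links[term])
```

Indexing by term rather than by questionnaire is what gives the concept-oriented view: all answers about one concept can be retrieved together regardless of which scale asked the question.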

    Application of Neuroanatomical Ontologies for Neuroimaging Data Annotation

    The annotation of functional neuroimaging results for data sharing and re-use is particularly challenging, due to the diversity of terminologies of neuroanatomical structures and cortical parcellation schemes. To address this challenge, we extended the Foundational Model of Anatomy Ontology (FMA) to include cytoarchitectural, Brodmann area labels, and a morphological cortical labeling scheme (e.g., the part of Brodmann area 6 in the left precentral gyrus). This representation was also used to augment the neuroanatomical axis of RadLex, the ontology for clinical imaging. The resulting neuroanatomical ontology contains explicit relationships indicating which brain regions are “part of” which other regions, across cytoarchitectural and morphological labeling schemas. We annotated a large functional neuroimaging dataset with terms from the ontology and applied a reasoning engine to analyze this dataset in conjunction with the ontology, and achieved successful inferences from the most specific level (e.g., how many subjects showed activation in a subpart of the middle frontal gyrus) to more general (how many activations were found in areas connected via a known white matter tract?). In summary, we have produced a neuroanatomical ontology that harmonizes several different terminologies of neuroanatomical structures and cortical parcellation schemes. This neuroanatomical ontology is publicly available as a view of FMA at the Bioportal website. The ontological encoding of anatomic knowledge can be exploited by computer reasoning engines to make inferences about neuroanatomical relationships described in imaging datasets using different terminologies. This approach could ultimately enable knowledge discovery from large, distributed fMRI studies or medical record mining.
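The kind of inference mentioned in this abstract, counting subjects whose activations fall in a region or any of its parts, can be sketched in Python as a walk up "part of" links. This is a simplified illustration and not the reasoning engine used in the paper: the region labels are illustrative strings rather than FMA identifiers, each region is assumed to have a single parent, and the function names are hypothetical.

```python
# Hypothetical "part of" edges (child -> parent); labels are illustrative, not FMA ids.
PART_OF = {
    "BA6 in left precentral gyrus": "left precentral gyrus",
    "left precentral gyrus": "left frontal lobe",
    "left middle frontal gyrus": "left frontal lobe",
    "left frontal lobe": "left cerebral hemisphere",
}


def ancestors(region, part_of=PART_OF):
    """Return every region that `region` is transitively part of."""
    found = []
    while region in part_of:
        region = part_of[region]
        found.append(region)
    return found


def count_subjects_in(target, activations, part_of=PART_OF):
    """Count subjects with at least one activation in `target` or in one of its parts."""
    return sum(
        any(region == target or target in ancestors(region, part_of) for region in regions)
        for regions in activations.values()
    )


# Hypothetical per-subject annotations.
ACTIVATIONS = {
    "subject_1": ["BA6 in left precentral gyrus"],
    "subject_2": ["left middle frontal gyrus"],
    "subject_3": [],
}

print(count_subjects_in("left frontal lobe", ACTIVATIONS))
```

Because the query follows the hierarchy, data annotated at a fine-grained level can still answer questions posed at a coarser level.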

    The CAP cancer protocols – a case study of caCORE-based data standards implementation to integrate with the Cancer Biomedical Informatics Grid

    BACKGROUND: The Cancer Biomedical Informatics Grid (caBIG™) is a network of individuals and institutions, creating a world wide web of cancer research. An important aspect of this informatics effort is the development of consistent practices for data standards development, using a multi-tier approach that facilitates semantic interoperability of systems. The semantic tiers include (1) information models, (2) common data elements, and (3) controlled terminologies and ontologies. The College of American Pathologists (CAP) cancer protocols and checklists are an important reporting standard in pathology, for which no complete electronic data standard is currently available. METHODS: In this manuscript, we provide a case study of Cancer Common Ontologic Representation Environment (caCORE) data standard implementation of the CAP cancer protocols and checklists model, an existing and complex paper-based standard. We illustrate the basic principles, goals and methodology for developing caBIG™ models. RESULTS: Using this example, we describe the process required to develop the model, the technologies and data standards on which the process and models are based, and the results of the modeling effort. We address difficulties we encountered and modifications to caCORE that will address these problems. In addition, we describe four ongoing development projects that will use the emerging CAP data standards to achieve integration of tissue banking and laboratory information systems. CONCLUSION: The CAP cancer checklists can be used as the basis for an electronic data standard in pathology using the caBIG™ semantic modeling methodology.

    An ontology to standardize research output of nutritional epidemiology: from paper-based standards to linked content

    Background: The use of linked data in the Semantic Web is a promising approach to add value to nutrition research. An ontology, which defines the logical relationships between well-defined taxonomic terms, enables linking and harmonizing research output. To enable the description of domain-specific output in nutritional epidemiology, we propose the Ontology for Nutritional Epidemiology (ONE) according to authoritative guidance for nutritional epidemiology. Methods: Firstly, a scoping review was conducted to identify existing ontology terms for reuse in ONE. Secondly, existing data standards and reporting guidelines for nutritional epidemiology were converted into an ontology. The terms used in the standards were summarized and listed separately in a taxonomic hierarchy. Thirdly, the ontologies of the nutritional epidemiologic standards, reporting guidelines, and the core concepts were gathered in ONE. Three case studies were included to illustrate potential applications: (i) annotation of existing manuscripts and data, (ii) ontology-based inference, and (iii) estimation of reporting completeness in a sample of nine manuscripts. Results: Ontologies for food and nutrition (n = 37), disease and specific population (n = 100), data description (n = 21), research description (n = 35), and supplementary (meta) data description (n = 44) were reviewed and listed. ONE consists of 339 classes: 79 new classes to describe data and 24 new classes to describe the content of manuscripts. Conclusion: ONE is a resource to automate data integration, searching, and browsing, and can be used to assess reporting completeness in nutritional epidemiology

    An ontology of mechanisms of action in behaviour change interventions

    BACKGROUND: Behaviour change interventions influence behaviour through causal processes called “mechanisms of action” (MoAs). Reports of such interventions and their evaluations often use inconsistent or ambiguous terminology, creating problems for searching, evidence synthesis and theory development. This inconsistency includes the reporting of MoAs. An ontology can help address these challenges by serving as a classification system that labels and defines MoAs and their relationships. The aim of this study was to develop an ontology of MoAs of behaviour change interventions. METHODS: To develop the MoA Ontology, we (1) defined the ontology’s scope; (2) identified, labelled and defined the ontology’s entities; (3) refined the ontology by annotating (i.e., coding) MoAs in intervention reports; (4) refined the ontology via stakeholder review of the ontology’s comprehensiveness and clarity; (5) tested whether researchers could reliably apply the ontology to annotate MoAs in intervention evaluation reports; (6) refined the relationships between entities; (7) reviewed the alignment of the MoA Ontology with other relevant ontologies, (8) reviewed the ontology’s alignment with the Theories and Techniques Tool; and (9) published a machine-readable version of the ontology. RESULTS: An MoA was defined as “a process that is causally active in the relationship between a behaviour change intervention scenario and its outcome behaviour”. We created an initial MoA Ontology with 261 entities through Steps 2-5. Inter-rater reliability for annotating study reports using these entities was α=0.68 (“acceptable”) for researchers familiar with the ontology and α=0.47 for researchers unfamiliar with it. As a result of additional revisions (Steps 6-8), 21 further entities were added to the ontology resulting in 282 entities organised in seven hierarchical levels. CONCLUSIONS: The MoA Ontology extensively captures MoAs of behaviour change interventions. 
The ontology can serve as a controlled vocabulary for MoAs to consistently describe and synthesise evidence about MoAs across diverse sources.

    Knowledge Graphs and Large Language Models for Intelligent Applications in the Tourism Domain

    In the current era of big data, the World Wide Web is transitioning from being merely a repository of content to a complex web of data. Two pivotal technologies underpinning this shift are Knowledge Graphs (KGs) and Data Lakes. Concurrently, Artificial Intelligence has emerged as a potent means to leverage data, creating knowledge and pioneering new tools across various sectors. Among these advancements, Large Language Models (LLM) stand out as transformative technologies in many domains. This thesis delves into an integrative exploration, juxtaposing the structured world of KGs and the raw data reservoirs of Data Lakes, together with a focus on harnessing LLM to derive meaningful insights in the domain of tourism. Starting with an exposition on the importance of KGs in the present digital milieu, the thesis delineates the creation and management of KGs that utilize entities and their relations to represent intricate data patterns within the tourism sector. In this context, we introduce a semi-automatic methodology for generating a Tourism Knowledge Graph (TKG) and a novel Tourism Analytics Ontology (TAO). Through integrating information from enterprise data lakes with public knowledge graphs, the thesis illustrates the creation of a comprehensive semantic layer built upon the raw data, demonstrating versatility and scalability. Subsequently, we present an in-depth investigation into transformer-based language models, emphasizing their potential and limitations. Addressing the exigency for domain-specific knowledge enrichment, we conduct a methodical study on knowledge enhancement strategies for transformers based language models. The culmination of this thesis is the presentation of an innovative method that fuses large language models with domain-specific knowledge graphs, targeting the optimisation of hospitality offers. This approach integrates domain KGs with feature engineering, enriching data representation in LLMs. 
Our scientific contributions span multiple dimensions: from devising methodologies for KG construction, especially in tourism, to the design and implementation of a novel ontology; from the analysis and comparison of techniques for enriching LLMs with specialized knowledge, to deploying such methods in a novel framework that effectively combines LLMs and KGs within the context of the tourism domain. In our research, we explore the potential benefits and challenges arising from the integration of knowledge engineering and artificial intelligence, with a specific emphasis on the tourism sector. We believe our findings offer a promising avenue and serve as a foundational platform for subsequent studies and practical implementations for the academic community and the tourism industry alike

    User-centered semantic dataset retrieval

    Finding relevant research data is an increasingly important but time-consuming task in daily research practice. Several studies report on difficulties in dataset search, e.g., scholars retrieve only part of the pertinent data, and important information cannot be displayed in the user interface. Overcoming these problems has motivated a number of research efforts in computer science, such as text mining and semantic search. In particular, the emergence of the Semantic Web opens a variety of novel research perspectives. Motivated by these challenges, the overall aim of this work is to analyze the current obstacles in dataset search and to propose and develop a novel semantic dataset search. The studied domain is biodiversity research, a domain that explores the diversity of life, habitats and ecosystems. This thesis has three main contributions: (1) We evaluate the current situation in dataset search in a user study, and we compare a semantic search with a classical keyword search to explore the suitability of semantic web technologies for dataset search. (2) We generate a question corpus and develop an information model to determine which scientific topics scholars in biodiversity research are interested in. Moreover, we analyze the gap between current metadata and scholarly search interests, and we explore whether metadata and user interests match. (3) We propose and develop an improved dataset search based on three components: (A) a text mining pipeline, enriching metadata and queries with semantic categories and URIs, (B) a retrieval component with a semantic index over categories and URIs and (C) a user interface that enables a search within categories and a search including further hierarchical relations. Following user-centered design principles, we ensure user involvement in various user studies during the development process.

    Foreword

    The aim of this Workshop is to focus on building and evaluating resources used to facilitate biomedical text mining, including their design, update, delivery, quality assessment, evaluation and dissemination. Key resources of interest are lexical and knowledge repositories (controlled vocabularies, terminologies, thesauri, ontologies) and annotated corpora, including both task-specific resources and repositories reengineered from biomedical or general language resources. Of particular interest is the process of building annotated resources, including designing guidelines and annotation schemas (aiming at both syntactic and semantic interoperability) and relying on language engineering standards. Challenging aspects are updates and evolution management of resources, as well as their documentation, dissemination and evaluation.

    Information Extraction from Text for Improving Research on Small Molecules and Histone Modifications

    The cumulative number of publications, in particular in the life sciences, requires efficient methods for the automated extraction of information and semantic information retrieval. The recognition and identification of information-carrying units in text – concept denominations and named entities – relevant to a certain domain is a fundamental step. The focus of this thesis lies on the recognition of chemical entities and the new biological named entity type histone modifications, which are both important in the field of drug discovery. As the emergence of new research fields as well as the discovery and generation of novel entities goes along with the coinage of new terms, the perpetual adaptation of respective named entity recognition approaches to new domains is an important step for information extraction. Two methodologies have been investigated in this regard: the state-of-the-art machine learning method, Conditional Random Fields (CRF), and an approximate string search method based on dictionaries. Recognition methods that rely on dictionaries are strongly dependent on the availability of entity terminology collections as well as on their quality. In the case of chemical entities the terminology is distributed over more than 7 publicly available data sources. Joining entries and the accompanying terminology from selected resources enables the generation of a new dictionary comprising chemical named entities. Combined with the automatic processing of respective terminology – the dictionary curation – the recognition performance reached an F1 measure of 0.54. That is an improvement by 29 % in comparison to the raw dictionary. The highest recall was achieved for the class of TRIVIAL-names with 0.79. The recognition and identification of chemical named entities provides a prerequisite for the extraction of related pharmacologically relevant information from literature data.
Therefore, lexico-syntactic patterns were defined that support the automated extraction of hypernymic phrases comprising pharmacological function terminology related to chemical compounds. It was shown that 29-50 % of the automatically extracted terms can be proposed for novel functional annotation of chemical entities provided by the reference database DrugBank. Furthermore, they are a basis for building up concept hierarchies and ontologies or for extending existing ones. Subsequently, the pharmacological function and biological activity concepts obtained from text were included into a novel descriptor for chemical compounds. Its successful application for the prediction of pharmacological function of molecules and the extension of chemical classification schemes, such as the Anatomical Therapeutic Chemical (ATC), is demonstrated. In contrast to chemical entities, no comprehensive terminology resource has been available for histone modifications. Thus, histone modification concept terminology was primarily recognized in text via CRFs with an F1 measure of 0.86. Subsequently, linguistic variants of extracted histone modification terms were mapped to standard representations that were organized into a newly assembled histone modification hierarchy. The mapping was accomplished by a newly developed term mapping approach described in the thesis. The combination of term recognition and term variant resolution builds up a new procedure for the assembly of novel terminology collections. It supports the generation of a term list that is applicable in dictionary-based methods. For the recognition of histone modification in text it could be shown that the named entity recognition method based on dictionaries is superior to the used machine learning approach. In conclusion, the present thesis provides techniques which enable an enhanced utilization of textual data, hence supporting research in epigenomics and drug discovery.
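Dictionary-based recognition with approximate string matching, one of the two approaches named in this abstract, can be illustrated with a short Python sketch. The abstract does not specify the thesis's matching method, so the standard library's `difflib.get_close_matches` stands in for it here; the three-entry dictionary, the whitespace tokenisation and the similarity cutoff are all assumptions chosen for illustration.

```python
import difflib

# Hypothetical mini-dictionary; a real one would be merged from several terminology sources.
CHEMICAL_DICTIONARY = ["aspirin", "ibuprofen", "caffeine"]


def find_chemical_mentions(text, dictionary=CHEMICAL_DICTIONARY, cutoff=0.85):
    """Return (token, dictionary entry) pairs for tokens that approximately match an entry."""
    mentions = []
    for token in text.lower().split():
        token = token.strip(".,;:()")
        # get_close_matches ranks entries by similarity ratio and drops those below cutoff.
        match = difflib.get_close_matches(token, dictionary, n=1, cutoff=cutoff)
        if match:
            mentions.append((token, match[0]))
    return mentions


print(find_chemical_mentions("Patients received asprin and caffeine."))
```

The sketch also shows why dictionary quality matters: every entry is a potential match target, so noisy or ambiguous entries translate directly into false positives unless the dictionary is curated.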