417 research outputs found

    Statistical inference from large-scale genomic data

    This thesis explores the potential of statistical inference methodologies in functional genomics. It summarises algorithmic findings in this field and provides step-by-step analytical methodologies for deciphering biological knowledge from large-scale genomic data, mainly microarray gene expression time series. The thesis covers a range of topics in the investigation of complex multivariate genomic data. One focus is clustering as a method of inference; another is cluster validation to extract meaningful biological information from the data. Information gained from these techniques can then be used jointly to elucidate gene regulatory networks, the ultimate goal of this type of analysis. First, a new tight clustering method for gene expression data is proposed to obtain tighter and potentially more informative gene clusters. Next, to fully utilise biological knowledge in cluster validation, a validity index is defined based on Gene Ontology, one of the most important ontologies in the bioinformatics community. The method bridges a gap in the current literature in that it takes into account not only the variation of Gene Ontology categories in biological specificity and their significance to the gene clusters, but also the complex structure of the Gene Ontology. Finally, Bayesian probability is applied to inference from heterogeneous genomic data, integrated with the earlier work in this thesis, with the aim of large-scale gene network inference. The proposed system uses a stochastic process to achieve robustness to noise, yet remains efficient enough for large-scale analysis. Ultimately, the solutions presented in this thesis serve as building blocks of an intelligent system for interpreting large-scale genomic data and understanding the functional organisation of the genome.
    EThOS - Electronic Theses Online Service, United Kingdom

    Technologies to enhance self-directed learning from hypertext

    With the growing popularity of the World Wide Web, materials presented to learners as hypertext have become a major instructional resource. Despite the potential of hypertext to facilitate access to learning materials, self-directed learning from hypertext raises several concerns. Self-directed learners, because of their different viewpoints, may follow different navigation paths and thus interact with knowledge in different ways. Learners can therefore end up disoriented or cognitively overloaded because of the gap between what they need and what actually exists on the Web. In addition, while much research has gone into supporting the task of finding Web resources, less attention has been paid to supporting the interpretation of Web pages. When learners cannot interpret the content of a page, they interrupt their browsing to seek help from other people or from explanatory learning materials. Such interruptions can weaken learner engagement and lower motivation to learn. This thesis aims to promote self-directed learning from hypertext resources by proposing solutions to these problems. It first presents Knowledge Puzzle, a tool that takes a constructivist approach to learning from the Web. Its main contribution to Web-based learning is that self-directed learners can adapt the path of instruction and the structure of hypertext to their own way of thinking, regardless of how the Web content is delivered. This can reduce the gap between what they need and what exists on the Web. SWLinker, the second system proposed in this thesis, supports the interpretation of Web pages using ontology-based semantic annotation. It is an extension to the Internet Explorer Web browser that automatically creates a semantic layer of explanatory information and instructional guidance over Web pages. It also aims to break the conventional view of Web browsing as an individual activity by leveraging the notion of ontology-based collaborative browsing. Both tools were evaluated by students in the context of particular learning tasks. The results show that they fulfilled their intended goals by facilitating learning from hypertext without introducing high overheads in usability or browsing effort.

    Provenance, propagation and quality of biological annotation

    PhD Thesis
    Biological databases have become an integral part of the life sciences, being used to store, organise and share ever-increasing quantities and types of data. Biological databases are typically centred around raw data, with each entry assigned to a single piece of biological data, such as a DNA sequence. Although the raw data is essential, a reader can obtain little information from it alone. Many databases therefore supplement their entries with annotation, which conveys the current knowledge about the underlying data to the reader. Although annotations come in many different forms, most databases provide some form of free text annotation. Given that annotations can form the foundations of future work, it is important that a user is able to evaluate the quality and correctness of an annotation. However, this is rarely straightforward. The amount of annotation, and the way in which it is curated, varies between databases. For example, the production of an annotation in some databases is entirely automated, without any manual intervention. Further, sections of annotations may be reused, being propagated between entries and, potentially, external databases. This provenance and curation information is not always apparent to a user. The work described within this thesis explores issues relating to biological annotation quality. While the most valuable annotation is often contained within free text, its lack of structure makes it hard to assess. Initially, this work describes a generic approach that allows textual annotations to be quantitatively measured. This approach is based upon the application of Zipf's Law to words within textual annotation, resulting in a single value, α. The relationship between the α value and Zipf's principle of least effort provides an indication of the annotation's quality, whilst also allowing annotations to be quantitatively compared. Secondly, the thesis focuses on determining annotation provenance and tracking any subsequent propagation. This is achieved through the development of a visualisation framework, which exploits the reuse of sentences within annotations. Using this framework, a number of propagation patterns were identified, which on analysis appear to indicate low-quality and erroneous annotation. Together, these approaches increase our understanding of the textual characteristics of biological annotation, and suggest that this understanding can be used to increase the overall quality of these resources.
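As a rough illustration of how a single Zipf value can be obtained from annotation text, the sketch below ranks words by frequency and fits a straight line to the log-log rank-frequency curve, reporting the negated slope as the exponent. The abstract does not say how the thesis fits its value, so the least-squares fit, the tokenisation rule, and the function name `zipf_alpha` are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def zipf_alpha(text: str) -> float:
    """Estimate a Zipf exponent for `text`.

    Assumption: the exponent is taken as the negated least-squares slope of
    log(frequency) against log(rank). Other estimators exist and may differ.
    """
    # Hypothetical tokenisation: lowercase runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares slope of ys on xs.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

Calling `zipf_alpha(annotation_text)` on the free text of two entries gives two numbers that can be compared directly, which is the kind of quantitative comparison the abstract describes; how a given value maps to annotation quality is a matter for the thesis itself.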

    Conceptualization of Computational Modeling Approaches and Interpretation of the Role of Neuroimaging Indices in Pathomechanisms for Pre-Clinical Detection of Alzheimer Disease

    With swift advancements in next-generation sequencing technologies and the voluminous growth of biological data, a variety of data resources such as databases and web services have been created to facilitate data management, accessibility, and analysis. However, the burden of interoperability between dynamically growing data resources is an increasingly rate-limiting step in biomedicine, particularly in neurodegeneration research. Over the years, massive investments and technological advancements in dementia research have resulted in large proportions of unmined data. Accordingly, there is an essential need for intelligent and integrative approaches to mine the available data and substantiate novel research outcomes. Semantic frameworks provide a unique possibility to integrate multiple heterogeneous, high-resolution data resources with semantic integrity, using standardized ontologies and vocabularies for context-specific domains. In this work, (i) the functionality of a semantically structured terminology for mining pathway-relevant knowledge from the literature, called Pathway Terminology System, is demonstrated and (ii) a context-specific, high-granularity semantic framework for neurodegenerative diseases, known as NeuroRDF, is presented. Neurodegenerative disorders are especially complex, as they are characterized by widespread manifestations and the potential for dramatic alterations in disease progression over time. Early detection and prediction strategies based on clinical pointers can provide promising solutions for effective treatment of AD. In this work, we present the importance of bridging the gap between clinical and molecular biomarkers to contribute effectively to dementia research. Moreover, we address the need for a formalized framework, called NIFT, to automatically mine relevant clinical knowledge from the literature for substantiating high-resolution cause-and-effect models.

    Theory and Applications for Advanced Text Mining

    Due to the growth of computer and web technologies, large amounts of text data can now be collected and stored easily, and these data are expected to contain useful knowledge. Text mining techniques for extracting that knowledge have been studied intensively since the late 1990s. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to techniques for under-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields.

    Extracting knowledge from documents related with invasive fungal infections in iron overload context

    Master's dissertation in Bioinformatics
    Invasive fungal infections caused by Candida are associated with high mortality and morbidity rates in hospitalized patients. Iron plays a major role in these infections, as they are exacerbated under iron overload conditions. In this context, it is important to understand the association between iron levels and invasive fungal infections, as it can serve as an indicator of the severity of the disease and may eventually help establish measures to improve treatment efficacy. Manually inferring these associations from biomedical documents is a time-consuming task, owing to the large amount of available scientific text data. Such tasks therefore benefit naturally from the field of Biomedical Text Mining, which includes a wide variety of methods for the automatic extraction of high-quality information from biomedical text documents. In this work, relevant documents related to iron overload and fungal infections were retrieved from PubMed to build a corpus. Then, both Named Entity Recognition and Relation Extraction processes were executed using the @Note text mining tool. Finally, relevant sentences were manually extracted and a curated dataset of the documents containing those sentences was created. Since the number of publications obtained about Candida and iron overload was very low, the analysis took all fungi into account. A total of 15 publications were considered relevant and 168 relevant associations were extracted. Although associations of iron levels with both severity of infection and treatment efficacy were not extracted, it was possible to conclude that, in many cases, iron overload is a predictor of fungal infections, and that patients' iron levels strongly affect treatment efficacy. The Biomedical Text Mining process described in this thesis enabled the creation of a dataset of relevant biomedical publications containing interesting associations between fungal infections, drugs and associated diseases in a clinical context of iron overload. In the future this process could be improved, especially regarding dictionaries, in order to obtain a higher number of relevant publications.

    Developing a workflow for the multi-omics analysis of Daphnia

    In the era of multi-omics, making reasonable statistical inferences through data integration is challenged by data heterogeneity, dimensionality constraints, and data harmonization. The biological system is presumed to function as a network in which genes are nodes and the physical relationships between them are represented by links (edges) connecting genes that interact. This thesis aims to develop a new and efficient workflow for analysing multi-omics data from non-model organisms, intended for researchers occupied with the biological questions and built on readily available software tools. The proposed approach was applied to the transcriptome and metabolome data of Daphnia magna under various dose rates of gamma radiation. The first part of this workflow compares and contrasts the transcriptional regulation of short- and long-term gamma radiation exposure. Groups of genes that share a similar expression pattern across different samples under the same conditions are known as modules, and they are likely to be functionally relevant. Modules were identified using WGCNA, but biologically meaningful modules (significant modules) were selected through a novel approach that associates these candidate modules with genes whose expression levels were significantly altered by radiation (i.e. differentially expressed genes). Dynamic transcriptional regulation was modelled using transcription factor (TF) DNA binding patterns to associate TFs with the expression responses captured by the modules. The biological functions of significant modules and their TF regulators were verified with functional annotations and mapped onto the proposed Adverse Outcome Pathways (AOP) of D. magna, which describe the key events that contribute to fecundity reduction. The findings demonstrate that short-term radiation impacts are entirely different from long-term impacts and cannot be used for long-term prediction. The second part investigates the coordination of gene expression and metabolites with differential abundances induced by different gamma dose rates, and the underlying mechanisms contributing to the varying extent of the reduction in fecundity. Significant modules that belong to the same design model of dose rates were combined and annotated with new functionality. The abundance of metabolites was modelled with the same design model. Integrated pathway enrichment analysis was performed to discover and create pathway diagrams for visualising the multi-omics output. Finally, the performance of this workflow in explaining the reduction of fecundity of D. magna, which has not been described in previous studies, was evaluated. Combining the information from the metabolome and transcriptome data, new insights suggest that alteration of the cell cycle is the underlying mechanism contributing to the varying reduction of fecundity under different dose rates of radiation.

    Literature mining and network analysis in Biology

    This thesis presents OnTheFly2.0, a versatile web-based tool dedicated to the extraction and subsequent analysis of biomedical terms from individual files. More specifically, OnTheFly2.0 supports different file formats and enables simultaneous file handling. The integration of the EXTRACT tagging service provides Named Entity Recognition (NER) for genes/proteins, chemical compounds, organisms, tissues, environments, diseases, phenotypes and Gene Ontology terms, as well as popup windows that give concise, context-related information about each identified term, accompanied by links to various databases. Once named entities such as proteins, genes and chemicals are identified, they can be explored further via functional and publication enrichment analysis, or be associated with diseases and with protein domains reported in protein family databases. Finally, protein-protein and protein-chemical associations can be visualized through interactive networks generated from the STRING and STITCH services, respectively. OnTheFly2.0 currently supports 197 species and is available at http://onthefly.pavlopouloslab.info