16 research outputs found

    Bioinformatics Analyses of Alternative Splicing: Prediction of alternative splicing events in animals and plants using Machine Learning and analysis of the extent and conservation of subtle alternative splicing

    Alternative splicing (AS) is a mechanism by which a multi-exon gene can express different transcripts and thus different proteins. AS contributes substantially to the complexity and diversity of eukaryotic transcriptomes and proteomes. Over the past ten years, bioinformatics has made decisive contributions to our understanding of AS with regard to the prevalence, extent, and conservation of its various classes, as well as its evolution, regulation, and biological function. Large-scale detection of AS has mostly relied on genome- and transcriptome-wide alignment of EST and mRNA data and on microarray analyses, all of which are largely based on bioinformatics methods. These were complemented by computational methods for characterizing and predicting AS, which show how constitutive and alternative splice sites and exons differ. This dissertation presents bioinformatics analyses of selected aspects of AS. In the first part, I developed methods for predicting AS without relying on datasets of expressed sequences. In particular, I further developed approaches for predicting cassette exons using Bayesian networks (BNs) and established new discriminating features. These markedly improved the true positive rate from the published 50% to 61%, at a stringent false positive rate of only 0.5%. I was able to show that exons that were labeled as constitutive, but to which the BN assigned a high probability of being alternative, were indeed confirmed as alternative by the most recent expression data. Given the same datasets and features, the performance of a BN matches that of a published support vector machine (SVM), which indicates that reliable classification results depend more on the features than on the choice of classifier.
In the second part, I extended the BN approach to a large and evolutionarily widespread class of AS events known as NAGNAG tandem splice sites, in which the alternative splice sites are only 3 nucleotides (nt) apart. Careful compilation of the training and test datasets for predicting NAGNAG AS contributed to a balanced sensitivity and specificity of 92%. Predictions of a BN trained on the combined dataset were experimentally confirmed in 81% (38/47) of cases. This study thus generated one of the most extensive datasets currently available for the experimental verification of AS predictions. A BN trained on human data achieves similarly good results for four other vertebrate genomes. Only slight losses in predictions for Drosophila melanogaster and Caenorhabditis elegans indicate that the underlying splicing mechanism appears to be conserved across large evolutionary distances. Finally, I used the prediction accuracy from the experimental validation to estimate the number of still undiscovered alternative NAGNAGs. The results suggest that the mechanism of NAGNAG AS is simple, stochastic, and conserved among vertebrates and beyond. Furthermore, I applied the BN approach to characterize and predict NAGNAG AS in Physcomitrella patens, a moss. This is one of the first studies to predict AS in plants without relying on datasets of expressed sequences. We achieved results similar to those of our other work on predicting NAGNAG AS. An independent validation using 454 next-generation sequencing data showed true positive rates of 64%-79% for well-supported cases of NAGNAG AS. The mechanism of NAGNAG AS in plants therefore appears to resemble that in animals.
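
To make the NAGNAG motif concrete, the following is a minimal sketch, not the thesis's Bayesian network method. It only scans a DNA string for the six-base pattern NAGNAG (N being any base) and recomputes the reported validation rate of 38 confirmed out of 47 tested predictions. The function names `find_nagnag` and `validation_rate` are hypothetical, and the scan ignores intron/exon context, which a real acceptor-site analysis would need.

```python
import re

# Zero-width lookahead so that overlapping motifs (e.g. in "CAGCAGCAG")
# are all reported. N is taken to be any of A, C, G, T (an assumption).
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, motif) for every NAGNAG occurrence in seq."""
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq.upper())]


def validation_rate(confirmed: int, tested: int) -> float:
    """Fraction of tested predictions that were experimentally confirmed."""
    return confirmed / tested


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any genome.
    print(find_nagnag("ttttcagcagGTAC"))
    # Figures 38 and 47 are the ones quoted in the abstract.
    print(f"{validation_rate(38, 47):.0%}")
```

The two candidate acceptor AGs inside each reported motif end 3 nt apart, which is the spacing the abstract describes.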

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, the thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that predictive models built with the methods guided by our new framework (Octopus) earned domain experts' approval for their reliable performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to prioritise predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than the traditional imbalanced data resampling techniques. 
Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods were developed within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.

    CHARACTERIZATION OF SINGLE RESIDUE VARIATIONS IN THE HUMAN POPULATION AND IN DISEASE: FUNCTIONAL IMPACT, STRUCTURAL IMPACT, AND DISTRIBUTION PATTERN

    We have investigated the properties of three sets of human missense genetic variations: cancer somatic mutations, monogenic disease causing mutations, and population SNPs, from the point of view of their impact on molecular function, distribution propensity in different protein structure environments, and disease mechanism. Cancer genome sequencing projects have identified a large number of somatic missense mutations in cancers. We have used two analysis methods in the SNPs3D software package to assess the impact of these variants on protein function in vivo. One method identifies those mutations that significantly destabilize three dimensional protein structure, and the other detects all types of effect on protein function, utilizing sequence conservation. Data from a set of breast and colorectal tumors were analyzed. In known cancer genes, approaching 100% of missense mutations are found to impact protein function, supporting the view that these methods are appropriate for identifying driver mutations. Overall, we estimate that 50% to 60% of all somatic missense mutations have a high impact on structure stability or more generally affect the function of the corresponding proteins. This fraction is similar to the fraction of all possible missense mutations that have high impact, and much higher than the corresponding one for human population SNPs, at about 30%. We found that the majority of mutations in tumor suppressors destabilize protein structure, while mutations in oncogenes operate in more varied ways, including destabilization of the less active conformational states. A set of possible drivers with high impact is suggested. We also studied a set of germline missense variants in phenylalanine hydroxylase, found in phenylketonuria (PKU) patients. With the aid of SNPs3D, we reinforced the previous finding that a high proportion of disease missense mutations affect protein stability, rather than other aspects of protein structure and function. 
We then focused on the relationship between the presence of these stability-damaging missense mutations and the corresponding experimental data for the level and activity of the PAH protein product present under 'in vivo'-like conditions. We found that, overall, destabilizing mutations result in substantially lower protein levels, but with the maintenance of wild-type-like specific activity. The overall agreement between predicted stability impact and experimental evidence for lower protein levels is high, and in accordance with the previous estimates of error rates for the methods. We next investigated the involvement of missense single base variants in the interface between two interacting proteins and their role in disease. This work consisted of three steps: first, mapping of variants onto the protein structure and identification of those in the interaction interfaces; second, distribution enrichment analysis in three structure locations (protein interior, surface, and interface); and third, impact analysis with SNPs3D. Nearly a quarter of disease-causing mutations are mapped onto protein interfaces, with a strong propensity for the heteromeric interfaces, indicating that interruption of functional contacts between proteins is a significant disease mechanism. We found the enrichment propensity in the interfaces is intermediate between protein surface and interior for all three types of variants considered, namely SNPs, inter-species variants, and disease mutations. We also found missense SNPs and inter-species variants share the same enrichment pattern, with a relatively high density on the protein surface and depletion in the interior. In contrast, the disease mutations display the reverse pattern, with interior and interface the most susceptible places.

    Analysis of bronchoalveolar lavage transcriptome profiles of asthmatic horses by single-cell mRNA sequencing

    Severe equine asthma (SEA) is a common respiratory condition of horses, whose underlying immune mechanisms remain to be elucidated. In this thesis project, we took advantage of the recently developed single-cell mRNA sequencing (scRNA-seq) technology to investigate the immunological landscape of equine bronchoalveolar lavage fluid (BALF) cells in both health and disease. Initially, we conducted a pilot experiment involving three horses to demonstrate the feasibility of scRNA-seq on cryopreserved equine BALF samples. Although the experiment was successful, the proportion of reads aligning to the annotated equine reference transcriptome was suboptimal. To address this, we generated a custom equine BALF transcriptome using long-read sequencing, aiming to improve the quality of 3'-UTR annotation and document BALF-specific isoforms. While we identified several novel isoforms, the read mapping percentage did not improve when aligning our scRNA-seq transcripts to the custom transcriptome. By extending the 3'-UTRs of the existing reference annotation, we achieved a satisfactory read mapping percentage, enabling subsequent qualitative downstream analysis. Our scRNA-seq dataset encompassed six major cell populations: monocytes-macrophages, neutrophils, T cells, B cells and dendritic cells. Within the monocyte-macrophage and T cell groups, we identified previously uncharacterized cell subtypes. Encouraged by these findings, we applied our optimized experimental protocol and analysis pipeline to study SEA. scRNA-seq analysis of cryopreserved BALF cells from 6 asthmatic horses and 5 healthy controls revealed the same major cell populations as observed in the pilot study. In addition to T cells and monocytes-macrophages, we characterized several cell subtypes within the B cell, dendritic cell and neutrophil populations. Differential gene expression analysis revealed a strong T helper (Th)17 signature in SEA, primarily driven by monocytes-macrophages and T cells. 
Notably, BALF from SEA horses was enriched in B cells, with a lower proportion of activated plasma cells. Neutrophils in the SEA group displayed increased migratory capacity and a heightened propensity to form neutrophil extracellular traps (NETs). An intriguing finding in both scRNA-seq experiments was the detection of a dual monocyte-lymphocyte population, potentially representing genuine cellular complexes engaged in an immunological synapse. In summary, this thesis project represents pioneering work employing scRNA-seq in the field of equine pulmonology. Our findings support a predominant Th17 immune pathway in SEA, necessitating further investigation to improve diagnostic tools and therapeutic management of severely asthmatic horses

    The Gene Ontology Handbook

    bioinformatics; biotechnolog

    Nuclear receptor networks in the normal breast and breast cancer

    Nuclear receptors (NRs) have been targets of intensive drug development for decades due to their roles as key regulators of multiple developmental, physiological and disease processes. In the normal breast, a number of NRs are reported to be differentially expressed in different epithelial breast cell lineages and likely play a role in the differentiation and maintenance of the normal breast epithelial cell lineages. In breast cancer, expression of the estrogen and progesterone receptors remains clinically important in predicting prognosis and determining therapeutic strategies. More recently, there is growing evidence suggesting the involvement of multiple nuclear receptors, other than the estrogen and progesterone receptors, in the regulation of various processes important to the initiation and progression of breast cancer. Identification of key NRs and the pathways they govern in the normal breast and breast cancer is important to our understanding of normal breast development and paves the way for rational design of prognostic and therapeutic targets for breast cancer. This thesis systematically investigates the expression and co-expression networks of NRs in the normal breast and how they are perturbed in breast cancer, with a focus on the identification of network-based prognostic markers for breast cancer. This is done through analysis of multiple expression datasets, both publicly available and generated in-house, of primary normal breast and breast cancer tissues. Among the main findings of this work are the identification of NRs differentially expressed in normal breast epithelial cells at the single-cell level and the observation that there are major changes in the NR co-expression networks in breast cancer compared to the normal breast. 
We showed that cancer-associated changes in NR co-expression networks are clinically relevant and that these changes can be used to identify NRs with prognostic value in estrogen receptor negative (ER-), HER2 and Basal subgroups of breast cancer. In addition, we demonstrated the utility of co-expression analysis in the identification of potential crosstalk in the signalling networks of different NRs by investigating the potential crosstalk of MR and RARB in the normal breast and breast cancer.

    On quantitative issues pertaining to the detection of epistatic genetic architectures

    Converging empirical evidence portrays epistasis (i.e., gene-gene interaction) as a ubiquitous property of genetic architectures and protagonist in complex trait variability. While researchers employ sophisticated technologies to detect epistasis, the scarcity of robust instances of detection in human populations is striking. To evaluate the empirical issues pertaining to epistatic detection, we analytically characterize the statistical detection problem and elucidate two candidate explanations. The first examines whether population-level manifestations of epistasis arising in nature are small; consequently, for sample-sizes employed in research, the power delivered by detectors may be disadvantageously small. The second considers whether gene-environmental association generates bias in estimates of genotypic values diminishing the power of detection. By simulation study, we adjudicate the merits of both explanations and the power to detect epistasis under four digenic architectures. In agreement with both explanations, our findings implicate small epistatic effect-sizes and gene-environmental association as mechanisms that obscure the detection of epistasis
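
As a numerical illustration of what a gene-gene interaction looks like, here is a minimal sketch under a deliberately simplified assumption: two loci with two genotype classes each, so genotypic values form a 2x2 table. This is not the authors' detector or any of their four digenic architectures, which the abstract does not specify. The function name `interaction_contrast` and the example values are hypothetical.

```python
def interaction_contrast(g: list[list[float]]) -> float:
    """Departure from additivity in a 2x2 table of genotypic values.

    g[i][j] is the genotypic value for class i at locus A and class j
    at locus B. The contrast is zero exactly when the two loci combine
    additively; a nonzero value indicates statistical interaction.
    """
    return g[1][1] - g[1][0] - g[0][1] + g[0][0]


if __name__ == "__main__":
    # Hypothetical genotypic values chosen for illustration only.
    additive = [[0.0, 1.0], [2.0, 3.0]]   # locus effects simply add up
    epistatic = [[0.0, 1.0], [2.0, 5.0]]  # extra effect when both are present
    print(interaction_contrast(additive))
    print(interaction_contrast(epistatic))
```

A small contrast relative to sampling noise is the situation the first candidate explanation describes: the interaction exists but is hard to detect at typical sample sizes.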

    Automatic generation of software interfaces for supporting decision-making processes. An application of domain engineering & machine learning

    Data analysis is a key process to foster knowledge generation in particular domains or fields of study. With a strong informative foundation derived from the analysis of collected data, decision-makers can make strategic choices with the aim of obtaining valuable benefits in their specific areas of action. However, given the steady growth of data volumes, data analysis needs to rely on powerful tools to enable knowledge extraction. Information dashboards offer a software solution to analyze large volumes of data visually to identify patterns and relations and make decisions according to the presented information. But decision-makers may have different goals and, consequently, different necessities regarding their dashboards. Moreover, the variety of data sources, structures, and domains can hamper the design and implementation of these tools. This Ph.D. Thesis tackles the challenge of improving the development process of information dashboards and data visualizations while enhancing their quality and features in terms of personalization, usability, and flexibility, among others. Several research activities have been carried out to support this thesis. First, a systematic literature mapping and review was performed to analyze different methodologies and solutions related to the automatic generation of tailored information dashboards. The outcomes of the review led to the selection of a model-driven approach in combination with the software product line paradigm to deal with the automatic generation of information dashboards. In this context, a meta-model was developed following a domain engineering approach. This meta-model represents the skeleton of information dashboards and data visualizations through the abstraction of their components and features and has been the backbone of the subsequent generative pipeline of these tools. 
The meta-model and generative pipeline have been tested through their integration in different scenarios, both theoretical and practical. Regarding the theoretical dimension of the research, the meta-model has been successfully integrated with other meta-models to support knowledge generation in learning ecosystems, and as a framework to conceptualize and instantiate information dashboards in different domains. In terms of the practical applications, the focus has been put on how to transform the meta-model into an instance adapted to a specific context, and how to finally transform this latter model into code, i.e., the final, functional product. These practical scenarios involved the automatic generation of dashboards in the context of a Ph.D. Programme, the application of Artificial Intelligence algorithms in the process, and the development of a graphical instantiation platform that combines the meta-model and the generative pipeline into a visual generation system. Finally, different case studies have been conducted in the employment and employability, health, and education domains. The number of applications of the meta-model in theoretical and practical dimensions and domains is also a result in itself. Every outcome associated with this thesis is driven by the dashboard meta-model, which also proves its versatility and flexibility when it comes to conceptualizing, generating, and capturing knowledge related to dashboards and data visualizations.

    Quantifying Human Dietary Change over the Last 30,000 Years

    Dietary change has been linked to many aspects of human evolution over the last three million years, including tool use, brain size increase, aerobic capacity and gut biology. Furthermore, failure to adapt to dietary changes over the last 10,000 years has been implicated in a number of complex and chronic diseases including obesity, type II diabetes, some cancers and coronary heart disease. Such ‘diseases of modernity’ are more common in agrarian and industrial societies than among hunter-gatherers, and it has been argued that this is due to a mismatch between modern diets and the ancestral diets to which our metabolism should be optimised. The aims of this research have grown out of the qualitative studies that perpetuate narratives around human and hominin diets, particularly around the central theme of dietary mismatch and ‘paleo’-named diets. In this work, I investigate nutrient-level differences between modern post-industrial diets, modern hunter-gatherer diets, prehistoric (Palaeolithic, Neolithic and Bronze Age) diets reconstructed from archaeological data, clinical intervention diets, fad diets including The Paleo Diet, Keto Diet and Atkins Diet, fast food diets and milk. Using these data, I develop a hypothesis on the evolution of dietary choice. Modern diets are enriched for certain nutrients, for some of which we have strong taste avidities (e.g. sodium, sucrose, starch, certain fatty acids). By quantifying differences in inferred nutrient profiles between ancestral and modern diets, I examine the nutrients enriched in modern diets, the trajectories of nutrient composition change through time, what might be driving these changes, and why we have evolved taste preferences for some nutrients that in a modern setting are considered ‘unhealthy’. I also examine how nutrients correlate in ancestral foods and explore if avidities for nutrients enriched in modern diets would lead to healthy nutrient profiles in an ancestral setting