68 research outputs found

    2016 Undergraduate Research Symposium Abstract Book

    Abstract book from the 2016 Sixteenth Annual UMM Undergraduate Research Symposium (URS), which celebrates student scholarly achievement and creative activities.

    Gene-gene interaction analysis in high-dimensional genomic data

    Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Bioinformatics, February 2015. Taesung Park. With the development of high-throughput genotyping and sequencing technology, there is growing evidence of association between genetic variants and common complex traits. Although thousands of genetic variants have been discovered, these markers have been shown to explain only a very small proportion of the underlying genetic variance of complex traits. Gene-gene interaction (GGI) analysis and rare variant analysis are expected to account for a large portion of the unexplained heritability of complex traits. GGI analysis raises several practical issues. First, GGI analysis of high-dimensional genomic data requires methods that are both computationally efficient and accurate. Second, GGI is hard to detect for rare variants because of their sparsity. Third, analysing GGI at genome-wide scale carries a heavy computational burden, because a huge search space must be explored and a much greater number of tests is needed to find the optimal GGI. For k variants, there are k(k-1)/2 combinations for second-order interactions and kCn combinations for n-order interactions. The number of possible interaction models grows combinatorially as the interaction order or the number of variants increases. Fourth, although the biological interpretation of GGI is important, GGI is hard to interpret because of its complexity. To overcome these four main issues in GGI analysis of high-dimensional genomic data, four novel methods are proposed. First, to provide an efficient GGI method, we propose IGENT (Information theory-based GEnome-wide gene-gene iNTeraction method). IGENT is an efficient algorithm for identifying genome-wide GGI and gene-environment interaction (GEI). Detecting significant GGIs at genome-wide scale requires a substantial reduction in computational burden. IGENT uses information gain (IG) and evaluates its significance without resampling. 
Our simulation studies show that the power of IGENT is better than or equivalent to that of BOOST. The proposed method successfully detected GGI for bipolar disorder in the Wellcome Trust Case Control Consortium (WTCCC) data and for age-related macular degeneration (AMD). Second, for GGI analysis of rare variants, we propose a new gene-gene interaction method in the framework of multifactor dimensionality reduction (MDR) analysis. The proposed method consists of two steps. The first step collapses the rare variants in a specific region, such as a gene. The second step performs MDR analysis on the collapsed rare variants. The proposed method is applied to whole exome sequencing data from a Korean population to identify causal gene-gene interactions of rare variants for type 2 diabetes (T2D). Third, to increase computational performance for GGI at genome-wide scale, we developed CUDA (Compute Unified Device Architecture) based genome-wide association MDR (cuGWAM) software, which uses efficient hardware accelerators. In our simulation studies, cuGWAM performed better than CPU-based MDR methods and other GPU-based methods. Fourth, to provide statistical interpretation and biological evidence for gene-gene interactions efficiently, we developed VisEpis, a tool for visualizing gene-gene interactions in genetic association analysis and for mapping epistatic interactions to biological evidence from public interaction databases. Using an interaction network and a circular plot, VisEpis lets users explore the interaction network integrated with biological evidence at the levels of epigenetic regulation, splicing, transcription, translation, and post-translation. 
To aid interpretation of statistical interaction at the genotype level, VisEpis provides checkerboard, pairwise checkerboard, forest, funnel, and ring charts.
Contents: Abstract; Contents; List of Figures; List of Tables; 1 Introduction; 1.1 Background of high-dimensional genomic data; 1.1.1 History of genome-wide association studies (GWAS); 1.1.2 Association studies of massively parallel sequencing (MPS); 1.1.3 Missing heritability and proposed alternative methods; 1.2 Purpose and novelty of this study; 1.3 Outline of the thesis; 2 Overview of gene-gene interaction; 2.1 Definition of gene-gene interaction; 2.2 Practical issues of gene-gene interaction; 2.3 Overview of gene-gene interaction methods; 2.3.1 Regression-based gene-gene interaction methods; 2.3.2 Multifactor dimensionality reduction (MDR); 2.3.3 Gene-gene interaction methods using machine learning methods; 2.3.3 Entropy-based gene-gene interaction methods; 3 Entropy-based gene-gene interaction; 3.1 Introduction; 3.2 Methods; 3.2.1 Entropy-based gene-gene interaction analysis; 3.2.2 Exhaustive searching approach and stepwise selection approach; 3.2.3 Simulation setting; 3.2.4 Genome-wide data for bipolar disorder (BD); 3.2.5 Genome-wide data for age-related macular degeneration (AMD); 3.3 Results; 3.3.1 Simulation results; 3.3.2 Analysis of WTCCC bipolar disorder (BD) data; 3.3.3 Analysis of age-related macular degeneration (AMD) data; 3.4 Discussion; 3.5 Conclusion; 4 Gene-gene interaction for rare variants; 4.1 Introduction; 4.2 Methods; 4.2.1 Collapsing-based gene-gene interaction; 4.2.2 Simulation setting; 4.3 Results; 4.3.1 Simulation study; 4.3.2 Real data analysis of the type 2 diabetes data; 4.4 Discussion and Conclusion; 5 Computation enhancement for gene-gene interaction; 5.1 Introduction; 5.2 Methods; 5.2.1 MDR implementation; 5.2.2 Implementation using high-performance computation based on GPU; 5.2.3 Environment of performance comparison; 5.3 Results; 5.3.1 Computational improvement; 5.4 Discussion; 5.5 Conclusion; 6 Visualization for gene-gene interaction interpretation; 6.1 Introduction; 6.2 Methods; 6.2.1 Interaction mapping procedure; 6.2.1 Checkerboard plot; 6.2.2 Forest and funnel plot; 6.3 Case study; 6.3.1 Interpretation of gene-gene interaction in WTCCC bipolar disorder data; 6.3.2 Interpretation of gene-gene interaction in age-related macular degeneration (AMD) data; 6.4 Conclusion; 7 Summary and Conclusion; Bibliography; Abstract (Korean)
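
The abstract above names information gain (IG) as the statistic behind IGENT and quotes the number of candidate interaction models. The sketch below illustrates both ideas on made-up data. It is a generic textbook IG computation, not the IGENT algorithm: the function names, the toy genotypes, and the choice of log base 2 are all assumptions, and the resampling-free significance test mentioned in the abstract is not reproduced.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(phenotype, genotype):
    """IG = H(phenotype) - H(phenotype | genotype)."""
    n = len(phenotype)
    groups = {}
    for g, y in zip(genotype, phenotype):
        groups.setdefault(g, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(phenotype) - conditional


def interaction_model_count(k, order):
    """Number of ways to choose `order` variants out of k, i.e. C(k, order)."""
    return math.comb(k, order)


# Hypothetical toy data: case/control status depends only on the
# combination of the two variants (an XOR pattern), not on either alone.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [a ^ b for a, b in zip(snp_a, snp_b)]

ig_a = information_gain(status, snp_a)
ig_b = information_gain(status, snp_b)
ig_pair = information_gain(status, list(zip(snp_a, snp_b)))
pair_models = interaction_model_count(1000, 2)
```

In this toy case each variant alone carries no information about status, while the pair determines it completely, which is the kind of signal a single-variant scan would miss. The pair count for 1,000 variants matches k(k-1)/2.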

    Pacific Symposium on Biocomputing 2023

    The Pacific Symposium on Biocomputing (PSB) 2023 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2023 will be held on January 3-7, 2023 in Kohala Coast, Hawaii. Tutorials and workshops will be offered prior to the start of the conference. PSB 2023 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. It is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology. The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders of research in biocomputing's 'hot topics.' In this way, the meeting provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.

    Genetics of membranous nephropathy

    Autoimmune membranous nephropathy (AMN) is a rare kidney disease. The genetics of AMN have been partially elucidated, confirming the role of phospholipase A2 receptor-1 (PLA2R1) and HLA. The functional effect of the genetic variations is not fully understood. This thesis investigates these unexplored genetic aspects using a range of methodologies and unique cohorts. Analysis of genomic sequencing data of PLA2R1 in 335 AMN patients identified 109 strongly associated variants, 9 of them with a very strong association (p-value < 10^-50). In a larger cohort of 1158 European AMN patients, the findings from previous GWAS were confirmed, with a strong association with HLA-DQA1, HLA-DRB1 and PLA2R1. No associations were found on a genome-wide scale with clinical correlates of disease such as proteinuria, sex, and age. HLA typing by imputation in 372 anti-PLA2R1 antibody positive AMN patients and, uniquely, 32 anti-thrombospondin type-1 domain-containing 7A (THSD7A) antibody positive AMN patients confirmed the dominant HLA types in European AMN as HLA-DRB1*03:01 and HLA-DQA1*05:01, replicating previous studies. No statistically significant HLA type was identified for anti-THSD7A AMN. Anti-PLA2R1 AMN has a different genetic risk from anti-THSD7A and anti-contactin AMN, as determined by the genetic risk score (GRS), and this can help differentiate between them. Interestingly, 33% of dual antibody negative AMN is likely to be anti-PLA2R1 AMN. AMN patients with a higher genetic risk have a younger age of onset. In a rare, previously undescribed cohort of 15 non-familial paediatric cases of AMN, the GRS showed that these individuals did not have the same genetic risk factors as anti-PLA2R1 AMN. Finally, the genetic risk of AMN in UK Biobank Europeans is 0.8%. Even though there is a high genetic risk for AMN, this does not mean that this proportion of individuals will develop AMN. 
In conclusion, this thesis highlights important differences between antibody status groups, confirms previous GWAS findings, and reports unique features of rare AMN cohorts.

    Bioinformatics study for characterizing the human genome based on structural variation

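    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Agricultural Biotechnology, August 2014. Heebal Kim. Over the past few years, efforts to study disease-related structural variation in the genome (single nucleotide polymorphisms and copy number variations) have continued. A single nucleotide polymorphism is a difference of a single base in the DNA sequence relative to the reference genome, and a copy number variation is a structural variation spanning 1,000 or more bases. Genome-wide association analysis is widely used to find candidate genes that link structural variation in the genome to disease. Data mining is important for gaining insight into large volumes of complex information. Biological networks help researchers find semantic answers to complex problems from such information. The goal of this thesis is therefore to find genetic variants associated with liver disease in Koreans, to construct biological networks that bear on liver function or ethnic differences and use them to find semantic answers, and to build a visualization tool for structural variation in the genome. Chapter 1 describes copy number variation, genome-wide association analysis, and biological networks: 1) an overview of copy number variation, its sources, and its detection methods, together with research trends and its role in disease; 2) an overview and background of genome-wide association analysis, with a summary of methods and results; 3) an overview of biological networks and research trends. Chapter 2 presents a meta-association analysis of liver traits and copy number variations in Koreans. In the KARE1 part, 1) a total of 10,162 copy number variations were found in 8,842 Koreans, and 2) univariate linear regression was performed to examine the effect of copy number variation on liver traits. As a result, 100 and 16 variations were significant for AST and ALT, respectively. 3) Thirty-nine genes were located in the regions of the significant copy number variations, and 4) functional classification analysis of these genes supported them as liver-related candidate genes. The KARE2 part is a replication association analysis of the KARE1 part. 1) A total of 3,046 copy number variations were found in 407 Koreans, and 2) association analysis between copy number variation and liver traits was performed using univariate linear regression. As a result, 32 variations (140 genes) and 42 variations (172 genes) were significant for AST and ALT, respectively. 3) The replication analysis yielded a total of 9 genes significantly associated with copy number variation and liver traits in Koreans. Chapter 3 constructs copy-number-related biological networks that represent liver function and ethnic differences. The nodes consist of genes, diseases, pathways, chemicals, drugs, clinical information, variants, and so on, and the edges consist of gene-disease, gene-variant, gene-chemical, pathway-disease, pathway-chemical, chemical-drug, disease-clinical information, clinical information-drug, and so on. Biological network analysis revealed 4 diseases, 1 pathway, and 7 drugs related to liver-function copy number variation in Koreans, and 3 diseases, 1 drug, and 5 pathways related to ethnic-difference copy number variation. Chapter 4 builds a tool for visualizing copy number variations and single nucleotide polymorphisms. It consists of six menus: 1) enrichment testing of elements at the positions of copy number variations or single nucleotide polymorphisms, 2) the distribution of variant positions on chromosomes, 3) the log2 ratio distribution, 4) the distribution of variants per binning unit, 5) the homozygosity distribution, and 6) cytomapping visualization. The tool makes it easier to understand the biological meaning of variants expressed as numerical values, and it can be used without any installation or download. Through genome-wide association analysis, strong candidate genes related to copy number variation and liver traits in Koreans were found, and semantic biological networks related to liver disease and ethnic-difference copy number variation were constructed. 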
In addition, a comprehensive tool was developed for visualizing the variants accumulated through various copy number variation studies. These networks and the visualization tool make it possible to discover semantic biological meaning in disease- or ethnicity-related copy number variation, and the visualization tool aids biological interpretation of copy number variations expressed as numerical values. Over the past few years, efforts to investigate the effects of copy number variations (CNVs) in human disease have continued. Genetic differences are attributable in part to large-scale structural variations between individuals. A CNV is a form of structural variation defined as a DNA segment ≥ 1 kb in size that differs in copy number when compared to a reference genome. CNVs have therefore been used to identify factors associated with susceptibility and resistance to disease. Genome-wide association studies (GWAS) have been used to investigate novel candidate genes associated with complex traits. Many studies have reported associations between SNPs or CNVs and complex diseases. Several GWA studies have also been applied to personalized medicine. Data mining provides important insight into large, complex data. Semantic networks give researchers informed answers to complex questions by integrating the available data. This thesis therefore aims to identify genetic variation associated with liver disease in Koreans, to construct biological networks for understanding the semantic knowledge about liver function or ethnic disparities, and to develop a visualization tool that explains the biological meaning of CNVs or SNPs. Chapter 1 summarizes the general background of CNV, GWAS, and biological networks. First, for CNV, it presents a general overview, mechanisms and sources, identification methods, research in humans, and associations with complex diseases. Second, for GWAS, it presents a general overview, biological background, methods, findings, clinical applications, and limitations. Third, for biological networks, it presents a general overview and biological network systems. 
Chapter 2 comprises two parts (KARE1 and KARE2) that together form a replication study of genome-wide association (GWA) for the hepatic biochemical markers AST and ALT in Korean cohorts. In KARE1, the analysis of CNVs in 8,842 Koreans revealed thirty-nine genes associated with the hepatic biochemical markers AST (aspartate aminotransferase) and/or ALT (alanine aminotransferase). I genotyped all samples on Affymetrix Genome-Wide Human 5.0 arrays and identified 10,162 CNVs using HelixTree software (ver. 7.0). To assess the impact of CNVs on each quantitative trait (AST or ALT), univariate linear regression was performed. As a result, 100 CNVs were significant for AST and 16 were significant for ALT at the 5% significance level. I identified thirty-nine genes located within the significant CNV regions. According to functional annotation with the DAVID tool, the CNV-based genes are likely to be associated with liver diseases. In KARE2, GWA for hepatic biomarkers was investigated in a cohort of 407 Koreans. All samples were genotyped on the Affymetrix Genome-Wide Human 6.0 array, and CNVs were identified using HelixTree software. In univariate linear regression, 32 and 42 CNVs were significant for AST and ALT, respectively (p-value < 0.05). For the replication study of GWA for hepatic biomarkers, CNV-based genes from KARE1 (AST-1885, ALT-773) and KARE2 (AST-140, ALT-172) were compared using NetBox software. As a result, nine genes (CIDEB, DFFA, PSMA3, PSMC5, PSMC6, PSMD12, PSMF1, SDC4, and SIAH1) overlapped for AST, whereas no overlapping genes were found for ALT. Structural variation analysis of CNV-based genes is useful for understanding biological phenotypes or diseases. In chapter 3, to extract biological meaning from complex big data, two biological networks were constructed, one for liver function and one for ethnic disparities, using BioXM software. 
These semantic networks contained entities (Gene, Disease, Pathway, Chemical, Drug, SNP, CNV, ClinicalTrials, GO, and SomaticMutation) and relationships between two entities (Gene-GO, Gene-Pathway, Gene-Disease, Gene-Chemical, Gene-SNP, Gene-CNV, Gene-SomaticMutation, Pathway-Chemical, Pathway-Disease, Chemical-Drug, ClinicalTrials-Disease, and ClinicalTrials-Drug). Applying the semantic liver-function network to the KARE2 data yielded three clusters, comprising four diseases, one pathway, and seven drugs. The ethnic disparities network was constructed using ethnic-specific SNP-based genes. Ethnic-specific SNPs were identified by eliminating the SNPs that overlapped among HapMap samples, and the SNP-based genes were mapped to the UCSC RefGene lists (ver. hg18). As a result, 22, 25, and 332 ethnic-specific genes were identified in the CEU (USA), JPT (Japan), and YRI (Africa) individuals, respectively. Applying the ethnic disparities network gave results in three categories, comprising three diseases, one drug, and five pathways. The majority of these findings were consistent with previous studies showing that genetic variability helps explain ethnic disparities. In chapter 4, the VCS (Visualization of CNVs or SNPs) tool was constructed to visualize CNVs or SNPs detected in animals such as mammals, vertebrates, insects, and worms. VCS makes it easy to interpret biological meaning from the numerical values of CNVs or SNPs. VCS provides six visualization tools: (ⅰ) the enrichment of genome contents in CNV regions; (ⅱ) the physical distribution of CNVs or SNPs on chromosomes; (ⅲ) the distribution of the log2 ratio of CNVs under criteria of interest; (ⅳ) the number distribution of CNVs or SNPs per binning unit (10 kb, 100 kb, 1 Mb, and 10 Mb); (ⅴ) the homozygosity distribution of SNP genotypes on chromosomes; and (ⅵ) a cytomap of genes within CNVs or SNPs. 
GWAS analysis between CNVs and the hepatic biochemical markers AST and ALT yielded considerable biological insight into liver diseases in Korean cohorts. The semantic biological networks for liver function and ethnic disparities likewise produced informative findings. Finally, the VCS tool supports the interpretation of biological meaning from numerical values through graphical views and produces figures that can be inserted directly into a study. In summary, in this thesis I carried out a replication study of GWA for the hepatic biomarkers AST and ALT (Chapter 2), constructed semantic biological networks for liver function and ethnic disparities (Chapter 3), and developed the VCS web tool to visualize CNVs or SNPs (Chapter 4).
Contents: Abstract; Contents; List of Tables; List of Figures; General Introduction; Chapter 1. Literature Review; 1.1 Copy Number Variation (CNV); 1.2 Genome-Wide Association Study (GWAS); 1.3 Biological Network; Chapter 2. A Replication Study of GWA between CNVs and Hepatic Biomarkers AST or ALT in Korean Cohorts; 2.1 Abstract; 2.2 Introduction; 2.3 Materials and Methods; 2.4 Results; 2.5 Discussion; Chapter 3. Biological Networks to Identify Knowledgeable Meanings for Liver Functions or Ethnic Disparities; 3.1 Abstract; 3.2 Introduction; 3.3 Materials and Methods; 3.4 Results; 3.5 Discussion; Chapter 4. VCS: Tool for Visualizing Copy Number Variation and Single Nucleotide Polymorphism; 4.1 Abstract; 4.2 Introduction; 4.3 Program Overview; 4.4 Implementation; General Discussion; References; Supplementary Materials; Abstract (Korean)
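
The abstract above tests each CNV against a quantitative trait (AST or ALT) with univariate linear regression. The following is a minimal sketch of that kind of fit in plain Python, assuming ordinary least squares with one predictor. The thesis itself used HelixTree, whose exact model is not described here, so the function name, the toy copy-number values, and the trait values are all hypothetical.

```python
import math


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, t_statistic). The t statistic for the slope
    would be compared against a t distribution with n - 2 degrees of
    freedom to obtain a p-value; that step is left out of this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    stderr = math.sqrt(residual_ss / (n - 2) / sxx)
    t_statistic = slope / stderr if stderr > 0 else math.inf
    return slope, intercept, t_statistic


# Hypothetical data: copy number at one CNV and an AST-like trait value
# for eight individuals (values invented for illustration only).
copy_number = [1, 2, 2, 3, 3, 2, 1, 3]
trait = [21.0, 25.5, 24.0, 30.5, 29.0, 26.0, 20.5, 31.0]

slope, intercept, t_statistic = simple_linear_regression(copy_number, trait)
```

In a genome-wide scan this fit would be repeated once per CNV, which is why the number of tests, and therefore the multiple-testing burden, grows with the number of variants.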

    Simultaneous coherent structure coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity

    The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we obtain a set of orthogonal coordinates along which dissimilarity in the dataset is maximized from a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure, lack the interpretability of the present method, or in which the underlying processes are less accessible, such as genomics and neuroscience
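
The abstract above describes a binary split driven by a generalized eigenvalue problem built from pairwise dissimilarities. The sketch below shows one such split under stated assumptions and is not the authors' sCSC implementation: it assumes the graph-Laplacian form L x = λ D x, with the dissimilarity matrix as edge weights, and takes the eigenvector of the largest eigenvalue, found here by plain power iteration. The function name, the 1-D toy points, and the sign-based labelling are illustrative choices.

```python
import math


def dissimilarity_split(points, iterations=200):
    """Split 1-D points into two groups so that dissimilar points separate.

    Assumed formulation: A[i][j] = |p_i - p_j| are edge weights, D is the
    diagonal degree matrix, and L = D - A. The eigenvector of L x = lambda D x
    with the largest lambda is found via the equivalent symmetric problem
    M v = lambda v, with M = D^(-1/2) L D^(-1/2) and x = D^(-1/2) v.
    Requires at least two distinct points so that every degree is positive.
    """
    n = len(points)
    A = [[abs(p - q) for q in points] for p in points]
    deg = [sum(row) for row in A]
    M = [
        [(1.0 if i == j else 0.0) - A[i][j] / math.sqrt(deg[i] * deg[j])
         for j in range(n)]
        for i in range(n)
    ]
    # Power iteration: M is positive semidefinite, so repeated multiplication
    # converges to the eigenvector of its largest eigenvalue.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    x = [v[i] / math.sqrt(deg[i]) for i in range(n)]
    # The sign of each coordinate assigns the point to one side of the split.
    return [0 if xi < 0 else 1 for xi in x]


# Hypothetical toy data: two well-separated groups on a line.
labels = dissimilarity_split([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
```

Applying the same split recursively to each resulting group would produce a binary tree of the kind the abstract describes. How the method decides when to stop splitting is not stated in the abstract and is not attempted here.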

    Understanding Optimisation Processes with Biologically-Inspired Visualisations

    Evolutionary algorithms (EAs) constitute a branch of artificial intelligence used to evolve solutions to the optimisation problems that abound in industry and research. EAs often generate many solutions, and visualisation has been a primary strategy for displaying them, given that visualisation is a well-evaluated, multi-domain medium for comprehending extensive data. Visualising solutions carries inherent challenges arising from high-dimensional phenomena and the large number of solutions to display. Recently, scholars have produced methods to mitigate some of these known issues when illustrating solutions. However, one key consideration is that displaying only the final subset of solutions (rather than the whole population) discards most of the informativeness of the search, giving inadequate insight into the black-box EA. There is a clear knowledge gap and a need for methods that can visualise the whole population of solutions from an optimiser and overcome the high-dimensionality and scaling issues to make the EA search process interpretable. Furthermore, the evolutionary computing community has called for explainability in evolutionary computing, which could take the form of visualisations that support EA comprehension, much as explainable artificial intelligence has supported artificial intelligence. In this thesis, we report novel visualisation methods that can be used to visualise large and high-dimensional optimiser populations with the aim of creating greater interpretability during a search. We consider the nascent intersection of visualisation and explainability in evolutionary computing. 
The potential high informativeness of a visualisation method from an early chapter of this work forms an effective platform to develop an explainability visualisation method, namely the population dynamics plot, to attempt to inject explainability into the inner workings of the search process. We further support the visualisation of populations using machine learning to construct models which can capture the characteristics of an EA search and develop intelligent visualisations which use artificial intelligence to potentially enhance and support visualisation for a more informative search process. The methods developed in this thesis are evaluated both quantitatively and qualitatively. We use multi-feature benchmark problems to show the method’s ability to reveal specific problem characteristics such as disconnected fronts, local optima and bias, as well as potentially creating a better understanding of the problem landscape and optimiser search for evaluating and comparing algorithm performance (we show the visualisation method to be more insightful than conventional metrics like hypervolume alone). One of the most insightful methods developed in this thesis can produce a visualisation requiring less than 1% of the time and memory necessary to produce a visualisation of the same objective space solutions using existing methods. This allows for greater scalability and the use in short compile time applications such as online visualisations. 
Building on an existing visualisation method in this thesis, we then develop an explainability method and apply it to a real-world problem. The evaluation shows the method to be highly effective at explaining the search via solutions in the objective spaces, solution lineage, and solution variation operators, so that the search of an optimiser can be compactly comprehended, evaluated, and communicated. We note, however, that the explainability properties are evaluated only against the author’s ability and could be evaluated further in future work with a usability study. The work is then supported by the development of intelligent visualisation models that may allow one to predict solutions in optima (importantly, local optima) in unseen problems by using a machine learning model. The results are effective, with some models able to predict and visualise solution optima with a balanced F1 accuracy metric of 96%. The results of this thesis provide a suite of visualisations that aims to offer greater informativeness of the search and greater scalability than the existing literature. The work develops one of the first explainability methods aiming to create greater insight into the search space, solution lineage, and reproductive operators. The work applies machine learning to potentially enhance EA understanding via visualisation. These models could also be used for a number of applications outside visualisation. Ultimately, the work provides novel methods for all EA stakeholders that aim to support the understanding, evaluation, and communication of EA processes with visualisation.

    Multi-omics approaches to sickle cell disease heterogeneity

    Sickle cell disease is a disease caused by a single mutation in the beta-globin gene. Disease-related complications manifest at the genetic, epigenetic, transcriptional, and metabolic levels. Integrative approaches based on high-throughput sequencing technologies make it possible to understand the pathological mechanism and to discover therapies for the disease. In this thesis, I integrate various omics datasets and apply statistical methods to develop new hypotheses and analyse the data. In the first two studies, I combine the results of genome-wide association studies of fetal hemoglobin (HbF) and dehydrated dense red blood cells (DRBC) with gene expression, chromatin interaction, disease-related databases, and expert-selected drug targets. This integrative approach revealed three new loci on chromosome 10 (BICC1), chromosome 19 (KLF1), and chromosome 22 (CECR2) as regulators of HbF. For the study of red blood cell density, four drug targets (BCL6, LRRC32, KNCJ14, and LETM1) were identified as potential modulators of severity. In the third study, I integrate metabolomics with genomics to establish a causal relationship between L-glutamine and pain crises using Mendelian randomization. In addition, we identified 66 biomarkers for 6 sickle cell disease-related complications and estimated glomerular filtration rate (eGFR). Finally, in the last study I applied a clustering approach to metabolites, which I then combined with genotype data. I discovered metabolomic changes highlighting families of metabolites involved in renal and hepatic dysfunction, in addition to confirming the role of a class of fatty acids in red blood cell sickling. 
This work highlights the importance of multi-omics approaches for discovering new biological mechanisms and studying human diseases. Sickle cell disease is a monogenic disorder caused by a point mutation in the beta-globin gene. The complications related to the disease are characterized by a broad spectrum of distinct genetic, epigenetic, transcriptional, and metabolomic states. Integrative approaches that use high-throughput technologies are crucial to understanding the mechanisms of these complications and uncovering therapeutic interventions. In this thesis, I integrate various omics datasets and apply statistical methods to derive new hypotheses and analyze data. I combine the results of genome-wide association studies of fetal hemoglobin (HbF) and dehydrated dense red blood cells (DRBC) with gene expression, chromatin interaction, disease-relevant databases, and expert-curated drug targets. This integrative approach revealed three novel loci on chromosome 10 (BICC1), chromosome 19 (KLF1) and chromosome 22 (CECR2) as key modulators of HbF. For DRBC, four drug targets (BCL6, LRRC32, KNCJ14, and LETM1) were identified as potential severity modifiers. Using Mendelian randomization, I integrated metabolomics with genomics in the third study to establish a potential causal relationship between L-glutamine and painful crisis. Additionally, we identified 66 biomarkers for 6 SCD-related complications and estimated glomerular filtration rate (eGFR). Finally, the last study applied a clustering framework to metabolites, which I then combined with genotypes. I found specific metabolomic changes highlighting families of metabolites involved in renal and liver dysfunction and confirming the role of a class of fatty acids in red blood cell sickling. This work highlights the importance of multi-omics approaches to unearth new biology and study human diseases.