1,765 research outputs found

    Maize chlorotic mottle virus exhibits low divergence between differentiated regional sub-populations.

    Maize chlorotic mottle virus has been rapidly spreading around the globe over the past decade. The interactions of maize chlorotic mottle virus with Potyviridae viruses cause an aggressive synergistic viral condition, maize lethal necrosis, which can cause total yield loss. Maize production in sub-Saharan Africa, where it is the most important cereal, is threatened by the arrival of maize lethal necrosis. We obtained maize chlorotic mottle virus genome sequences from across East Africa and, for the first time, from Ecuador and Hawaii, and constructed a phylogeny which highlights the similarity of Chinese to African isolates, and of Ecuadorian to Hawaiian isolates. We used a measure of clustering, the adjusted Rand index, to extract region-specific SNPs and coding variation that can be used for diagnostics. The population genetics analysis we performed shows that the majority of sequence diversity is partitioned between populations, with diversity extremely low within China and East Africa.
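
The abstract above names the adjusted Rand index as its clustering measure but does not say how it was applied to individual SNPs. The following is a minimal sketch under one assumed reading: each site is scored by how well the alleles carried by the isolates agree with their regions of origin. The isolate data and variable names are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        raise ValueError("need at least two items")
    # Pairs of items that share a label in both labelings (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a label within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical isolates: region of origin, and the allele carried at two sites.
regions = ["Africa", "Africa", "Africa", "Americas", "Americas", "Americas"]
snp_region_specific = ["A", "A", "A", "G", "G", "G"]
snp_uninformative = ["A", "G", "A", "G", "A", "G"]

ari_specific = adjusted_rand_index(regions, snp_region_specific)
ari_uninformative = adjusted_rand_index(regions, snp_uninformative)
```

Under this reading, a site whose alleles split exactly along regional lines scores 1, and a site whose alleles are unrelated to region scores near or below 0, so ranking sites by this score would surface candidate diagnostic SNPs.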

    Clustering of Cases from Different Subtypes of Breast Cancer Using a Hopfield Network Built from Multi-omic Data

    Graduation thesis (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Computación, 2018. Despite scientific advances, breast cancer remains a major cause of death among women worldwide. Given the great heterogeneity between cases, distinct classification schemes have emerged. The intrinsic molecular subtype classification (luminal A, luminal B, HER2-enriched and basal-like) accounts for the molecular characteristics and prognosis of tumors, which provides valuable input for choosing optimal treatment. Also, recent advances in molecular biology have provided scientists with omic-like data of high quality and diversity, opening up the possibility of creating computational models for improving and validating current subtyping systems. In this study, a Hopfield network model for breast cancer subtyping and characterization was created using data from The Cancer Genome Atlas repository. Novel aspects include the use of the network as a clustering mechanism and the integrated use of several molecular data types (gene mRNA expression, miRNA expression and copy number variation). The results showed that the network can cluster cases, but deriving a biological model from a Hopfield network might be difficult given the mirror attractor phenomenon (every cluster might end up with an opposite). As a methodological aspect, the Hopfield network was compared with the k-means and OPTICS clustering algorithms. The latter, surprisingly, hints at the possibility of creating a high-precision model that differentiates between luminal, HER2-enriched and basal samples using only 10 genes. The normalization procedure of dividing gene expression values by their corresponding gene copy number appears to have contributed to the results. This opens up the possibility of exploring this kind of prediction model for implementing diagnostic tests at a lower cost.
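
The mirror attractor phenomenon mentioned in the abstract above can be seen in a toy Hopfield network. This is a minimal sketch, not the thesis's model: it assumes standard Hebbian training on +1/-1 vectors and asynchronous updates, and the two stored patterns are hypothetical stand-ins for binarized omic profiles.

```python
def train_hopfield(patterns):
    """Hebbian weights for a list of +1/-1 patterns (zero diagonal)."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights

def recall(weights, state, max_sweeps=20):
    """Update units one at a time until no unit changes (an attractor)."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(w * s for w, s in zip(weights[i], state))
            new_value = 1 if field >= 0 else -1
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

# Two hypothetical, mutually orthogonal prototype profiles.
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hopfield(stored)

noisy = list(stored[0])
noisy[2] = -noisy[2]                      # corrupt one feature
mirror = [-x for x in stored[0]]          # sign-flipped prototype

recovered = recall(weights, noisy)        # settles on the stored prototype
mirror_result = recall(weights, mirror)   # stays where it is: also an attractor
```

Using the network for clustering then amounts to grouping samples by the attractor they settle into. Because the energy function is unchanged when every unit is sign-flipped, each stored pattern brings its mirror image along as a second attractor, which is why every cluster can acquire an opposite.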

    A conserved BDNF, glutamate- and GABA-enriched gene module related to human depression identified by coexpression meta-analysis and DNA variant genome-wide association studies

    Large scale gene expression (transcriptome) analysis and genome-wide association studies (GWAS) for single nucleotide polymorphisms have generated a considerable amount of gene- and disease-related information, but heterogeneity and various sources of noise have limited the discovery of disease mechanisms. As systematic dataset integration is becoming essential, we developed methods and performed meta-clustering of gene coexpression links in 11 transcriptome studies from postmortem brains of human subjects with major depressive disorder (MDD) and non-psychiatric control subjects. We next sought enrichment in the top 50 meta-analyzed coexpression modules for genes otherwise identified by GWAS for various sets of disorders. One coexpression module of 88 genes was consistently and significantly associated with GWAS for MDD, other neuropsychiatric disorders and brain functions, and for medical illnesses with elevated clinical risk of depression, but not for other diseases. In support of the superior discriminative power of this novel approach, we observed no significant enrichment for GWAS-related genes in coexpression modules extracted from single studies or in meta-modules using gene expression data from non-psychiatric control subjects. Genes in the identified module encode proteins implicated in neuronal signaling and structure, including glutamate metabotropic receptors (GRM1, GRM7), GABA receptors (GABRA2, GABRA4), and neurotrophic and development-related proteins [BDNF, reelin (RELN), Ephrin receptors (EPHA3, EPHA5)]. These results are consistent with the current understanding of molecular mechanisms of MDD and provide a set of putative interacting molecular partners, potentially reflecting components of a functional module across cells and biological pathways that are synchronously recruited in MDD, other brain disorders and MDD-related illnesses. 
Collectively, this study demonstrates the importance of integrating transcriptome data, gene coexpression modules and GWAS results for providing novel and complementary approaches to investigate the molecular pathology of MDD and other complex brain disorders. © 2014 Chang et al
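
The abstract above reports enrichment of GWAS-identified genes in an 88-gene coexpression module without naming the test used. One common choice for this kind of question is a one-sided hypergeometric test, sketched here as an assumption. Only the module size of 88 comes from the abstract; the universe size, GWAS gene count, and overlap are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe, module_size, gwas_genes, overlap):
    """P(X >= overlap) when module_size genes are drawn without replacement
    from a universe that contains gwas_genes GWAS-identified genes."""
    upper = min(module_size, gwas_genes)
    total = comb(universe, module_size)
    tail = sum(
        comb(gwas_genes, k) * comb(universe - gwas_genes, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts, apart from the 88-gene module size.
p_value = hypergeom_enrichment_p(
    universe=10_000, module_size=88, gwas_genes=200, overlap=8
)
```

A small p-value would indicate that the module holds more GWAS-identified genes than random sampling of 88 genes would be expected to produce. Testing 50 modules against several disorder sets would also call for a multiple-testing correction.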

    Computational approaches for single-cell omics and multi-omics data

    Single-cell omics and multi-omics technologies have enabled the study of cellular heterogeneity with unprecedented resolution and the discovery of new cell types. Identifying heterogeneous cell types, both existing and novel, relies on efficient computational approaches, especially cluster analysis. Additionally, gene regulatory network analysis and various integrative approaches are needed to combine data across studies and different multi-omics layers. This thesis comprehensively compared Bayesian clustering models for single-cell RNA sequencing (scRNA-seq) data and used selected integrative approaches to study the cell-type-specific gene regulation of the uterus. Additionally, single-cell multi-omics data integration approaches for cell heterogeneity analysis were investigated. Article I investigated analytical approaches for cluster analysis in scRNA-seq data, particularly latent Dirichlet allocation (LDA) and hierarchical Dirichlet process (HDP) models. The comparison of LDA and HDP with existing state-of-the-art methods revealed that topic-modeling-based models can be useful in scRNA-seq cluster analysis. Evaluation of cluster quality for LDA and HDP with intrinsic and extrinsic cluster quality metrics indicated that the clustering performance of these methods is dataset dependent. Article II and Article III focused on cell-type-specific integrative analysis of uterine or decidual stromal (dS) and natural killer (dNK) cells, which are important for successful pregnancy. Article II integrated the existing preeclampsia RNA-seq studies of the decidua with recent scRNA-seq datasets in order to investigate cell-type-specific contributions of early-onset preeclampsia (EOP) and late-onset preeclampsia (LOP). It was discovered that the dS marker genes were enriched for LOP downregulated genes and the dNK marker genes were enriched for upregulated EOP genes. 
Article III presented a gene regulatory network analysis for the subpopulations of dS and dNK cells. This study identified novel subpopulation-specific transcription factors that promote decidualization of stromal cells and dNK-mediated maternal immunotolerance. In Article IV, different strategies and methodological frameworks for data integration in single-cell multi-omics data analysis were reviewed in detail. Data integration methods were grouped into early, late and intermediate data integration strategies. The specific stage and order of data integration can have a substantial effect on the results of the integrative analysis. The central details of the approaches were presented, and potential future directions were discussed.
Computational methods for the analysis of single-cell sequencing and multi-omics results. Single-cell sequencing technologies enable the study of cellular heterogeneity at unprecedented resolution and the discovery of new cell types. Grouping, or cluster analysis, plays a central role in identifying cell types. Combining gene regulatory networks and different molecular data layers is also central to the analysis. The dissertation compares Bayesian clustering methods and combines data collected with different methods in a cell-type-specific gene regulation analysis of the uterus. In addition, single-cell data integration methods are surveyed comprehensively. Publication I focuses on analytical methods, particularly models based on latent Dirichlet allocation (LDA) and the hierarchical Dirichlet process (HDP), in cluster analysis of single-cell data. A comprehensive comparison of these two models with existing methods revealed that topic-modeling-based methods can be useful in cluster analysis of single-cell data. The performance of the methods also depended on the characteristics of each dataset analyzed.
Publications II and III focus on cell-type-specific analysis of uterine stromal cells and NK immune cells, which are important for women's reproductive health. Article II combined existing results on preeclampsia with the latest single-cell sequencing results and found cell-type-specific effects of early-onset preeclampsia (EOP) and late-onset preeclampsia (LOP). Expression of the marker genes of differentiated stroma was found to decrease in LOP, and expression of NK marker genes to increase in EOP. Publication III analyzes the subpopulation-specific gene regulatory networks of stromal and NK cells and their transcription factors. The study identified new subpopulation-specific regulators that promote stromal differentiation and NK-cell-mediated immunotolerance. Publication IV examines in detail strategies and methods for integrating different single-cell data layers (multi-omics). The integration methods were grouped into early, late and intermediate strategies, and the methods of each approach were presented in more detail. In addition, possible future directions were discussed.

    Identification of Chiari Type I Malformation subtypes using whole genome expression profiles and cranial base morphometrics.

    BACKGROUND: Chiari Type I Malformation (CMI) is characterized by herniation of the cerebellar tonsils through the foramen magnum at the base of the skull, resulting in significant neurologic morbidity. As CMI patients display a high degree of clinical variability and multiple mechanisms have been proposed for tonsillar herniation, it is hypothesized that this heterogeneous disorder is due to multiple genetic and environmental factors. The purpose of the present study was to gain a better understanding of what factors contribute to this heterogeneity by using an unsupervised statistical approach to define disease subtypes within a case-only pediatric population. METHODS: A collection of forty-four pediatric CMI patients were ascertained to identify disease subtypes using whole genome expression profiles generated from patient blood and dura mater tissue samples, and radiological data consisting of posterior fossa (PF) morphometrics. Sparse k-means clustering and an extension to accommodate multiple data sources were used to cluster patients into more homogeneous groups using biological and radiological data both individually and collectively. RESULTS: All clustering analyses resulted in the significant identification of patient classes, with the pure biological classes derived from patient blood and dura mater samples demonstrating the strongest evidence. Those patient classes were further characterized by identifying enriched biological pathways, as well as correlated cranial base morphological and clinical traits. CONCLUSIONS: Our results implicate several strong biological candidates warranting further investigation from the dura expression analysis and also identified a blood gene expression profile corresponding to a global down-regulation in protein synthesis

    Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA, IDH1, EGFR, and NF1

    The Cancer Genome Atlas Network recently cataloged recurrent genomic abnormalities in glioblastoma multiforme (GBM). We describe a robust gene expression-based molecular classification of GBM into Proneural, Neural, Classical, and Mesenchymal subtypes and integrate multidimensional genomic data to establish patterns of somatic mutations and DNA copy number. Aberrations and gene expression of EGFR, NF1, and PDGFRA/IDH1 each define the Classical, Mesenchymal, and Proneural subtypes, respectively. Gene signatures of normal brain cell types show a strong relationship between subtypes and different neural lineages. Additionally, response to aggressive therapy differs by subtype, with the greatest benefit in the Classical subtype and no benefit in the Proneural subtype. We provide a framework that unifies transcriptomic and genomic dimensions for GBM molecular stratification with important implications for future studies

    Using signal processing, evolutionary computation, and machine learning to identify transposable elements in genomes

    About half of the human genome consists of transposable elements (TE's), sequences that have many copies of themselves distributed throughout the genome. All genomes, from bacterial to human, contain TE's. TE's affect genome function by either creating proteins directly or affecting genome regulation. They serve as molecular fossils, giving clues to the evolutionary history of the organism. TE's are often challenging to identify because they are fragmentary or heavily mutated. In this thesis, novel features for the detection and study of TE's are developed. These features are of two types. The first type are statistical features based on the Fourier transform used to assess reading frame use. These features measure how different the reading frame use is from that of a random sequence, which reading frames the sequence is using, and the proportion of use of the active reading frames. The second type of feature, called side effect machine (SEM) features, are generated by finite state machines augmented with counters that track the number of times the state is visited. These counters then become features of the sequence. The number of possible SEM features is super-exponential in the number of states. New methods for selecting useful feature subsets that incorporate a genetic algorithm and a novel clustering method are introduced. The features produced reveal structural characteristics of the sequences of potential interest to biologists. A detailed analysis of the genetic algorithm, its fitness functions, and its fitness landscapes is performed. The features are used, together with features used in existing exon finding algorithms, to build classifiers that distinguish TE's from other genomic sequences in humans, fruit flies, and ciliates. The classifiers achieve high accuracy (> 85%) on a variety of TE classification problems. The classifiers are used to scan large genomes for TE's. 
In addition, the features are used to describe the TE's in the newly sequenced ciliate Tetrahymena thermophila, giving biologists information for forming experimentally testable hypotheses about the role of these TE's and the mechanisms that govern them.
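
The side effect machine (SEM) features described in the abstract above can be sketched as a finite state machine driven by the bases of a sequence, with one visit counter per state. This is a minimal illustration rather than the thesis's implementation: the 3-state transition table is hypothetical, and two details are assumptions (the start state is not counted as a visit, and counts are divided by sequence length).

```python
def sem_features(sequence, transitions, start_state=0):
    """Run the machine over the sequence and return per-state visit
    frequencies; the counters are the 'side effects' used as features."""
    counts = [0] * len(transitions)
    state = start_state
    for base in sequence:
        state = transitions[state][base]
        counts[state] += 1
    length = max(len(sequence), 1)   # avoid division by zero on empty input
    return [count / length for count in counts]

# Hypothetical 3-state machine over the DNA alphabet:
# transitions[state][base] gives the next state.
transitions = [
    {"A": 1, "C": 0, "G": 2, "T": 0},
    {"A": 1, "C": 2, "G": 0, "T": 0},
    {"A": 0, "C": 1, "G": 2, "T": 1},
]

features = sem_features("AACG", transitions)
```

With n states and a 4-letter alphabet there are n^(4n) possible transition tables, which gives a sense of why the abstract describes the feature space as super-exponential and why a search procedure such as a genetic algorithm is used to pick useful machines.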

    Single-Cell Gene Expression Variation as A Cell-Type Specific Trait: A Study of Mammalian Gene Expression Using Single-Cell RNA Sequencing

    In this dissertation, we used single-cell RNA sequencing data from five mammalian tissues to characterize patterns of gene expression across single cells, transcriptome-wide and in a cell-type-specific manner (Part 1). Additionally, we characterized single-cell RNA sequencing methods as a resource for experimental design and data analysis (Part 2). Part 1: Differentiation of metazoan cells requires execution of different gene expression programs, but recent single-cell transcriptome profiling has revealed considerable variation among cells of seemingly identical phenotype. This brings into question the relationship between transcriptome states and cell phenotypes. We used high-quality single-cell RNA sequencing for 107 single cells from five mammalian tissues, along with 30 control samples, to characterize transcriptome heterogeneity across single cells. We developed methods to filter genes for reliable quantification and to calibrate biological variation. We found evidence that ubiquitous expression across cells may be indicative of critical gene function and that, for a subset of genes, biological variability within each cell type may be regulated in order to perform dynamic functions. We also found evidence that single-cell variability of mouse pyramidal neurons was correlated with that in rats, consistent with the hypothesis that levels of variation may be conserved. Part 2: Many researchers are interested in single-cell RNA sequencing for use in identification and classification of cell types, finding rare cells, and studying single-cell expression variation; however, experimental and analytic methods for single-cell RNA sequencing are young and there is little guidance available for planning experiments and interpreting results. 
We characterized single-cell RNA sequencing measurements in terms of sensitivity, precision and accuracy through analysis of data generated in a collaborative control project, where known reference RNA was diluted to single-cell levels and amplified using one of three single-cell RNA sequencing protocols. All methods perform comparably overall, but individual methods demonstrate unique strengths and biases. Measurement reliability increased with expression level for all methods and we conservatively estimated measurements to be quantitative at an expression level of ~5-10 molecules

    Clustering of sparse hypergraphs and its applications

    Doctoral thesis, Seoul National University Graduate School, College of Education, Department of Mathematics Education, August 2017. 유연주. Clustering is one of the most popular methods to extract meaningful patterns from data. In genomics, increasingly large amounts of DNA sequencing data are being generated. Developing effective clustering tools appropriate for each data structure is a major challenge. In this thesis, we develop the CLQ and CLQ-D clustering algorithms to partition SNP sequence data using graph-theoretic approaches. Based on these clustering algorithms, the LD block construction method, Big-LD, is able to detect the LD blocks of SNP sequence data using interval graph modeling of the clustering results. A sparse graph is defined as a graph in which the actual number of edges is much less than the possible number of edges. Real-world data including biological, social, and internet network data can be modeled as sparse graphs. Due to the structural characteristics of SNP data, graph models constructed for the clustering algorithm and LD block construction algorithm have a sparse structure, which facilitates the efficient operation of the algorithm in terms of time and memory usage. The Big-LD algorithm detects LD blocks including "holes", which are not allowed in the previous methods. Based on the LD block structure constructed by the Big-LD algorithm, we investigated the relationships between big LD block structure and biological phenomena using the HapMap phase 3 data and phase 1 data of the 1000 Genomes Project. The LD block boundaries detected by the Big-LD algorithm coincided better with the recombination hotspots than those of previous methods. In addition, we demonstrate that the comparison of LD block structures can provide additional information about positive selection using the results applied to the candidate regions suggested by previous research. 
By generalizing the Big-LD algorithm, which is designed to partition SNP sequence data into blocks, we suggest four clustering algorithms (PSHSC, PSHRC, PSHSQ, and PSHRQ) for the sparse hypergraph partitioning problem. Simulation experiments demonstrated that the algorithms generate high-quality partitions in terms of global and local quality measures. The partitioning results closely agreed with the true underlying cluster structures of simulated hypergraphs. We also applied the developed algorithm to the problem of predicting protein complexes in yeast protein-protein interaction network data, and confirmed its potential as a tool for clustering biological network datasets.
Contents:
I. Introduction: 1.1 Motivation; 1.2 Objectives and contributions of the thesis; 1.3 Organization of the thesis.
II. Preliminaries: 2.1 Basic graph theory; 2.2 Terminologies in genetics.
III. Clustering of genomic data: 3.1 Background; 3.2 Algorithms (3.2.1 CLQ algorithm; 3.2.2 CLQ-D: construction of LD bins; 3.2.3 Big-LD: construction of LD blocks); 3.3 Evaluation of Big-LD algorithm (3.3.1 Evaluation data; 3.3.2 Implementation and performance evaluation; 3.3.3 Comparisons of Big-LD block partition results with preexisting methods; 3.3.4 LD block and recombination hotspots; 3.3.5 Multi-SNP association experiments using the results of different block partition methods); 3.4 Discussion.
IV. Comparisons of linkage disequilibrium blocks of different populations for positive selection: 4.1 Background; 4.2 Methods (4.2.1 Previous methods: XP-EHH and CMS tests; 4.2.2 Comparison measure of LD block partitions; 4.2.3 Data); 4.3 Results; 4.4 Conclusions.
V. Sparse hypergraph partitioning: 5.1 Motivation and background; 5.2 Related works (5.2.1 Multi-level hypergraph partitioning; 5.2.2 Spectral clustering; 5.2.3 Dense subgraph partitioning; 5.2.4 k-clique clustering); 5.3 Hypergraph partitioning algorithms (5.3.1 Algorithm overview; 5.3.2 Construction of the line graph of a hypergraph; 5.3.3 Listing the dense sets of the line graph; 5.3.4 Finding the MWIS of the intersection graph); 5.4 Experiments (5.4.1 Simulations; 5.4.2 Quality measures; 5.4.3 Results); 5.5 Detecting protein complexes in PPI networks (5.5.1 Motivation; 5.5.2 Methods; 5.5.3 Results; 5.5.4 Discussion); 5.6 Conclusions.
VI. Conclusions.
Appendices: I. Appendix (A.1 Coding correction algorithm).
Reference.
Abstract (in Korean).
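
The sparse-graph definition in the abstract above (far fewer edges than the possible number) can be made concrete as an edge density. This is a minimal sketch with a hypothetical edge list. Reading vertices as SNPs and edges as SNP pairs in strong LD is an assumption suggested by the abstract, not the thesis's exact construction.

```python
from math import comb

def build_adjacency(n_vertices, edges):
    """Adjacency sets for an undirected simple graph; memory use grows
    with the number of edges rather than with n_vertices squared."""
    adjacency = {v: set() for v in range(n_vertices)}
    for u, v in edges:
        if u != v:                      # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency

def edge_density(adjacency):
    """Actual edge count divided by the number of possible edges."""
    n = len(adjacency)
    if n < 2:
        return 0.0
    n_edges = sum(len(neighbors) for neighbors in adjacency.values()) // 2
    return n_edges / comb(n, 2)

# Hypothetical graph: 6 vertices, 4 of the 15 possible edges present.
adjacency = build_adjacency(6, [(0, 1), (1, 2), (0, 2), (3, 4)])
density = edge_density(adjacency)
```

Storing neighbor sets instead of a full matrix is one simple way a sparse structure saves time and memory, which is the efficiency benefit the abstract attributes to sparsity.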