65 research outputs found

    Sekventiaalisen tiedon louhinta : segmenttirakenteita etsimässä

    Get PDF
    Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.Segmentointi on tiedon louhinnassa käytetty menetelmä, jonka avulla voidaan tuottaa yksinkertaisia kuvauksia sekvenssistä, joka koostuu järjestetystä jonosta pisteitä. Pisteet voivat olla joko yksi- tai moniulotteisia. Segmentoinnissa sekvenssi jaetaan tiettyyn määrään yhtenäisiä alueita, segmenttejä, ja kunkin alueen sisältämiä pisteitä kuvataan yhdellä arvolla. Väitöskirjassa keskitytään paloittain vakioiden segmenttirakenteiden etsintään. Tällaisille rakenteille kunkin segmentin paras kuvaus sekä koko sekvenssin paras jako segmentteihin voidaan laskea tehokkaasti. Tiedon mallintaminen segmentoinnin avulla on hyödyllistä mm. silloin kun tietoa tallennetaan ja indeksoidaan sekvenssitietokannoissa, sekä kun halutaan saada lisätietoja tietyn sekvenssin yleisrakenteesta. Väitöskirjassa käsitellään ensin segmentointiin liittyviä peruskysymyksiä, segmenttien lukumäärän valitsemista ja segmentointitulosten arviointia. Olemassa olevien mallinvalintamenetelmien näytetään soveltuvan hyvin segmenttien lukumäärän valitsemiseen. Segmentointien arviointia käsitellään suhteessa tunnettuun segmenttirakenteeseen. Voidaan näyttää, että segmentoimalla sekvenssi sen tiettyjen ominaisuuksien suhteen saadaan tulokseksi segmentointeja, joiden samankaltaisuus tunnetun rakenteen kanssa on merkitsevä. Perinteiseen segmentointikehykseen esitellään kaksi laajennosta: yksihuippuinen segmentointi ja kantasegmentointi. Yksihuippuisessa segmentoinnissa segmenttien kuvaukset saavat arvoja, jotka ensin kasvavat ja sitten vähenevät. Kantasegmentoinnissa puolestaan mallinnetaan segmenttien sekä sekvenssin eri ulottuvuuksien välisiä suhteita. Väitöskirjassa määritellään nämä kaksi uutta segmentointiongelmaa. Lisäksi sekä annetaan että analysoidaan laskennallisia menetelmiä, algoritmeja, niiden ratkaisemiseksi. Segmentointimenetelmiä sovelletaan käytännössä mm. aikasarjojen, tietovirtojen, tekstin ja biologisten sekvenssien analysoinnissa. Väitöskirjassa käsitellään esimerkinomaisesti segmentoinnin soveltamista genomisekvenssien analysoinnissa

    The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color

    Get PDF
    Background Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. Results We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. Conclusions We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits

    Transcriptome characterization and differentially expressed genes under flooding and drought stress in the biomass grasses Phalaris arundinacea and Dactylis glomerata

    Get PDF
    peer-reviewedBackground and Aims Perennial grasses are a global resource as forage, and for alternative uses in bioenergy and as raw materials for the processing industry. Marginal lands can be valuable for perennial biomass grass production, if perennial biomass grasses can cope with adverse abiotic environmental stresses such as drought and waterlogging. Methods In this study, two perennial grass species, reed canary grass (Phalaris arundinacea) and cocksfoot (Dactylis glomerata) were subjected to drought and waterlogging stress to study their responses for insights to improving environmental stress tolerance. Physiological responses were recorded, reference transcriptomes established and differential gene expression investigated between control and stress conditions. We applied a robust non-parametric method, RoDEO, based on rank ordering of transcripts to investigate differential gene expression. Furthermore, we extended and validated vRoDEO for comparing samples with varying sequencing depths. Key Results This allowed us to identify expressed genes under drought and waterlogging whilst using only a limited number of RNA sequencing experiments. Validating the methodology, several differentially expressed candidate genes involved in the stage 3 step-wise scheme in detoxification and degradation of xenobiotics were recovered, while several novel stress-related genes classified as of unknown function were discovered. Conclusions Reed canary grass is a species coping particularly well with flooding conditions, but this study adds novel information on how its transcriptome reacts under drought stress. We built extensive transcriptomes for the two investigated C3 species cocksfoot and reed canary grass under both extremes of water stress to provide a clear comparison amongst the two species to broaden our horizon for comparative studies, but further confirmation of the data would be ideal to obtain a more detailed picture.FP7 grant GrassMargin

    Determining significance of pairwise co-occurrences of events in bursty sequences

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Event sequences where different types of events often occur close together arise, e.g., when studying potential transcription factor binding sites (TFBS, events) of certain transcription factors (TF, types) in a DNA sequence. These events tend to occur in bursts: in some genomic regions there are more genes and therefore potentially more binding sites, while in some, possibly very long regions, hardly any events occur. Also some types of events may occur in the sequence more often than others.</p> <p>Tendencies of co-occurrence of binding sites of two or more TFs are interesting, as they may imply a co-operative role between the TFs in regulatory processes. Determining a numerical value to summarize the tendency for co-occurrence between two TFs can be done in a number of ways. However, testing for the significance of such values should be done with respect to a relevant null model that takes into account the global sequence structure.</p> <p>Results</p> <p>We extend the existing techniques that have been considered for determining the significance of co-occurrence patterns between a pair of event types under different null models. These models range from very simple ones to more complex models that take the burstiness of sequences into account. We evaluate the models and techniques on synthetic event sequences, and on real data consisting of potential transcription factor binding sites.</p> <p>Conclusion</p> <p>We show that simple null models are poorly suited for bursty data, and they yield many false positives. More sophisticated models give better results in our experiments. We also demonstrate the effect of the window size, i.e., maximum co-occurrence distance, on the significance results.</p

    Efficient computation of Faith's phylogenetic diversity with applications in characterizing microbiomes

    Get PDF
    The number of publicly available microbiome samples is continually growing. As data set size increases, bottlenecks arise in standard analytical pipelines. Faith's phylogenetic diversity (Faith's PD) is a highly utilized phylogenetic alpha diversity metric that has thus far failed to effectively scale to trees with millions of vertices. Stacked Faith's phylogenetic diversity (SFPhD) enables calculation of this widely adopted diversity metric at a much larger scale by implementing a computationally efficient algorithm. The algorithm reduces the amount of computational resources required, resulting in more accessible software with a reduced carbon footprint, as compared to previous approaches. The new algorithm produces identical results to the previous method. We further demonstrate that the phylogenetic aspect of Faith's PD provides increased power in detecting diversity differences between younger and older populations in the FINRISK study's metagenomic data.Peer reviewe

    Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

    Get PDF
    We introduce the operational genomic unit (OGU) method, a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome data sets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project data set and more accurate prediction of human age by the gut microbiomes of Finnish individuals included in the FINRISK 2002 cohort. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate adoption of the OGU method in future metagenomics studies. IMPORTANCE Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. Current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution. To solve these challenges, we introduce operational genomic units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition and (ii) permitting use of phylogeny-aware tools. Our analysis of real-world data sets shows that it is advantageous over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGUs as an effective practice in metagenomic studies.Peer reviewe

    Evaluation of Methods for De Novo Genome Assembly from High-Throughput Sequencing Reads Reveals Dependencies That Affect the Quality of the Results

    Get PDF
    Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origin, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the identity of the used assembly program, the average read length and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness

    Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

    Get PDF
    We introduce the operational genomic unit (OGU) method, a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome data sets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project data set and more accurate prediction of human age by the gut microbiomes of Finnish individuals included in the FINRISK 2002 cohort. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate adoption of the OGU method in future metagenomics studies.IMPORTANCE Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. Current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution. To solve these challenges, we introduce operational genomic units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition and (ii) permitting use of phylogeny-aware tools. Our analysis of real-world data sets shows that it is advantageous over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGUs as an effective practice in metagenomic studies.</p
    • …
    corecore