Search CORE

65 research outputs found

Sekventiaalisen tiedon louhinta : segmenttirakenteita etsimässä

Author: Haiminen Niina
Publication venue: 'University of Helsinki Libraries'
Publication date: 08/04/2008

Segmentation is a data mining technique yielding simplified representations of sequences of ordered points. A sequence is divided into some number of homogeneous blocks, and all points within a segment are described by a single value. The focus in this thesis is on piecewise-constant segments, where the most likely description for each segment and the most likely segmentation into some number of blocks can be computed efficiently. Representing sequences as segmentations is useful in, e.g., storage and indexing tasks in sequence databases, and segmentation can be used as a tool in learning about the structure of a given sequence. The discussion in this thesis begins with basic questions related to segmentation analysis, such as choosing the number of segments, and evaluating the obtained segmentations. Standard model selection techniques are shown to perform well for the sequence segmentation task. Segmentation evaluation is proposed with respect to a known segmentation structure. Applying segmentation on certain features of a sequence is shown to yield segmentations that are significantly close to the known underlying structure. Two extensions to the basic segmentation framework are introduced: unimodal segmentation and basis segmentation. The former is concerned with segmentations where the segment descriptions first increase and then decrease, and the latter with the interplay between different dimensions and segments in the sequence. These problems are formally defined and algorithms for solving them are provided and analyzed. Practical applications for segmentation techniques include time series and data stream analysis, text analysis, and biological sequence analysis. In this thesis segmentation applications are demonstrated in analyzing genomic sequences.

Finnish-language abstract (in English translation): Segmentation is a method used in data mining for producing simple descriptions of a sequence consisting of an ordered series of points. The points can be either one- or multidimensional. In segmentation, the sequence is divided into a given number of homogeneous regions, segments, and the points within each region are described by a single value. The thesis focuses on finding piecewise-constant segment structures. For such structures, the best description of each segment and the best division of the whole sequence into segments can be computed efficiently. Modeling data by segmentation is useful, e.g., when data are stored and indexed in sequence databases, and when one wants to learn more about the overall structure of a given sequence. The thesis first addresses basic questions related to segmentation: choosing the number of segments and evaluating segmentation results. Existing model selection methods are shown to be well suited to choosing the number of segments. The evaluation of segmentations is considered with respect to a known segment structure. It can be shown that segmenting a sequence with respect to certain of its features yields segmentations whose similarity to the known structure is significant. Two extensions to the traditional segmentation framework are introduced: unimodal segmentation and basis segmentation. In unimodal segmentation, the segment descriptions take values that first increase and then decrease. Basis segmentation, in turn, models the relationships between the segments and the different dimensions of the sequence. The thesis defines these two new segmentation problems. In addition, computational methods, algorithms, for solving them are both given and analyzed. In practice, segmentation methods are applied in the analysis of, e.g., time series, data streams, text, and biological sequences. The thesis discusses, by way of example, the application of segmentation to the analysis of genomic sequences.
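
The efficient computation of an optimal piecewise-constant segmentation mentioned in the abstract can be illustrated with the classic dynamic program for least-squares k-segmentation. The following is a minimal sketch, not code from the thesis: it assumes one-dimensional points, squared error as the cost, and the segment mean as the single describing value. The function name `segment` is hypothetical.

```python
from itertools import accumulate


def segment(points, k):
    """Optimal least-squares split of a 1-D sequence into k constant blocks.

    Returns (bounds, means, error), where bounds is a list of half-open
    index ranges (start, end). Illustrative sketch only.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Prefix sums let us evaluate the squared error of any block in O(1).
    s1 = [0.0] + list(accumulate(points))
    s2 = [0.0] + list(accumulate(x * x for x in points))

    def cost(i, j):
        """Squared error of describing points[i:j] by their mean."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal error of splitting points[:j] into m blocks.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                candidate = best[m - 1][i] + cost(i, j)
                if candidate < best[m][j]:
                    best[m][j], back[m][j] = candidate, i

    # Walk the back-pointers from the end to recover the block boundaries.
    bounds, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    means = [(s1[b] - s1[a]) / (b - a) for a, b in bounds]
    return bounds, means, best[k][n]
```

For example, `segment([1, 1, 1, 5, 5, 5, 2, 2], 3)` places the block boundaries at indices 3 and 6. The triple loop takes O(k n^2) time, the textbook bound for this dynamic program.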

Helsingin yliopiston digitaalinen arkisto

Randomization techniques for assessing the significance of gene periodicity results

Author: Haiminen Niina
Kallio Aleksi
Mannila Heikki
Ojala Markus
Vuokko Niko
Publication date: 01/01/2011

Peer reviewed

PubMed Central

Helsingin yliopiston digitaalinen arkisto

The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color

Author: Cornejo Omar
Findley Seth
Haiminen Niina
Livingstone Donald, III
Mockaitis Keithanne
Motamayor Juan C
Royaert Stefan
Saski Christopher
Schmutz Jeremy
Utro Filippo
Zheng Ping
Publication venue: Clemson University Libraries
Publication date: 01/06/2013

Background: Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders. Results: We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than that of a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 of Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation. Conclusions: We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits.
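
The contig and scaffold N50 values quoted above follow the usual definition: the length L such that sequences of length at least L together contain at least half of all assembled bases. A small sketch of that definition follows. The function name `n50` is hypothetical and the lengths in the usage note are made up, not taken from the Matina 1-6 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Add sequences from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

With the toy lengths `[2, 3, 4, 5, 6, 7, 8, 9, 10]` the total is 54, and the three longest sequences (10, 9, 8) already cover 27 bases, so the N50 is 8.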

Clemson University: TigerPrints

Human Skin, Oral, and Gut Microbiomes Predict Chronological Age.

Author: Allaband Celeste
Belda-Ferre Pedro
Carrieri Anna-Paola
Haiminen Niina
Hu Rebecca
Huang Shi
Jiang Lingjing
Kim Ho-Cheol
Knight Rob
Parida Laxmi
Russell Baylee
Swafford Austin D
Vázquez-Baeza Yoshiki
Xu Zhenjiang Zech
Zarrinpar Amir
Zhou Hongwei
Publication venue: eScholarship, University of California
Publication date: 11/02/2020

Human gut microbiomes are known to change with age, yet the relative value of human microbiomes across the body as predictors of age, and prediction robustness across populations is unknown. In this study, we tested the ability of the oral, gut, and skin (hand and forehead) microbiomes to predict age in adults using random forest regression on data combined from multiple publicly available studies, evaluating the models in each cohort individually. Intriguingly, the skin microbiome provides the best prediction of age (mean ± standard deviation, 3.8 ± 0.45 years, versus 4.5 ± 0.14 years for the oral microbiome and 11.5 ± 0.12 years for the gut microbiome). This also agrees with forensic studies showing that the skin microbiome predicts postmortem interval better than microbiomes from other body sites. Age prediction models constructed from the hand microbiome generalized to the forehead and vice versa, across cohorts, and results from the gut microbiome generalized across multiple cohorts (United States, United Kingdom, and China). Interestingly, taxa enriched in young individuals (18 to 30 years) tend to be more abundant and more prevalent than taxa enriched in elderly individuals (>60 years), suggesting a model in which physiological aging occurs concomitantly with the loss of key taxa over a lifetime, enabling potential microbiome-targeted therapeutic strategies to prevent aging. IMPORTANCE: Considerable evidence suggests that the gut microbiome changes with age or even accelerates aging in adults. Whether the age-related changes in the gut microbiome are more or less prominent than those for other body sites and whether predictions can be made about a person's age from a microbiome sample remain unknown. We therefore combined several large studies from different countries to determine which body site's microbiome could most accurately predict age. We found that the skin was the best, on average yielding predictions within 4 years of chronological age. This study sets the stage for future research on the role of the microbiome in accelerating or decelerating the aging process and in the susceptibility for age-related diseases.

eScholarship - University of California

Transcriptome characterization and differentially expressed genes under flooding and drought stress in the biomass grasses Phalaris arundinacea and Dactylis glomerata

Author: Arojju Sai Krishna
Barth Susanne
Cormican Paul
Finnan John
Grant Jim
Haiminen Niina
Klaas Manfred
Parida Laxmi
Utro Filippo
Vellani Tia
Publication venue: 'Oxford University Press (OUP)'
Publication date: 26/06/2019

Peer reviewed

Background and Aims: Perennial grasses are a global resource as forage, and for alternative uses in bioenergy and as raw materials for the processing industry. Marginal lands can be valuable for perennial biomass grass production, if perennial biomass grasses can cope with adverse abiotic environmental stresses such as drought and waterlogging. Methods: In this study, two perennial grass species, reed canary grass (Phalaris arundinacea) and cocksfoot (Dactylis glomerata) were subjected to drought and waterlogging stress to study their responses for insights to improving environmental stress tolerance. Physiological responses were recorded, reference transcriptomes established and differential gene expression investigated between control and stress conditions. We applied a robust non-parametric method, RoDEO, based on rank ordering of transcripts to investigate differential gene expression. Furthermore, we extended and validated vRoDEO for comparing samples with varying sequencing depths. Key Results: This allowed us to identify expressed genes under drought and waterlogging whilst using only a limited number of RNA sequencing experiments. Validating the methodology, several differentially expressed candidate genes involved in the stage 3 step-wise scheme in detoxification and degradation of xenobiotics were recovered, while several novel stress-related genes classified as of unknown function were discovered. Conclusions: Reed canary grass is a species coping particularly well with flooding conditions, but this study adds novel information on how its transcriptome reacts under drought stress. We built extensive transcriptomes for the two investigated C3 species cocksfoot and reed canary grass under both extremes of water stress to provide a clear comparison amongst the two species to broaden our horizon for comparative studies, but further confirmation of the data would be ideal to obtain a more detailed picture.

FP7 grant GrassMargin

T-Stór

Determining significance of pairwise co-occurrences of events in bursty sequences

Author: A Sandelin
Evimaria Terzi
GD Stormo
H Chen
H Klein
H Mannila
Heikki Mannila
International Human Genome Sequencing Consortium
K Rateitschak
LJ Wood
M Blanchette
M Decoville
M Stepanova
Niina Haiminen
NK Mukhopadhyay
S Hannenhalli
S Levy
V Matys
VJ Makeev
Y Benjamini
Y Quan
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Event sequences where different types of events often occur close together arise, e.g., when studying potential transcription factor binding sites (TFBS, events) of certain transcription factors (TF, types) in a DNA sequence. These events tend to occur in bursts: in some genomic regions there are more genes and therefore potentially more binding sites, while in some, possibly very long regions, hardly any events occur. Also some types of events may occur in the sequence more often than others. Tendencies of co-occurrence of binding sites of two or more TFs are interesting, as they may imply a co-operative role between the TFs in regulatory processes. Determining a numerical value to summarize the tendency for co-occurrence between two TFs can be done in a number of ways. However, testing for the significance of such values should be done with respect to a relevant null model that takes into account the global sequence structure. Results: We extend the existing techniques that have been considered for determining the significance of co-occurrence patterns between a pair of event types under different null models. These models range from very simple ones to more complex models that take the burstiness of sequences into account. We evaluate the models and techniques on synthetic event sequences, and on real data consisting of potential transcription factor binding sites. Conclusion: We show that simple null models are poorly suited for bursty data, and they yield many false positives. More sophisticated models give better results in our experiments. We also demonstrate the effect of the window size, i.e., maximum co-occurrence distance, on the significance results.
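
To make the idea of testing a co-occurrence score against a null model concrete, here is a hedged sketch of the simplest kind of model the abstract refers to: events of one type are re-placed uniformly at random and the observed windowed co-occurrence count is compared with the randomized counts. This is an illustration under stated assumptions, not the paper's method. The function names, the inclusive window convention, and the uniform placement are all assumptions, and the abstract itself reports that such simple models give many false positives on bursty data.

```python
import random
from bisect import bisect_left, bisect_right


def cooccurrences(a_pos, b_pos, w):
    """Count (a, b) pairs whose positions differ by at most w."""
    b_sorted = sorted(b_pos)
    return sum(
        bisect_right(b_sorted, a + w) - bisect_left(b_sorted, a - w)
        for a in a_pos
    )


def uniform_null_pvalue(a_pos, b_pos, w, length, trials=1000, seed=0):
    """Empirical p-value under a uniform-placement null (assumed model).

    Type-B events are redrawn uniformly from range(length); type-A events
    stay fixed. Burstiness is deliberately ignored in this simple model.
    """
    rng = random.Random(seed)
    observed = cooccurrences(a_pos, b_pos, w)
    hits = 0
    for _ in range(trials):
        redrawn = [rng.randrange(length) for _ in b_pos]
        if cooccurrences(a_pos, redrawn, w) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (trials + 1)
```

The window `w` plays the role of the maximum co-occurrence distance mentioned in the abstract; varying it changes both the observed count and the randomized counts.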

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Efficient computation of Faith's phylogenetic diversity with applications in characterizing microbiomes

Author: Armstrong George
Beck Kristen L.
Cantrell Kalen
Carrieri Anna Paola
Gonzalez Antonio
Haiminen Niina
Hakim Daniel
Havulinna Aki S.
Huang Shi
Inouye Michael
Jain Mohit
Kim Ho-Cheol
Knight Rob
Lahti Leo
McDonald Daniel
McGrath Imran
Meric Guillaume
Niiranen Teemu
Parida Laxmi
Salomaa Veikko
Swafford Austin D.
Vazquez-Baeza Yoshiki
Zhu Qiyun
Publication date: 01/11/2021

The number of publicly available microbiome samples is continually growing. As data set size increases, bottlenecks arise in standard analytical pipelines. Faith's phylogenetic diversity (Faith's PD) is a highly utilized phylogenetic alpha diversity metric that has thus far failed to effectively scale to trees with millions of vertices. Stacked Faith's phylogenetic diversity (SFPhD) enables calculation of this widely adopted diversity metric at a much larger scale by implementing a computationally efficient algorithm. The algorithm reduces the amount of computational resources required, resulting in more accessible software with a reduced carbon footprint, as compared to previous approaches. The new algorithm produces identical results to the previous method. We further demonstrate that the phylogenetic aspect of Faith's PD provides increased power in detecting diversity differences between younger and older populations in the FINRISK study's metagenomic data.

Peer reviewed
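
For readers unfamiliar with the metric, Faith's PD of a sample is the total branch length of the part of the phylogenetic tree that connects the taxa observed in that sample. The sketch below computes it directly from that definition on a toy tree. It is not the SFPhD algorithm described in the abstract, only a naive per-sample reference; the tree encoding (a child-to-parent map), the inclusion of the path to the root, and the name `faith_pd` are assumptions made for illustration.

```python
def faith_pd(parents, observed):
    """Sum the branch lengths on the paths from observed tips to the root.

    parents maps each non-root node to (parent, branch_length). Each branch
    is counted once, however many observed tips lie below it.
    """
    visited = set()
    total = 0.0
    for tip in observed:
        node = tip
        # Climb until the root or a branch that was already counted.
        while node in parents and node not in visited:
            visited.add(node)
            parent, length = parents[node]
            total += length
            node = parent
    return total


# Toy tree: root -> A (1.0), root -> X (0.5), X -> B (2.0), X -> C (3.0)
toy_tree = {
    "A": ("root", 1.0),
    "X": ("root", 0.5),
    "B": ("X", 2.0),
    "C": ("X", 3.0),
}
```

On this toy tree, a sample containing B and C has a PD of 5.5, because the shared branch above X is counted only once.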

PubMed Central

eScholarship - University of California

Helsingin yliopiston digitaalinen arkisto

Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

Author: Armstrong George
Belda-Ferre Pedro
Das Promi
Gilbert Jack A.
Gonzalez Antonio
Haiminen Niina
Havulinna Aki S.
Huang Shi
Inouye Michael
Jain Mohit
Kim Ho-Cheol
Knight Rob
Kuczynski Justin
Lahti Leo
Lejzerowicz Franck
McDonald Daniel
McGrath Imran
Meric Guillaume
Niiranen Teemu
Salomaa Veikko
Sepich-Poore Gregory D.
Shaffer Justin P.
Swafford Austin D.
Vazquez-Baeza Yoshiki
Yu Julian
Zhu Qiyun
Publication date: 01/04/2022

We introduce the operational genomic unit (OGU) method, a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome data sets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project data set and more accurate prediction of human age by the gut microbiomes of Finnish individuals included in the FINRISK 2002 cohort. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate adoption of the OGU method in future metagenomics studies. IMPORTANCE: Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. Current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution. To solve these challenges, we introduce operational genomic units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition and (ii) permitting use of phylogeny-aware tools. Our analysis of real-world data sets shows that it is advantageous over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGUs as an effective practice in metagenomic studies.

Peer reviewed

PubMed Central

eScholarship - University of California

Helsingin yliopiston digitaalinen arkisto

Apollo (Cambridge)

Evaluation of Methods for De Novo Genome Assembly from High-Throughput Sequencing Reads Reveals Dependencies That Affect the Quality of the Results

Author: Andrey Rzhetsky
CS Keith
D Hernandez
David N. Kuhn
DM Church
DR Zerbino
F Sanger
G Narzisi
Isidore Rigoutsos
J Shendure
JA Reinhardt
JC Dohm
JR Miller
JT Simpson
K Mavromatis
Laxmi Parida
MJ Chaisson
ML Metzker
Niina Haiminen
R Blakesley
R Cronn
R Li
S Altschul
S DiGuistini
S Gnerre
S Ossowski
S Rounsley
SL Salzberg
W Zhang
WR Jeck
Publication venue: Public Library of Science
Publication date: 01/01/2011

Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly large numbers of usable sequences per instrument run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origin, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the identity of the used assembly program, the average read length and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

Author: Armstrong George
Belda-Ferre Pedro
Das Promi
Gilbert Jack A.
Gonzalez Antonio
Haiminen Niina
Havulinna Aki S.
Huang Shi
Inouye Michael
Jain Mohit
Kim Ho-Cheol
Knight Rob
Kuczynski Justin
Lahti Leo
Lejzerowicz Franck
McDonald Daniel
McGrath Imran
Méric Guillaume
Niiranen Teemu
Salomaa Veikko
Sepich-Poore Gregory D.
Shaffer Justin P.
Swafford Austin D.
Vázquez-Baeza Yoshiki
Yu Julian
Zhu Qiyun
Publication venue: 'American Society for Microbiology'
Publication date: 28/10/2022

We introduce the operational genomic unit (OGU) method, a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome data sets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project data set and more accurate prediction of human age by the gut microbiomes of Finnish individuals included in the FINRISK 2002 cohort. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate adoption of the OGU method in future metagenomics studies. IMPORTANCE: Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. Current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution. To solve these challenges, we introduce operational genomic units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition and (ii) permitting use of phylogeny-aware tools. Our analysis of real-world data sets shows that it is advantageous over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGUs as an effective practice in metagenomic studies.

UTUPub