48 research outputs found
Algorithms for internal validation clustering measures in the post genomic era.
Inferring cluster structure in microarray datasets is a fundamental task for
the -omic sciences. A fundamental question in Statistics, Data Analysis and
Classification, is the prediction of the number of clusters in a dataset,
usually established via internal validation measures. Despite the wealth of
internal measures available in the literature, new ones have been recently
proposed, some of them specifically for microarray data. In this dissertation,
a study of internal validation measures is given, paying particular attention
to the stability-based ones. Indeed, this class of measures is particularly
prominent and promising for obtaining a reliable estimate of the number of
clusters in a dataset. For those measures, a new general algorithmic paradigm
is proposed here that highlights the richness of measures in this class and
accounts for the ones already available in the literature. Moreover, some of
the most representative validation measures are also considered. Experiments on
12 benchmark datasets are performed in order to assess both the intrinsic
ability of a measure to predict the correct number of clusters in a dataset and
its merit relative to the other measures. The main result is a hierarchy of
internal validation measures in terms of precision and speed, highlighting some
of their merits and limitations not reported before in the literature. This
hierarchy shows that the faster the measure, the less accurate it is. In order
to reduce the time performance gap between the fastest and the most precise
measures, the technique of designing fast approximation algorithms is
systematically applied. The end result is a speed-up of many of the measures
studied here that brings the gap between the fastest and the most precise
within one order of magnitude in time, with no degradation in their prediction
power. Prior to this work, the time gap was at least two orders of magnitude
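As a rough illustration of how an internal validation measure can be used to predict the number of clusters, the sketch below scores k-means partitions with the average silhouette width and picks the best-scoring k. This is only a stand-in under stated assumptions: the silhouette is a generic internal measure, not necessarily one studied in the dissertation, and it is not stability-based. The names `kmeans`, `silhouette`, `estimate_k` and the toy dataset `POINTS` are hypothetical.

```python
import math

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Average silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def estimate_k(points, k_min=2, k_max=5):
    """Return the k in [k_min, k_max] with the highest average silhouette."""
    return max(range(k_min, k_max + 1),
               key=lambda k: silhouette(points, kmeans(points, k)))

# Toy data (assumption): three tight, well-separated groups in the plane.
POINTS = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (0, 10), (1, 10), (0, 11), (1, 11)]

print(estimate_k(POINTS))
```

On real microarray data the choice of clustering algorithm, distance and measure all affect the prediction, which is the kind of trade-off between precision and speed the abstract describes.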
Essential Simplices in Persistent Homology and Subtle Admixture Detection
We introduce a robust mathematical definition of the notion of essential elements in a basis of the homology space and prove that these elements are unique. Next, we give a novel visualization of the essential elements of the basis of the homology space through a rainfall-like (RFL) plot. This plot is data-centric, i.e., it is associated with the individual samples of the data, as opposed to the structure-centric barcodes of persistent homology. The proof of concept was tested on data generated by SimRA, which simulates different admixture scenarios. We show that barcode analysis can be used not just to detect the presence of admixture but also to estimate the number of admixed populations. We also demonstrate that data-centric RFL plots have the potential to further disentangle the common history into admixture events and the relative timing of those events, even in very complex scenarios
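For readers unfamiliar with barcodes, the following minimal sketch computes the dimension-0 persistence barcode of a point cloud under a Vietoris-Rips filtration, using a union-find structure. It illustrates the structure-centric barcode only; it does not implement essential simplices, the RFL plot, or any higher-dimensional homology, and the function name `h0_barcode` and the sample points are assumptions made for illustration.

```python
import math
from itertools import combinations

def h0_barcode(points):
    """Return the H0 bars (birth, death) of a Vietoris-Rips filtration.

    Every point is born at scale 0; a connected component dies at the
    length of the edge that merges it into another component.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise edges in order of increasing length.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj            # two components merge
            bars.append((0.0, length)) # one of them dies at this scale
    bars.append((0.0, math.inf))       # the last component never dies
    return bars

# Two well-separated pairs of points on a line (toy data, assumption).
bars = h0_barcode([(0, 0), (1, 0), (10, 0), (11, 0)])
long_bars = [b for b in bars if b[1] > 5]
print(bars, len(long_bars))
```

In this toy case the number of long bars matches the number of well-separated groups, which gives some intuition for why barcodes can carry information about the number of populations.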
Sampling ARG of multiple populations under complex configurations of subdivision and admixture.
Abstract
Motivation: Simulating complex evolution scenarios of multiple populations is an important task for answering many basic questions relating to population genomics. Apart from the population samples, the underlying Ancestral Recombinations Graph (ARG) is an additional important means in hypothesis checking and reconstruction studies. Furthermore, complex simulations require a plethora of interdependent parameters making even the scenario-specification highly non-trivial.
Results: We present SimRA, an algorithm that simulates a generic multiple-population evolution model with admixture. It is based on random graphs, which dramatically improve on the time and space requirements of the classical single-population algorithm.
Using the underlying random graphs model, we also derive closed forms of expected values of the ARG characteristics i.e., height of the graph, number of recombinations, number of mutations and population diversity in terms of its defining parameters. This is crucial in aiding the user to specify meaningful parameters for the complex scenario simulations, not through trial-and-error based on raw compute power but intelligent parameter estimation. To the best of our knowledge this is the first time closed form expressions have been computed for the ARG properties. We show that the expected values closely match the empirical values through simulations.
Finally, we demonstrate that SimRA produces the ARG in compact forms without compromising any accuracy. We demonstrate the compactness and accuracy through extensive experiments.
Availability and implementation: SimRA (Simulation based on Random graph Algorithms) source, executable, user manual and sample input-output sets are available for downloading at: https://github.com/ComputationalGenomics/SimRA
Supplementary information: Supplementary data are available at Bioinformatics online
A methodology to assess the intrinsic discriminative ability of a distance function and its interplay with clustering algorithms for microarray data analysis
Background: Clustering is one of the most well known activities in scientific investigation and the object of research in many disciplines, ranging from statistics to computer science. Following Handl et al., it can be summarized as a three-step process: (1) choice of a distance function; (2) choice of a clustering algorithm; (3) choice of a validation method. Although such a purist approach to clustering is hardly seen in many areas of science, genomic data require that level of attention if inferences made from cluster analysis are to be of some relevance to biomedical research. Results: A procedure is proposed for the assessment of the discriminative ability of a distance function, that is, the evaluation of its ability to capture structure in a dataset. It is based on the introduction of a new external validation index, referred to as Balanced Misclassification Index (BMI, for short) and of a nontrivial modification of the well known Receiver Operating Curve (ROC, for short), which we refer to as Corrected ROC (CROC, for short). The main results are: (a) a quantitative and qualitative method to describe the intrinsic separation ability of a distance; (b) a quantitative method to assess the performance of a clustering algorithm in conjunction with the intrinsic separation ability of a distance function. The proposed procedure is more informative than the ones available in the literature due to the adopted tools. Indeed, the first tool maps distances and clustering solutions as graphical objects on a plane and gives information about the bias of the clustering algorithm with respect to a distance. The second tool is a new external validity index that performs similarly to the state of the art, but with more flexibility, allowing for a broader spectrum of applications.
In fact, it makes it possible not only to quantify the merit of each clustering solution but also to quantify the agglomerative or divisive errors due to the algorithm. Conclusions: The new methodology has been used to experimentally study three popular distance functions, namely, Euclidean distance d2, Pearson correlation dr and mutual information dMI. Based on the results of the experiments, the Euclidean and Pearson correlation distances have a good intrinsic discrimination ability. Conversely, the mutual information distance does not seem to offer the same flexibility and versatility as the other two distances. Apparently, that is due to well known problems in its estimation, since it requires a dataset to have a substantial number of features to be reliable. Nevertheless, taking into account such a fact, together with results presented in Priness et al., one receives an indication that dMI may be superior to the other distances considered in this study only in conjunction with clustering algorithms specifically designed for its use. In addition, the K-means, Average Link, and Complete Link clustering algorithms are in most cases able to improve the discriminative ability of the distances considered in this study with respect to clustering. The methodology has a range of applicability that goes well beyond microarray data since it is independent of the nature of the input data. The only requirement is that the input data must have the same format as a "feature matrix". In particular, it can be used to cluster ChIP-seq data
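To make the first step of the three-step process concrete, here is a minimal sketch of two of the distance functions named above, applied to rows of a feature matrix. It assumes the common convention that the Pearson correlation distance is one minus the correlation coefficient; the paper's exact definition of dr may differ. The function names and the sample vectors are hypothetical, and BMI and CROC are not implemented here.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation (assumed convention); ranges over [0, 2]."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1 - cov / (sx * sy)

# Three toy expression profiles (assumption): y is a scaled copy of x,
# z is x reversed.
x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
z = [4, 3, 2, 1]

print(euclidean(x, y), pearson_distance(x, y))  # far apart, yet perfectly correlated
print(euclidean(x, z), pearson_distance(x, z))  # close, yet anti-correlated
```

The two distances rank the same pairs very differently, which is why the intrinsic discriminative ability of a distance is worth assessing separately from the clustering algorithm.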
Functional Information, Biomolecular Messages and Complexity of BioSequences and Structures
In the quest for a mathematical measure able to capture and shed light on the dual notions of information and complexity in biosequences, Hazen et al. have introduced the notion of Functional Information (FI for short). It is also the result of earlier considerations and findings by Szostak and Carothers et al. Based on the experiments by Carothers et al. regarding FI in RNA binding activities, we decided to study the relation existing between FI and classic measures of complexity applied to protein-DNA interactions on a genome-wide scale. Using classic complexity measures, i.e., Shannon entropy and Kolmogorov Complexity, both estimated by data compression, we found that FI applied to protein-DNA interactions is genuinely different from them. Such a fact, together with the non-triviality of the biological function considered, contributes to the establishment of FI as a novel and useful measure of biocomplexity. Remarkably, we also found a relationship, on a genome-wide scale, between the redundancy of a genomic region and its ability to interact with a protein. This latter finding further justifies some principles for the design of motif discovery algorithms. Finally, our experiments bring to light methodological limitations of Linguistic Complexity measures, i.e., a class of measures that is a function of the vocabulary richness of a sequence. Indeed, due to the technology and associated statistical preprocessing procedures used to conduct our studies, i.e., genome-wide ChIP-chip experiments, that class of measures cannot give any statistically significant indication about complexity and function. This is a serious limitation, given the widespread use of the technology.
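The measures compared above can be sketched in a few lines. The example below is illustrative only: it uses the empirical per-symbol Shannon entropy, a zlib compression ratio as a crude proxy for Kolmogorov Complexity, and the definition of Functional Information as minus the base-2 logarithm of the fraction of sequences whose activity meets a threshold, following Hazen et al. The compressor choice, the function names and the sample data are assumptions, not the authors' experimental pipeline.

```python
import math
import random
import zlib
from collections import Counter

def shannon_entropy(seq):
    """Empirical entropy of the symbol frequencies, in bits per symbol."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def compression_ratio(seq):
    """Compressed size over original size; lower means more redundant."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

def functional_information(activities, threshold):
    """FI = -log2(fraction of sequences with activity >= threshold)."""
    hits = sum(1 for a in activities if a >= threshold)
    if hits == 0:
        raise ValueError("no sequence reaches the threshold")
    return -math.log2(hits / len(activities))

# Toy sequences (assumption): one highly redundant, one pseudo-random.
redundant = "A" * 1000
shuffled = "".join(random.Random(0).choices("ACGT", k=1000))

print(shannon_entropy(redundant), shannon_entropy(shuffled))
print(compression_ratio(redundant), compression_ratio(shuffled))
print(functional_information([0.1, 0.2, 0.9, 0.95], threshold=0.9))
```

Entropy and compression depend only on the sequence itself, whereas FI depends on a measured activity, which is consistent with the finding that FI is genuinely different from the classic measures.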
References
J.M. Carothers, S.C. Oestreich, J.H. Davis, and J.W. Szostak. Informational complexity and functional activity of RNA structures. J. Am. Chem. Soc., 126 (2004), pp. 5130-5137.
R.M. Hazen, P.L. Griffin, J.M. Carothers, and J.W. Szostak. Functional information and the emergence of biocomplexity. Proc. Natl. Acad. Sci., 104 (2007), pp. 8574-8581.
J.W. Szostak. Functional information: molecular messages. Nature, 423 (2003).
Probing omics data via harmonic persistent homology
Identifying molecular signatures from complex disease patients with
underlying symptomatic similarities is a significant challenge in analysis of
high dimensional multi-omics data. Topological data analysis (TDA) provides a
way of extracting such information from the geometric structure of the data and
identifying multiway higher-order relationships. Here, we propose an application
of harmonic persistent homology, which overcomes the limitations of ambiguous
assignment of the topological information to the original elements in a
representative topological cycle from the data. When applied to multi-omics
data, this leads to the discovery of hidden patterns highlighting the
relationships between different omic profiles, while allowing for common tasks
in multi-omics analyses such as disease subtyping and, most importantly,
biomarker identification for similar latent biological pathways that are
associated with complex diseases. Our experiments on multiple cancer datasets show
that harmonic persistent homology and TDA can be very useful in dissecting
multi-omics data and identifying biomarkers, while detecting representative cycles
of the data that also predict disease subtypes
The genome sequence of the most widely cultivated cacao type and its use to identify candidate genes regulating pod color
Background
Theobroma cacao L. cultivar Matina 1-6 belongs to the most cultivated cacao type. The availability of its genome sequence and methods for identifying genes responsible for important cacao traits will aid cacao researchers and breeders.
Results
We describe the sequencing and assembly of the genome of Theobroma cacao L. cultivar Matina 1-6. The genome of the Matina 1-6 cultivar is 445 Mbp, which is significantly larger than a sequenced Criollo cultivar, and more typical of other cultivars. The chromosome-scale assembly, version 1.1, contains 711 scaffolds covering 346.0 Mbp, with a contig N50 of 84.4 kbp, a scaffold N50 of 34.4 Mbp, and an evidence-based gene set of 29,408 loci. Version 1.1 has 10x the scaffold N50 and 4x the contig N50 as Criollo, and includes 111 Mb more anchored sequence. The version 1.1 assembly has 4.4% gap sequence, while Criollo has 10.9%. Through a combination of haplotype, association mapping and gene expression analyses, we leverage this robust reference genome to identify a promising candidate gene responsible for pod color variation. We demonstrate that green/red pod color in cacao is likely regulated by the R2R3 MYB transcription factor TcMYB113, homologs of which determine pigmentation in Rosaceae, Solanaceae, and Brassicaceae. One SNP within the target site for a highly conserved trans-acting siRNA in dicots, found within TcMYB113, seems to affect transcript levels of this gene and therefore pod color variation.
Conclusions
We report a high-quality sequence and annotation of Theobroma cacao L. and demonstrate its utility in identifying candidate genes regulating traits
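The contig and scaffold N50 values quoted above follow a standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation is shown below; the function name `n50` and the example lengths are hypothetical and unrelated to the cacao assembly itself.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

# Toy contig lengths (assumption): total 290, so half is 145.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))
```

A larger N50 means that half of the assembled bases sit in longer pieces, which is why the statistic is commonly used to compare the contiguity of two assemblies.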
Transcriptome characterization and differentially expressed genes under flooding and drought stress in the biomass grasses Phalaris arundinacea and Dactylis glomerata
Background and Aims
Perennial grasses are a global resource as forage, and for alternative uses in bioenergy and as raw materials for the processing industry. Marginal lands can be valuable for perennial biomass grass production, if perennial biomass grasses can cope with adverse abiotic environmental stresses such as drought and waterlogging.
Methods
In this study, two perennial grass species, reed canary grass (Phalaris arundinacea) and cocksfoot (Dactylis glomerata), were subjected to drought and waterlogging stress to study their responses and gain insights into improving environmental stress tolerance. Physiological responses were recorded, reference transcriptomes established and differential gene expression investigated between control and stress conditions. We applied a robust non-parametric method, RoDEO, based on rank ordering of transcripts to investigate differential gene expression. Furthermore, we extended and validated vRoDEO for comparing samples with varying sequencing depths.
Key Results
This allowed us to identify expressed genes under drought and waterlogging whilst using only a limited number of RNA sequencing experiments. Validating the methodology, several differentially expressed candidate genes involved in the stage 3 step-wise scheme in detoxification and degradation of xenobiotics were recovered, while several novel stress-related genes classified as of unknown function were discovered.
Conclusions
Reed canary grass is a species that copes particularly well with flooding conditions, but this study adds novel information on how its transcriptome reacts under drought stress. We built extensive transcriptomes for the two investigated C3 species, cocksfoot and reed canary grass, under both extremes of water stress to provide a clear comparison between the two species and to broaden our horizon for comparative studies, but further confirmation of the data would be ideal to obtain a more detailed picture.
FP7 grant GrassMargin
Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licens