Estimage: a webserver hub for the computation of methylation age
Methylation age (methylage) is an epigenetic marker of biological age that exploits the correlation between the methylation state of specific CG dinucleotides (CpGs) and chronological age (in years), gestational age (in weeks) or cellular age (in cell cycles or as telomere length, in kilobases). Using DNA methylation data, methylage is measurable via the so-called epigenetic clocks. Importantly, alterations of the correlation between methylage and age (age acceleration or deceleration) have been stably associated with pathological states and occur long before clinical signs of disease become overt, making epigenetic clocks a potentially disruptive tool in preventive, diagnostic and also forensic applications. Nevertheless, the dependency of methylage on CpG selection, mathematical modelling, tissue specificity and age range still limits the potential of this biomarker. In order to enhance model comparison, interchange, availability, robustness and standardization, we organized a selected set of clocks within a hub webservice, EstimAge (Estimate of methylation Age, http://estimage.iac.rm.cnr.it), which intuitively and informatively enables quick identification, computation and comparison of available clocks, with the support of standard statistics.
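Most published clocks of the kind collected in EstimAge are, at their core, penalized linear models over a fixed set of CpG β-values, sometimes followed by a calibration transform. The following is a minimal sketch of that generic structure; the CpG identifiers, weights and intercept are hypothetical placeholders, not the coefficients of any published clock, and the Horvath-style inverse age transform is included only as an assumption about how calibration is often done.

```python
import numpy as np

# Hypothetical clock definition: CpG identifiers and linear weights.
# Real clocks (Horvath, Hannum, ...) ship hundreds of coefficients.
CLOCK_CPGS = ["cg00000001", "cg00000002", "cg00000003"]
CLOCK_WEIGHTS = np.array([12.5, -8.1, 3.4])
CLOCK_INTERCEPT = 0.7

def inverse_age_transform(x, adult_age=20):
    """Horvath-style inverse of the age transform used during training
    (assumption: the clock was fitted on transformed ages)."""
    return np.where(x < 0,
                    (1 + adult_age) * np.exp(x) - 1,
                    (1 + adult_age) * x + adult_age)

def methylation_age(beta_by_cpg):
    """beta_by_cpg: dict mapping CpG id -> beta value in [0, 1]."""
    betas = np.array([beta_by_cpg[cpg] for cpg in CLOCK_CPGS])
    linear_score = CLOCK_INTERCEPT + betas @ CLOCK_WEIGHTS
    return float(inverse_age_transform(linear_score))

sample = {"cg00000001": 0.62, "cg00000002": 0.35, "cg00000003": 0.80}
print(f"Estimated methylation age: {methylation_age(sample):.1f} years")
```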
Editorial: Computational Methods for Analysis of DNA Methylation Data
DNA methylation is among the most studied epigenetic modifications in eukaryotes. The interest in DNA methylation stems from its role in development, as well as its well-established association with phenotypic changes. In particular, there is strong evidence that methylation pattern alterations in mammals are linked to developmental disorders and cancer (Kulis and Esteller, 2010). Owing to its potential as a prognostic marker for preventive medicine, in recent years the analysis of DNA methylation data has garnered interest in many different contexts of computational biology (Bock, 2012). As typically happens with omic data, processing, analyzing and interpreting large-scale DNA methylation datasets requires computational methods and software tools that address multiple challenges. In the present Research Topic, we collected papers that tackle different aspects of computational approaches for the analysis of DNA methylation data. Some of these manuscripts present novel computational solutions for copy number variation detection, cell-type deconvolution and methylation pattern imputation, while others discuss interpretations of well-established computational techniques.
Over the last 10 years, DNA methylation profiles have been successfully exploited to develop biomarkers of age, also referred to as epigenetic clocks (Bell et al., 2019). Epigenetic clocks accurately estimate both chronological and biological age from methylation levels. DNA methylation age and, most importantly, its deviation from chronological age have been shown to be associated with a variety of health issues. More recently, a second generation of epigenetic clocks has emerged. This new generation of clocks incorporates not only methylation profiles but also environmental variables, such as smoking and alcohol consumption, and it outperforms the first generation in mortality prediction and prognosis of certain diseases. In our collection, the review by Chen et al. compares the first and second generations of epigenetic clocks that predict cancer risk and discusses pathways known to exhibit altered methylation in aging tissues and cancer.
Differentially methylated regions (DMRs), that is, genomic regions that show significant differences in methylation levels across distinct biological and/or medical conditions (e.g., normal vs. disease), have been reported to be implicated in a variety of disorders (Rakyan et al., 2011). As a result, identifying DMRs is one of the most critical and fundamental challenges in deciphering disease mechanisms at the molecular level. Although DNA methylation patterns remain stable during normal somatic cell growth, alterations in genomic methylation may be caused by genetic alterations, or vice versa. However, standard DMR analysis often ignores whether methylation alterations should be viewed as a cause or an effect. Rahmani et al. discuss the effect of model directionality, i.e., whether the condition of interest (phenotype) may be affected by methylation or whether it may affect methylation, in differential methylation analyses at the cell-type level. They show that correctly accounting for model directionality has a significant impact on the ability to identify cell-type-specific differential methylation.
Different cell types exhibit DMRs at many genomic regions, and such rich information can be exploited to infer underlying cell-type proportions using deconvolution techniques. DNA methylation-based cell mixture deconvolution approaches can be classified into two main categories: reference-based and reference-free. While the latter are more broadly applicable, as they do not rely on the availability of methylation profiles from each of the purified cell types that compose a tissue of interest, they are also less precise. Reference-based approaches use DMRs specific to cell types (the reference library) to determine the underlying cellular composition of a DNA methylation sample. The quality of the reference library has a substantial impact on the accuracy of reference-based approaches. Bell-Glenn et al. present RESET, a framework for reference library selection for deconvolution algorithms that exploits a modified version of the Dispersion Separability Criterion score to infer the best DMRs composing the library, contributing to de facto standards (Koestler et al., 2016). In short, RESET requires neither an a priori choice of the size of the reference library (number of DMRs) nor costly DNA methylation profiles of purified cells.
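Once a reference library is chosen, reference-based deconvolution typically reduces to a constrained regression of the bulk sample's β-values at the library CpGs onto the reference cell-type means. The sketch below assumes a Houseman-style non-negative least-squares formulation; the reference matrix, mixture proportions and noise level are synthetic placeholders, not data from any of the tools discussed above.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Reference library: rows are library CpGs (DMRs), columns are purified
# cell types; entries are mean beta values (synthetic placeholders).
n_cpgs, n_cell_types = 100, 4
reference = rng.uniform(0.05, 0.95, size=(n_cpgs, n_cell_types))

# Simulate a bulk sample as a known mixture of the reference profiles.
true_props = np.array([0.5, 0.3, 0.15, 0.05])
bulk = reference @ true_props + rng.normal(0, 0.01, size=n_cpgs)

# Non-negative least squares, then renormalize so proportions sum to 1.
raw, _ = nnls(reference, bulk)
proportions = raw / raw.sum()

print("estimated:", np.round(proportions, 3))
print("true:     ", true_props)
```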
Within a cellular population, the methylation patterns of different cell types at specific genomic locations are indicative of cellular heterogeneity. Alterations of such heterogeneity are predictive of developmental processes and serve as prognostic markers of disease. Computational methods that exploit heterogeneity in methylation patterns are typically constrained by partially observed patterns, because shotgun sequencing frequently yields limited coverage for downstream analysis. One possible solution to overcome such limitations is offered by Chang et al., who present BSImp, a probabilistic imputation method that uses local information to impute partially observed methylation patterns. They show that with this approach they can recover heterogeneity estimates at 15% more regions at moderate sequencing depths, which should improve our ability to study how methylation heterogeneity is associated with disease.
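BSImp itself is not reproduced here, but the general idea of borrowing local information can be illustrated with a toy conditional-frequency imputer: a missing CpG state on a read is filled with its majority value among fully observed reads that agree with it at the observed neighbouring positions. This is a deliberately simplified stand-in for the published probabilistic model; the read data are invented.

```python
from collections import Counter

# Reads over 4 neighbouring CpGs: 1 = methylated, 0 = unmethylated,
# None = unobserved (toy data; real reads come from bisulfite sequencing).
reads = [
    (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1),
    (0, 0, 1, 1), (1, 1, None, 0), (None, 0, 1, 1),
]

def impute_read(read, complete_reads):
    """Fill each missing CpG with its majority value among complete reads
    that agree with this read at all observed positions."""
    imputed = list(read)
    for i, state in enumerate(read):
        if state is not None:
            continue
        matches = [r[i] for r in complete_reads
                   if all(r[j] == read[j] for j in range(len(read))
                          if j != i and read[j] is not None)]
        if matches:
            imputed[i] = Counter(matches).most_common(1)[0][0]
    return tuple(imputed)

complete = [r for r in reads if None not in r]
for r in reads:
    if None in r:
        print(r, "->", impute_read(r, complete))
```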
Finally, recent studies have shown that the associations between Copy Number Variations (CNVs) and methylation alterations offer a richer and hence more informative picture of the samples under study, in particular for tumor data characterized by large-scale genomic rearrangements (Sun et al., 2018). Consequently, recent technological and methodological developments have made it possible to measure CNVs from DNA methylation data. The main advantage of DNA methylation-based CNV approaches is that they offer the possibility to integrate both genomic (copy number) and epigenomic (methylation) information. Mariani et al. propose MethylMasteR, an R software package that integrates DNA methylation-based CNV calling routines, facilitating standardization, comparison and customization of CNV analyses. The package, distributed via Docker to seamlessly manage dependencies, includes four of the most commonly used routines for this integrated analysis, ChAMP (Morris et al., 2014), SeSAMe (Zhou et al., 2018), Epicopy (Cho et al., 2019), plus a custom version of cnAnalysis450k (Knoll et al., 2017), enabling the comparative analysis of their results.
All the topics in this issue, although limited to specific aspects of DNA methylation analysis, highlight the importance of research in this field and the associated computational challenges, and illustrate the significant impact that this type of data will likely have on preventive medicine.
Multilevel omic data integration in cancer cell lines: advanced annotation and emergent properties
BACKGROUND: High-throughput (omic) data have become more widespread in both quantity and frequency of use, thanks to technological advances, lower costs and higher precision. Consequently, computational scientists face two parallel challenges: on one side, designing efficient methods to interpret each of these data types in its own right (gene expression signatures, protein markers, etc.); on the other, meeting a novel, pressing request from the biological field for methodologies that allow these data to be interpreted as a whole, i.e., not only as the union of relevant molecules in each layer, but as a complex molecular signature containing proteins, mRNAs and miRNAs, all directly associated in the results of analyses able to capture inter-layer connections and complexity. RESULTS: We address the latter challenge by testing an integrated approach on a known cancer benchmark: the NCI-60 cell panel. Here, high-throughput screens for mRNA, miRNA and proteins are jointly analyzed using factor analysis, combined with linear discriminant analysis, to identify the molecular characteristics of cancer. Comparisons with separate (non-joint) analyses show that the proposed integrated approach can uncover deeper and more precise biological information. In particular, the integrated approach gives a more complete picture of the set of miRNAs identified and of the Wnt pathway, which represents an important surrogate marker of melanoma progression. We further test the approach on a more challenging patient dataset, for which we are able to identify clinically relevant markers. CONCLUSIONS: The integration of multiple omic layers can bring more information than the analysis of single layers alone. Using and expanding the proposed integrated framework to omic data from other molecular levels will allow researchers to uncover further systemic information. The application of this approach to a clinically challenging dataset shows its promising potential.
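The joint scheme outlined above can be prototyped with off-the-shelf components: standardise and concatenate the mRNA, miRNA and protein matrices, extract common factors, then feed the factor scores to a linear discriminant classifier. The sketch below only mirrors that overall structure on synthetic data; the layer sizes, labels and number of factors are arbitrary assumptions, not the paper's actual analysis settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n_samples = 60  # e.g., cell lines

# Synthetic omic layers measured on the same samples (placeholders).
mrna    = rng.normal(size=(n_samples, 200))
mirna   = rng.normal(size=(n_samples, 50))
protein = rng.normal(size=(n_samples, 80))
labels  = rng.integers(0, 2, size=n_samples)  # e.g., tumour subtype

# 1) Stack layers so each factor can mix mRNAs, miRNAs and proteins.
joint = np.hstack([StandardScaler().fit_transform(x)
                   for x in (mrna, mirna, protein)])

# 2) Factor analysis captures shared cross-layer structure.
fa = FactorAnalysis(n_components=5, random_state=0)
scores = fa.fit_transform(joint)

# 3) LDA on the factor scores separates the phenotype classes.
lda = LinearDiscriminantAnalysis().fit(scores, labels)
print("training accuracy:", lda.score(scores, labels))

# Loadings (components_) indicate which molecules drive each factor.
print("loadings shape:", fa.components_.shape)  # (5, 330)
```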
Joint analysis of transcriptional and post-transcriptional brain tumor data: searching for emergent properties of cellular systems
Background: Advances in biotechnology offer a fast-growing variety of high-throughput data for screening molecular activities of genomic, transcriptional, post-transcriptional and translational observations. However, to date, most computational and algorithmic efforts have been directed at mining data from each of these molecular levels (genomic, transcriptional, etc.) separately. In view of the rapid advances in technology (next-generation sequencing, high-throughput proteomics), it is important to address the problem of analyzing these data as a whole, i.e. preserving the emergent properties that appear in the cellular system when all molecular levels are interacting. We analyzed one of the (currently) few datasets that provide both transcriptional and post-transcriptional data of the same samples to investigate the possibility of extracting more information using a joint analysis approach. Results: We use Factor Analysis coupled with pre-established knowledge as a theoretical base to achieve this goal. Our intention is to identify structures that contain information from both mRNAs and miRNAs, and that can explain the complexity of the data. Despite the small sample available, we show that this approach permits the identification of meaningful structures, in particular two polycistronic miRNA genes related to transcriptional activity and likely to be relevant in the discrimination between gliosarcomas and other brain tumors. Conclusions: This suggests the need to develop methodologies to simultaneously mine information from different levels of biological organization, rather than linking separate analyses performed in parallel.
Methylation data imputation performances under different representations and missingness patterns
Background: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data throughout the human genome. This naturally entails the management of missing values, which can complicate the analysis of the data. Several general and specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performances under different missing data mechanisms, (completely) at random or not, and different representations of DNA methylation levels (β- and M-values).
Results: We make an extensive analysis of the imputation performances of seven imputation methods on simulated missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) methylation data. We further consider imputation performances on the popular β- and M-value representations of methylation levels. Overall, β-values enable better imputation performances than M-values. Imputation accuracy is lower for mid-range β-values, while it is generally higher for values at the extremes of the β-value range. The distribution of MAR values is, on average, denser in the mid-range than the expected β-value distribution; as a consequence, MAR values are on average harder to impute.
Conclusions: The results of the analysis provide guidelines for the most suitable imputation approaches for DNA methylation data under different representations of DNA methylation levels and different missing data mechanisms.
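The two representations compared above are related by a logit/logistic pair, M = log2(β / (1 − β)) and β = 2^M / (2^M + 1). A minimal sketch of the conversion follows; the optional offset α, often added in practice to guard against β-values of exactly 0 or 1, is an assumption and its value is not prescribed by the study.

```python
import numpy as np

def beta_to_m(beta, alpha=0.0):
    """M = log2((beta + alpha) / (1 - beta + alpha)); alpha > 0 avoids
    division by zero at the extremes (choice of alpha is an assumption)."""
    beta = np.asarray(beta, dtype=float)
    return np.log2((beta + alpha) / (1.0 - beta + alpha))

def m_to_beta(m):
    """Inverse transform: beta = 2**M / (2**M + 1)."""
    m = np.asarray(m, dtype=float)
    return 2.0 ** m / (2.0 ** m + 1.0)

betas = np.array([0.05, 0.5, 0.95])
print(beta_to_m(betas))             # approx [-4.25, 0.0, 4.25]
print(m_to_beta(beta_to_m(betas)))  # recovers the original betas
```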
A Comprehensive Molecular Interaction Map for Rheumatoid Arthritis
Computational biology contributes to a variety of areas related to the life sciences and, due to the growing impact of translational medicine - the scientific approach to medicine in tight relation with basic science -, is becoming an important player in clinically related areas. In this study, we use computational methods to improve our understanding of the complex interactions that occur between molecules related to Rheumatoid Arthritis (RA). Due to the complexity of the disease and the numerous molecular players involved, we devised a method to construct a systemic network of interactions of the processes ongoing in patients affected by RA. The network is based on high-throughput data, refined semi-automatically with carefully curated literature-based information. This global network has then been topologically analysed, as a whole and tissue-specifically, in order to translate the experimental molecular connections into topological motifs meaningful for the identification of tissue-specific markers and targets in the diagnosis, and possibly in the therapy, of RA.
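The kind of topological screening mentioned above can be sketched with networkx: build the interaction graph, then rank molecules by degree and betweenness centrality to nominate candidate hubs or bottlenecks. The edge list below is a synthetic placeholder with illustrative gene symbols, not the curated RA interaction map.

```python
import networkx as nx

# Synthetic placeholder interactions (gene/protein symbols are illustrative).
edges = [
    ("TNF", "IL6"), ("TNF", "NFKB1"), ("IL6", "STAT3"),
    ("NFKB1", "IL1B"), ("IL1B", "IL6"), ("STAT3", "VEGFA"),
    ("NFKB1", "STAT3"), ("TNF", "IL1B"),
]
g = nx.Graph(edges)

degree = dict(g.degree())
betweenness = nx.betweenness_centrality(g)

# Rank nodes by betweenness to highlight potential hub/bottleneck markers.
for node in sorted(g.nodes, key=betweenness.get, reverse=True):
    print(f"{node:6s} degree={degree[node]} betweenness={betweenness[node]:.2f}")
```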
Electrostimulation of a 3D in vitro skin model to activate wound healing
The aim of this work is to propose a methodology for the stimulation of a 3D in vitro skin model to activate wound healing. The work presented is part of the national research project CronXCov, "Checking the CHRONIC to prevent COVID-19", devoted to understanding how physiologic and inflamed skin-on-chip 3D models evolve over time upon a range of physical (e.g., electrical, mechanical, optical) stimulations. Thanks to the 3D model, and using Next Generation Sequencing together with a network medicine framework of analysis to process the data, we will systematically characterize the effects of the applied stimuli, offering new insights for the exploitation of wound healing.
Circuits and Systems for High-Throughput Biology
The beginning of this millennium has been marked by some remarkable scientific events, notably the completion of the first objective of the Human Genome Project (HGP), i.e., the decoding of the 3 billion bases that compose the human genome. This success has been made possible by advances in bio-engineering and data processing and by the collaboration of scientists from academic institutions and private companies in many countries. The availability of biological information through web-accessible open databases has stirred further research and enthusiasm. More interestingly, this has changed the way in which molecular biology is approached today, since the newly available large amount of data requires a tight interaction between information technology and life science, in a way not appreciated before. Still, much has to be accomplished to realize the potential impact of using the knowledge that we have acquired. Several grand challenges are still open, such as the diagnosis and treatment of a number of diseases, understanding the details of the complex mechanisms that regulate life, and predicting and controlling the evolution of several biological processes. Nevertheless, there is now unprecedented room to reach these objectives, because the underlying technologies that we master have been exploited only to a limited extent. High-throughput biological data acquisition and processing technologies have shifted the focus of biological research from the realm of traditional experimental science (wet biology) to that of information science (in silico biology). Powerful computation and communication means can be applied to the very large amount of apparently incoherent data coming from biomedical research. The technical challenges that lie ahead include the interface between the information contained in biological samples and its abstraction in terms of the mathematical models and binary data that computer engineers are used to handling. For example, how can we automate costly, repetitive and time-consuming processes for the analysis of data that must cover the information contained in a whole organism's genome? How can we design a drug that triggers a specific response? Anyone wearing the hat of a Circuits and Systems engineer would immediately realize that one important issue is the interfacing of the biological to the electrical world, which is often realized by microscopic probes able to capture and manipulate bio-materials at the molecular level. A portion of the costly and time-consuming experiments and tests that we used to do in vitro and/or in vivo can now be done in silico. The concept of Laboratory (Lab) on Chip (LoC) is the natural evolution of System on Chip (SoC) using an array of heterogeneous technologies. Whether LoCs will be realized on a monolithic chip or as a combination of modules is just a technicality. The revolution brought by Labs on Chips is related to the rationalization of bio-analysis, the drastic reduction of sample quantities, and their portability to various environments. We have witnessed the widespread distribution of complex electronic systems due to their low manufacturing costs. In this case too, LoC costs will be key to their acceptance. But it is easy to foresee that LoCs may be mass produced with post-silicon manufacturing technologies, where large production volumes correlate with competitive costs.
Labs on Chips at medical points of care will fulfill the desire for faster and more accurate diagnosis. Moreover, diagnosis at home and/or at mass transit facilities (e.g., airports) can have a significant impact on overall population health. LoCs for processing environmental data (e.g., pollution) may be coupled with wireless sensor networks to better monitor the planet. The use of the information produced by the Human Genome Project (marking the beginning of the Genomic Era) and its further refinement and understanding (post-Genomic Era), as well as the handling of the related moral and legal implications for the betterment of society, has just started. In fact, the decoding of the Human Genome paved the way to a different approach to molecular biology, in that it is now possible to observe the interrelations among whole bodies of molecules such as genes, proteins, transcripts and metabolites in parallel (the so-called omic data, like genomes, proteomes, transcriptomes, metabolomes, etc.), rather than observe and characterize a single chain of a cascade of events (i.e., perform genomic vs. genetic analyses). In other words, molecular biology underwent an important shift in its research paradigm, from a reductionist to a more systemic approach (systems biology), for which models developed in engineering will be of primary importance.
Discovering Coherent Biclusters from Gene Expression Data Using Zero-Suppressed Binary Decision Diagrams
The biclustering method can be a very useful analysis tool when some genes have multiple functions and the experimental conditions in gene expression measurements are diverse. This is because the biclustering approach, in contrast to conventional clustering techniques, focuses on finding a subset of the genes and a subset of the experimental conditions that together exhibit coherent behavior. However, the biclustering problem is inherently intractable, and it is often computationally costly to find biclusters with high levels of coherence. In this work, we propose a novel biclustering algorithm that exploits the zero-suppressed binary decision diagram (ZBDD) data structure to cope with the computational challenges. Our method can find all biclusters that satisfy specific input conditions, and it is scalable to practical gene expression data. We also present experimental results confirming the effectiveness of our approach.
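The ZBDD machinery of the paper is not reproduced here, but the coherence that biclustering looks for can be illustrated with the classic Cheng-Church mean squared residue of a candidate gene/condition submatrix, a generic score from the biclustering literature and not the specific condition used by this algorithm. The expression matrix below is synthetic.

```python
import numpy as np

def mean_squared_residue(expr, gene_idx, cond_idx):
    """Cheng-Church mean squared residue of the bicluster defined by the
    selected genes (rows) and conditions (columns); lower = more coherent."""
    sub = expr[np.ix_(gene_idx, cond_idx)]
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall = sub.mean()
    residue = sub - row_means - col_means + overall
    return float((residue ** 2).mean())

rng = np.random.default_rng(2)
expr = rng.normal(size=(20, 10))            # toy expression matrix
# Embed an additive (coherent) pattern in a 5-gene x 4-condition block.
expr[:5, :4] = rng.normal(size=(5, 1)) + rng.normal(size=(1, 4))

print("coherent block:", mean_squared_residue(expr, range(5), range(4)))
print("random block  :", mean_squared_residue(expr, range(5, 10), range(4, 8)))
```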