261 research outputs found
Minimum error correction-based haplotype assembly: considerations for long read data
The single nucleotide polymorphism (SNP) is the most widely studied type of
genetic variation. A haplotype is defined as the sequence of alleles at SNP
sites on each haploid chromosome. Haplotype information is essential in
unravelling the genome-phenotype association. Haplotype assembly is a
well-known approach for reconstructing haplotypes, exploiting reads generated
by DNA sequencing devices. The Minimum Error Correction (MEC) metric is often
used for reconstruction of haplotypes from reads. However, problems with the
MEC metric have been reported. Here, we investigate the MEC approach to
demonstrate that it may result in incorrectly reconstructed haplotypes for
devices that produce error-prone long reads. Specifically, we evaluate this
approach for devices developed by Illumina, Pacific BioSciences and Oxford
Nanopore Technologies. We show that imprecise haplotypes may be reconstructed
with a lower MEC than that of the exact haplotype. The performance of MEC is
explored for different coverage levels and error rates of data. Our simulation
results reveal that in order to avoid incorrect MEC-based haplotypes, a
coverage of 25 is needed for reads generated by Pacific BioSciences RS systems.
Comment: 17 pages, 6 figures
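As an illustration of the MEC metric named in the abstract above, here is a minimal sketch for the diploid case. It assumes the common definition of the score: the smallest number of allele calls that must be flipped so that every read agrees with one of the two candidate haplotypes. The function names, the 0/1 allele encoding, the use of `None` for uncovered sites, and the toy reads are all assumptions made for this example and are not taken from the paper.

```python
from typing import Optional, Sequence

# A read stores one allele per SNP site: 0, 1, or None where the site is not covered.
Read = Sequence[Optional[int]]


def mismatches(read: Read, haplotype: Sequence[int]) -> int:
    """Count covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, haplotype) if r is not None and r != h)


def mec_score(reads: Sequence[Read], hap_a: Sequence[int], hap_b: Sequence[int]) -> int:
    """Sum, over all reads, of the distance to the closer of the two haplotypes."""
    return sum(min(mismatches(r, hap_a), mismatches(r, hap_b)) for r in reads)


# Toy data (hypothetical): four SNP sites, five reads.
reads = [
    [0, 0, None, None],
    [None, 0, 1, None],
    [1, 1, None, None],
    [None, 1, 0, 1],
    [0, 1, 1, None],  # disagrees with hap_a at one site and with hap_b at two
]
hap_a = [0, 0, 1, 0]
hap_b = [1, 1, 0, 1]

print(mec_score(reads, hap_a, hap_b))
```

Under this definition a haplotype pair other than the true one can reach an equal or lower score when read errors are frequent, which is the failure mode the abstract reports.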
In silico assessment of a novel single-molecule protein fingerprinting method employing fragmentation and nanopore detection
Summary: The identification of proteins at the single-molecule level would open exciting new venues in biological research and disease diagnostics. Previously, we proposed a nanopore-based method for protein identification called chop-n-drop fingerprinting, in which the fragmentation pattern induced and measured by a proteasome-nanopore construct is used to identify single proteins. In the simulation study presented here, we show that 97.1% of human proteome constituents are uniquely identified under close to ideal measuring circumstances, using a simple alignment-based classification method. We show that our method is robust against experimental error, as 69.4% can still be identified if the resolution is twice as low as currently attainable, and 10% of proteasome restriction sites and protein fragments are randomly ignored. Based on these results and our experimental proof of concept, we argue that chop-n-drop fingerprinting has the potential to make cost-effective single-molecule protein identification feasible in the near future.
Caretta – A multiple protein structure alignment and feature extraction suite
The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.
Cnidaria: fast, reference-free clustering of raw and assembled genome and transcriptome NGS data
Background: Identification of biological specimens is a major requirement for
a range of applications. Reference-free methods analyse unprocessed sequencing
data without relying on prior knowledge, but generally do not scale to
arbitrarily large genomes and arbitrarily large phylogenetic distances.
Results: We present Cnidaria, a practical tool for clustering genomic and
transcriptomic data with no limitation on genome size or phylogenetic
distances. We successfully simultaneously clustered 169 genomic and
transcriptomic datasets from 4 kingdoms, achieving 100% identification accuracy
at supra-species level and 78% accuracy for species level. Discussion: Cnidaria
allows for fast, resource-efficient comparison and identification of both raw
and assembled genome and transcriptome data. This can help answer both
fundamental (e.g. in phylogeny, ecological diversity analysis) and practical
questions (e.g. sequencing quality control, primer design).
Comment: 47 pages, 13 figures
Topology of molecular interaction networks
Molecular interactions are often represented as network models which have become the common language of many areas of biology. Graphs serve as convenient mathematical representations of network models and have themselves become objects of study. Their topology has been intensively researched over the last decade after evidence was found that they share underlying design principles with many other types of networks. Initial studies suggested that molecular interaction network topology is related to biological function and evolution. However, further whole-network analyses did not lead to a unified view on what this relation may look like, with conclusions highly dependent on the type of molecular interactions considered and the metrics used to study them. It is unclear whether global network topology drives function, as suggested by some researchers, or whether it is simply a byproduct of evolution or even an artefact of representing complex molecular interaction networks as graphs. Nevertheless, network biology has progressed significantly over the last years. We review the literature, focusing on two major developments. First, realizing that molecular interaction networks can be naturally decomposed into subsystems (such as modules and pathways), topology is increasingly studied locally rather than globally. Second, there is a move from a descriptive approach to a predictive one: rather than correlating biological network topology to generic properties such as robustness, it is used to predict specific functions or phenotypes. Taken together, this change in focus from globally descriptive to locally predictive points to new avenues of research. In particular, multi-scale approaches are developments promising to drive the study of molecular interaction networks further.
Family-Based Haplotype Estimation and Allele Dosage Correction for Polyploids Using Short Sequence Reads
DNA sequence reads contain information about the genomic variants located on a single chromosome. By extracting and extending this information using the overlaps between the reads, the haplotypes of an individual can be obtained. Using parent-offspring relationships in a population can considerably improve the quality of the haplotypes obtained from short reads, as pedigree information can be used to correct for spurious overlaps (due to sequencing errors) and insufficient overlaps (due to short read lengths, low genomic variation and shallow coverage). We developed a novel method, PopPoly, to estimate polyploid haplotypes in an F1-population from short sequence data by taking into consideration the transmission of the haplotypes from the parents to the offspring. In addition, this information is employed to improve genotype dosage estimation and to call missing genotypes in the population. Through simulations, we compare PopPoly to other haplotyping methods and show its better performance. We evaluate PopPoly by applying it to a tetraploid potato cross at nine genomic regions involved in tuber formation.
Genomic prediction in plants: opportunities for ensemble machine learning based approaches [version 2; peer review: 1 approved, 2 approved with reservations]
Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.
Shifts in growth strategies reflect tradeoffs in cellular economics
The growth rate-dependent regulation of cell size, ribosomal content, and metabolic efficiency follows a common pattern in unicellular organisms: with increasing growth rates, cell size and ribosomal content increase and a shift to energetically inefficient metabolism takes place. The latter two phenomena are also observed in fast growing tumour cells and cell lines. These patterns suggest a fundamental principle of design. In biology such designs can often be understood as the result of the optimization of fitness. Here we show that in basic models of self-replicating systems these patterns are the consequence of maximizing the growth rate. Whereas most models of cellular growth consider a part of physiology, for instance only metabolism, the approach presented here integrates several subsystems to a complete self-replicating system. Such models can yield fundamentally different optimal strategies. In particular, it is shown how the shift in metabolic efficiency originates from a tradeoff between investments in enzyme synthesis and metabolic yields for alternative catabolic pathways. The models elucidate how the optimization of growth by natural selection shapes growth strategies.