1,091 research outputs found
Ensemble Analysis of Adaptive Compressed Genome Sequencing Strategies
Acquiring genomes at single-cell resolution has many applications such as in
the study of microbiota. However, deep sequencing and assembly of all of the
millions of cells in a sample is prohibitively costly. A property that can come
to the rescue is that deep sequencing of every cell should not be necessary to
capture all distinct genomes, as the majority of cells are biological
replicates. Biologically important samples are often sparse in that sense. In
this paper, we propose an adaptive compressed method, also known as distilled
sensing, to capture all distinct genomes in a sparse microbial community with
reduced sequencing effort. As opposed to group testing in which the number of
distinct events is often constant and sparsity is equivalent to rarity of an
event, sparsity in our case means scarcity of distinct events in comparison to
the data size. Previously, we introduced the problem and proposed a distilled
sensing solution based on a breadth-first search strategy. We simulated the
whole process, which, due to its computational intensity, constrained our ability
to study the behavior of the algorithm across the entire ensemble. In this
paper, we modify our previous breadth-first search strategy and introduce a
depth-first search strategy. Instead of simulating the entire process, which is
intractable for a large number of experiments, we provide a dynamic programming
algorithm to analyze the behavior of the method for the entire ensemble. The
ensemble analysis algorithm recursively calculates the probability of capturing
every distinct genome and also the expected total sequenced nucleotides for a
given population profile. Our results suggest that the expected total number of
sequenced nucleotides grows proportionally with the number of cells and linearly
with the number of distinct genomes.
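As a rough, worked illustration of the ensemble quantity described above, the sketch below computes the probability that every distinct genome in a given population profile is captured at least once, using inclusion-exclusion under a simplified model in which cells are drawn i.i.d. with fixed genome frequencies. It is not the paper's dynamic-programming algorithm, and the frequencies and cell counts are made-up examples.

```python
from itertools import combinations

def prob_all_genomes_captured(freqs, n_cells):
    """Probability that n_cells i.i.d. draws from a community with genome
    frequencies `freqs` include at least one cell of every distinct genome.
    Computed by inclusion-exclusion over the events 'genome i is missed'."""
    k = len(freqs)
    total = 0.0
    for r in range(k + 1):
        for missed in combinations(range(k), r):
            p_miss_all = (1.0 - sum(freqs[i] for i in missed)) ** n_cells
            total += (-1) ** r * p_miss_all
    return total

# Toy example: a sparse community with 3 distinct genomes and skewed abundances.
freqs = [0.90, 0.09, 0.01]
for n in (10, 50, 200, 500):
    print(n, round(prob_all_genomes_captured(freqs, n), 4))
```

Under this toy model the capture probability is dominated by the rarest genome, which is why adaptive strategies that stop spending sequencing effort on already-captured abundant genomes can reduce the total cost.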
SNPredict: A Machine Learning Approach for Detecting Low Frequency Variants in Cancer
Cancer is a genetic disease caused by the accumulation of DNA variants such as single nucleotide changes or insertions/deletions in DNA. DNA variants can cause silencing of tumor suppressor genes or increase the activity of oncogenes. In order to come up with successful therapies for cancer patients, these DNA variants need to be identified accurately. DNA variants can be identified by comparing the DNA sequence of tumor tissue to that of non-tumor tissue using Next-Generation Sequencing (NGS) technology. But detecting variants in cancer is hard because many of these variants occur only in a small subpopulation of the tumor tissue. It becomes a challenge to distinguish these low-frequency variants from sequencing errors, which are common in today's NGS methods. Several algorithms have been developed and implemented as tools to identify such variants in cancer. However, it has previously been shown that there is low concordance in the results produced by these tools. Moreover, the number of false positives tends to increase significantly when these tools are faced with low-frequency variants. This study presents SNPredict, a single nucleotide polymorphism (SNP) detection pipeline that uses machine learning techniques to combine the results of multiple variant callers into a consensus output with higher accuracy than any individual tool. By extracting features from the consensus output that describe traits associated with an individual variant call, it creates binary classifiers that predict a SNP's true state and therefore help distinguish a sequencing error from a true variant.
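The general consensus-and-classify idea can be sketched as follows: per-candidate features derived from several variant callers' outputs feed a binary classifier that labels each call as a true variant or a sequencing error. The feature set, labels, and classifier below are placeholders (a scikit-learn random forest trained on synthetic data), not SNPredict's actual features or pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-candidate features: which callers reported the site,
# allele frequency, read depth, base quality, strand bias. In practice these
# would be parsed from the callers' VCF outputs; here they are random numbers.
rng = np.random.default_rng(0)
X = rng.random((1000, 5))                    # placeholder feature matrix
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # placeholder labels (true SNP vs. error)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

In a real setting the labels would come from validated truth sets rather than synthetic thresholds, and the trained classifier would be applied to every candidate call in the consensus output.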
The context-dependence of mutations: a linkage of formalisms
Defining the extent of epistasis - the non-independence of the effects of
mutations - is essential for understanding the relationship of genotype,
phenotype, and fitness in biological systems. The applications cover many areas
of biological research, including biochemistry, genomics, protein and systems
engineering, medicine, and evolutionary biology. However, the quantitative
definitions of epistasis vary among fields, and its analysis beyond just
pairwise effects remains obscure in general. Here, we show that different
definitions of epistasis are versions of a single mathematical formalism - the
weighted Walsh-Hadamard transform. We argue that one of the definitions,
background-averaged epistasis, is the most informative when the goal is to
uncover the general epistatic structure of a biological system, a description
that can be rather different from the local epistatic structure of specific
model systems. Key issues are the choice of effective ensembles for averaging
and how to practically contend with the vast combinatorial complexity of mutations.
In this regard, we discuss possible approaches for optimally learning the
epistatic structure of biological systems.
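For a combinatorially complete set of genotypes, epistatic coefficients can be read off from a Walsh-Hadamard transform of the phenotype vector. The sketch below applies the plain (unweighted) transform to a hypothetical two-site additive landscape, where the pairwise term comes out as zero; it does not reproduce the paper's weighted, background-averaged formalism.

```python
import numpy as np

def hadamard(n_sites):
    """Walsh-Hadamard matrix for n_sites biallelic positions (2**n_sites genotypes)."""
    H = np.array([[1.0]])
    block = np.array([[1.0, 1.0], [1.0, -1.0]])
    for _ in range(n_sites):
        H = np.kron(H, block)
    return H

# Phenotypes of all 2**2 genotypes of a 2-site system, ordered 00, 01, 10, 11.
# With purely additive effects the pairwise (last) coefficient is zero.
phenotypes = np.array([0.0, 1.0, 2.0, 3.0])
n = 2
coeffs = hadamard(n) @ phenotypes / 2 ** n
print(coeffs)   # [1.5, -0.5, -1.0, 0.0] -> mean, single-site effects, pairwise epistasis
```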
A Reference-Free Lossless Compression Algorithm for DNA Sequences Using a Competitive Prediction of Two Classes of Weighted Models
The development of efficient data compressors for DNA sequences is crucial not only for reducing the storage and the bandwidth for transmission, but also for analysis purposes. In particular, the development of improved compression models directly influences the outcome of anthropological and biomedical compression-based methods. In this paper, we describe a new lossless compressor with improved compression capabilities for DNA sequences representing different domains and kingdoms. The reference-free method uses a competitive prediction model to estimate, for each symbol, the best class of models to be used before applying arithmetic encoding. There are two classes of models: weighted context models (including substitutional tolerant context models) and weighted stochastic repeat models. Both classes of models use specific sub-programs to handle inverted repeats efficiently. The results show that the proposed method attains a higher compression ratio than state-of-the-art approaches, on a balanced and diverse benchmark, using a competitive level of computational resources. An efficient implementation of the method is publicly available, under the GPLv3 license.
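To give a concrete sense of what a single context model contributes, the sketch below estimates the ideal arithmetic-coding cost of a DNA string under one Laplace-smoothed order-k context model. It deliberately omits the competitive selection between model classes, the substitutional-tolerant and repeat models, and the inverted-repeat handling described above; the toy sequence is made up.

```python
from collections import defaultdict
from math import log2

def order_k_code_length(seq, k, alpha=1.0):
    """Estimate the cost (in bits) of encoding a DNA sequence with a single
    order-k context model using Laplace-smoothed counts, i.e. the ideal
    arithmetic-coding length -sum(log2 P(symbol | context))."""
    alphabet = "ACGT"
    counts = defaultdict(lambda: {s: 0 for s in alphabet})
    bits = 0.0
    for i in range(len(seq)):
        ctx = seq[max(0, i - k):i]
        table = counts[ctx]
        total = sum(table.values())
        p = (table[seq[i]] + alpha) / (total + alpha * len(alphabet))
        bits += -log2(p)
        table[seq[i]] += 1
    return bits

seq = "ACGT" * 250 + "ACGTT" * 200   # a toy, partly repetitive sequence
for k in (1, 2, 4, 8):
    print(k, round(order_k_code_length(seq, k) / len(seq), 3), "bits/base")
```

Higher orders capture more of the toy sequence's structure until contexts become too sparse, which is one reason practical compressors weight and combine several model classes rather than relying on a single order.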
Efficient Compression of Biological Sequences Using a Neural Network
Background: The increasing production of genomic data has led to
an intensified need for models that can cope efficiently with the lossless
compression of biosequences. Important applications include long-term
storage and compression-based data analysis. In the literature, only a
few recent articles propose the use of neural networks for biosequence
compression. However, they fall short when compared with specific
DNA compression tools, such as GeCo2. This limitation is due to the
absence of models specifically designed for DNA sequences. In this
work, we combine the power of neural networks with specific DNA and
amino acid models. For this purpose, we created GeCo3 and AC2, two
new biosequence compressors. Both use a neural network for mixing
the opinions of multiple specific models.
Findings: We benchmark GeCo3 as a reference-free DNA compressor
on five datasets, including a balanced and comprehensive dataset
of DNA sequences, the Y-chromosome and human mitogenome, two
compilations of archaeal and virus genomes, four whole genomes, and
two collections of FASTQ data of a human virome and ancient DNA.
GeCo3 achieves a solid improvement in compression over the previous
version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively.
As a reference-based DNA compressor, we benchmark GeCo3 on four
datasets consisting of pairwise compression of the chromosomes of
several primate genomes. GeCo3 improves the compression by
12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of
this compression improvement is some additional computational time
(1.7× to 3.0× slower than GeCo2). RAM usage is constant, and the tool
scales efficiently, independently of the sequence size. Overall, these
values outperform the state of the art. For AC2, the improvements and
costs over AC are similar, which allows the tool to also outperform the
state of the art.
Conclusions: GeCo3 and AC2 are biosequence compressors with
a neural-network mixing approach that provides additional gains over
top specific biocompressors. The proposed mixing method is portable,
requiring only the probabilities of the models as inputs, providing easy
adaptation to other data compressors or compression-based data analysis
tools. GeCo3 and AC2 are released under GPLv3 and are available
for free download at https://github.com/cobilab/geco3 and
https://github.com/cobilab/ac2.
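A minimal sketch of the neural mixing idea, under assumptions that differ from GeCo3/AC2's actual network and training procedure: two models emit next-symbol probability vectors, and softmax weights over the models are trained online by gradient descent on the log-loss of the mixture. The models and data below are synthetic placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class MixtureOfModels:
    """Online mixing of several models' next-symbol probabilities with
    softmax weights trained by gradient descent on the log-loss."""
    def __init__(self, n_models, lr=0.05):
        self.theta = np.zeros(n_models)
        self.lr = lr

    def predict(self, probs):                  # probs: (n_models, alphabet_size)
        w = softmax(self.theta)
        return w @ probs, w

    def update(self, probs, target, w, p_mix):
        # gradient of -log p_mix[target] with respect to theta
        grad = -w * (probs[:, target] - p_mix[target]) / p_mix[target]
        self.theta -= self.lr * grad

# Toy demo: model 0 is well calibrated, model 1 always outputs uniform probabilities.
rng = np.random.default_rng(1)
mix = MixtureOfModels(n_models=2)
for _ in range(2000):
    target = rng.integers(4)
    good = np.full(4, 0.1)
    good[target] = 0.7
    probs = np.vstack([good, np.full(4, 0.25)])
    p_mix, w = mix.predict(probs)
    mix.update(probs, target, w, p_mix)
print("learned weights:", np.round(softmax(mix.theta), 3))  # should favour model 0
```

The design point, as in the abstract, is that the mixer needs only the models' probabilities as inputs, so it can sit on top of any set of predictors, whether DNA context models or other data compressors.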
The Capabilities of Chaos and Complexity
To what degree could chaos and complexity have organized a Peptide or RNA World of crude yet necessarily integrated protometabolism? How far could such protolife evolve in the absence of a heritable linear digital symbol system that could mutate, instruct, regulate, optimize and maintain metabolic homeostasis? To address these questions, chaos, complexity, self-ordered states, and organization must all be carefully defined and distinguished. In addition, their cause-and-effect relationships and mechanisms of action must be delineated. Are there any formal (non-physical, abstract, conceptual, algorithmic) components to chaos, complexity, self-ordering and organization, or are they entirely physicodynamic (physical, mass/energy interaction alone)? Chaos and complexity can produce some fascinating self-ordered phenomena. But can spontaneous chaos and complexity steer events and processes toward pragmatic benefit, select function over non-function, optimize algorithms, integrate circuits, produce computational halting, organize processes into formal systems, and control and regulate existing systems toward greater efficiency? The question is pursued of whether there might be some yet-to-be-discovered new law of biology that will elucidate the derivation of prescriptive information and control. “System” will be rigorously defined. Can a low-informational rapid succession of Prigogine’s dissipative structures self-order into bona fide organization?
Next generation analytic tools for large scale genetic epidemiology studies of complex diseases
Over the past several years, genome‐wide association studies (GWAS) have succeeded in identifying hundreds of genetic markers associated with common diseases. However, most of these markers confer relatively small increments of risk and explain only a small proportion of familial clustering. To identify obstacles to future progress in genetic epidemiology research and provide recommendations to NIH for overcoming these barriers, the National Cancer Institute sponsored a workshop entitled “Next Generation Analytic Tools for Large‐Scale Genetic Epidemiology Studies of Complex Diseases” on September 15–16, 2010. The goal of the workshop was to facilitate discussions on (1) statistical strategies and methods to efficiently identify genetic and environmental factors contributing to the risk of complex disease; and (2) how to develop, apply, and evaluate these strategies for the design, analysis, and interpretation of large‐scale complex disease association studies in order to guide NIH in setting the future agenda in this area of research. The workshop was organized as a series of short presentations covering scientific (gene‐gene and gene‐environment interaction, complex phenotypes, and rare variants and next generation sequencing) and methodological (simulation modeling and computational resources and data management) topic areas. Specific needs to advance the field were identified during each session and are summarized. Genet. Epidemiol. 36:22–35, 2012. © 2011 Wiley Periodicals, Inc.
Advanced Methods for Real-time Metagenomic Analysis of Nanopore Sequencing Data
Whole shotgun metagenomics sequencing allows researchers to retrieve information about all organisms in a complex sample. This method enables microbiologists to detect pathogens in clinical samples, study the microbial diversity in various environments, and detect abundance differences of certain microbes under different living conditions. The emergence of nanopore sequencing has offered many new possibilities for clinical and environmental microbiologists. In particular, the portability of the small nanopore sequencing devices and the ability to selectively sequence only DNA from interesting organisms are expected to make a significant contribution to the field. However, both options require memory-efficient methods that perform real-time data analysis on commodity hardware such as ordinary laptops.
In this thesis, I present new methods for real-time analysis of nanopore sequencing data in a metagenomic context. These methods are based on optimized algorithmic approaches querying the sequenced data against large sets of reference sequences. The main goal of those contributions is to improve the sequencing and analysis of underrepresented organisms in complex metagenomic samples and enable this analysis in low-resource settings in the field.
First, I introduce ReadBouncer as a new tool for nanopore adaptive sampling, which can reject uninteresting DNA molecules during the sequencing process. ReadBouncer improves read classification compared to other adaptive sampling tools and has lower memory requirements. These improvements enable higher enrichment of underrepresented sequences while performing adaptive sampling in the field. I further show that, besides host sequence removal and enrichment of low-abundant microbes, adaptive sampling can enrich underrepresented plasmid sequences in bacterial samples. These plasmids play a crucial role in the dissemination of antibiotic resistance genes. However, their characterization requires expensive and time-consuming lab protocols. I describe how adaptive sampling can be used as a cheap method for the enrichment of plasmids, which can make a significant contribution to the point-of-care sequencing of bacterial pathogens. Finally, I introduce a novel memory- and space-efficient algorithm for real-time taxonomic profiling of nanopore reads that was implemented in Taxor. It improves the taxonomic classification of nanopore reads compared to other taxonomic profiling tools and tremendously reduces the memory footprint. The resulting database index for thousands of microbial species is small enough to fit into the memory of a small laptop, enabling real-time metagenomics analysis of nanopore sequencing data with large reference databases in the field.
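The adaptive-sampling decision described above can be caricatured as follows: classify the prefix of an incoming read against a reference k-mer set and reject the read if it looks like host or otherwise uninteresting DNA. The in-memory k-mer set, threshold, and sequences below are toy placeholders; ReadBouncer's actual index structures and the sequencer's reject interface are not modeled.

```python
def kmers(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

class AdaptiveSampler:
    """Toy adaptive-sampling decision: reject a read whose prefix shares
    enough k-mers with a host/'uninteresting' reference set."""
    def __init__(self, reference_seqs, k=15, threshold=0.3):
        self.k = k
        self.threshold = threshold
        self.ref_kmers = set()
        for ref in reference_seqs:
            self.ref_kmers |= kmers(ref, k)

    def decide(self, read_prefix):
        qk = kmers(read_prefix, self.k)
        if not qk:
            return "keep"
        shared = len(qk & self.ref_kmers) / len(qk)
        return "reject" if shared >= self.threshold else "keep"

host = "ACGTACGGTTCAGGACCTTGAACGGTAT" * 20          # toy host reference
sampler = AdaptiveSampler([host], k=15, threshold=0.3)
print(sampler.decide(host[100:500]))                                   # -> "reject"
print(sampler.decide("TTTTGCACGATCGATTTACGCGCATATATGGCCTAGGATCC" * 10))  # -> "keep"
```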