Recovering Sparse Signals Using Sparse Measurement Matrices in Compressed DNA Microarrays
Microarrays (DNA, protein, etc.) are massively parallel affinity-based biosensors capable of detecting and quantifying a large number of different genomic particles simultaneously. Among them, DNA microarrays comprising tens of thousands of probe spots are currently being employed to test a multitude of targets in a single experiment. In conventional microarrays, each spot contains a large number of copies of a single probe designed to capture a single target and hence collects only a single data point. This is a wasteful use of the sensing resources in comparative DNA microarray experiments, where a test sample is measured relative to a reference sample. Typically, only a fraction of the total number of genes represented by the two samples is differentially expressed, and thus a vast number of probe spots may not provide any useful information. To this end, we propose an alternative design, the so-called compressed microarrays, wherein each spot contains copies of several different probes and the total number of spots is potentially much smaller than the number of targets being tested. Fewer spots directly translate to significantly lower costs due to cheaper array manufacturing, simpler image acquisition and processing, and smaller amounts of genomic material needed for experiments. To recover signals from compressed microarray measurements, we leverage ideas from compressive sampling. For sparse measurement matrices, we propose an algorithm that has significantly lower computational complexity than the widely used linear-programming-based methods, and can also recover signals with less sparsity.
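The recovery setting described in this abstract can be illustrated with a small sketch. The abstract does not specify the authors' algorithm, so the code below swaps in orthogonal matching pursuit (OMP), a standard greedy compressive-sampling method, purely as a stand-in. The function names, the 0/1 matrix construction, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def sparse_binary_matrix(m, n, ones_per_column, rng):
    """Hypothetical compressed-array design: each of n targets (columns)
    is probed by a few randomly chosen spots out of m (rows)."""
    A = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=ones_per_column, replace=False)
        A[rows, j] = 1.0
    return A

def omp(A, y, k, tol=1e-10):
    """Orthogonal matching pursuit (stand-in, not the paper's method):
    greedily select up to k columns, refitting by least squares each time."""
    n = A.shape[1]
    norms = np.linalg.norm(A, axis=0)
    chosen = np.zeros(n, dtype=bool)
    support = []
    coef = np.zeros(0)
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break
        # Correlate the residual with every normalized column.
        scores = np.abs(A.T @ residual) / norms
        scores[chosen] = -np.inf  # never pick the same column twice
        j = int(np.argmax(scores))
        chosen[j] = True
        support.append(j)
        # Least-squares refit on the current support.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(n)
    if support:
        x[support] = coef
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = sparse_binary_matrix(m=40, n=200, ones_per_column=5, rng=rng)
    x_true = np.zeros(200)
    x_true[rng.choice(200, size=3, replace=False)] = [2.0, -1.5, 1.0]
    x_hat = omp(A, A @ x_true, k=3)
    print("recovery error:", np.linalg.norm(x_hat - x_true))
```

Greedy recovery of this kind is not guaranteed to succeed for every matrix and sparsity level; the sketch only shows how far fewer spots (rows) than targets (columns) can still determine a sparse signal in favorable cases.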
A backward procedure for change-point detection with applications to copy number variation detection
Change-point detection has recently regained much attention for analyzing array or
sequencing data for copy number variation (CNV) detection. In such
applications, the true signals are typically very short and buried in the long
data sequence, which makes it challenging to identify the variations
efficiently and accurately. In this article, we propose a new change-point
detection method, a backward procedure, which is not only fast and simple
enough to exploit high-dimensional data but also performs very well for
detecting short signals. Although motivated by CNV detection, the backward
procedure is generally applicable to assorted change-point problems that arise
in a variety of scientific applications. It is illustrated by both simulated
and real CNV data that the backward detection has clear advantages over other
competing methods, especially when the true signal is short.
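The abstract does not give the test statistic or stopping rule of the backward procedure, so the following is only a generic bottom-up sketch of the idea: start with every observation as its own segment and repeatedly merge the adjacent pair whose merge costs the least, stopping once the cheapest merge exceeds a threshold. The cost function (increase in residual sum of squares), the `threshold` parameter, and the function names are assumptions.

```python
import numpy as np

def merge_cost(n1, mean1, n2, mean2):
    """Increase in residual sum of squares when two adjacent
    segments (sizes n1, n2; means mean1, mean2) are merged."""
    return n1 * n2 / (n1 + n2) * (mean1 - mean2) ** 2

def backward_changepoints(y, threshold):
    """Bottom-up merging sketch; returns the start index of every
    segment after the first, i.e. the estimated change points."""
    y = np.asarray(y, dtype=float)
    # Each segment is stored as [start index, length, mean].
    segs = [[i, 1, v] for i, v in enumerate(y)]
    while len(segs) > 1:
        costs = [merge_cost(a[1], a[2], b[1], b[2])
                 for a, b in zip(segs, segs[1:])]
        i = int(np.argmin(costs))
        if costs[i] > threshold:
            break  # every remaining boundary is kept as a change point
        a, b = segs[i], segs[i + 1]
        n = a[1] + b[1]
        segs[i:i + 2] = [[a[0], n, (a[1] * a[2] + b[1] * b[2]) / n]]
    return [s[0] for s in segs[1:]]

# A short elevated run (length 3) inside a flat sequence.
signal = [0.0] * 10 + [5.0] * 3 + [0.0] * 10
print(backward_changepoints(signal, threshold=1.0))  # → [10, 13]
```

This naive version recomputes all merge costs on every pass and is therefore quadratic in the sequence length; a priority queue over adjacent pairs would be the usual way to speed it up.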
Bacterial Community Reconstruction Using A Single Sequencing Reaction
Bacteria are the unseen majority on our planet, with millions of species and
comprising most of the living protoplasm. While current methods enable in-depth
study of a small number of communities, a simple tool for breadth studies of
bacterial population composition in a large number of samples is lacking. We
propose a novel approach for reconstruction of the composition of an unknown
mixture of bacteria using a single Sanger-sequencing reaction of the mixture.
This method is based on compressive sensing theory, which deals with
reconstruction of a sparse signal using a small number of measurements.
Utilizing the fact that in many cases each bacterial community is comprised of
a small subset of the known bacterial species, we show the feasibility of this
approach for determining the composition of a bacterial mixture. Using
simulations, we show that sequencing a few hundred base-pairs of the 16S rRNA
gene sequence may provide enough information for reconstruction of mixtures
containing tens of species, out of tens of thousands, even in the presence of
realistic measurement noise. Finally, we show initial promising results when
applying our method for the reconstruction of a toy experimental mixture with
five species. Our approach may offer a practical and efficient
way of identifying bacterial species compositions in biological samples.
Comment: 28 pages, 12 figures
Data Discovery and Anomaly Detection Using Atypicality: Theory
A central question in the era of 'big data' is what to do with the enormous
amount of information. One possibility is to characterize it through
statistics, e.g., averages, or classify it using machine learning, in order to
understand the general structure of the overall data. The perspective in this
paper is the opposite, namely that most of the value in the information in some
applications is in the parts that deviate from the average, that are unusual,
atypical. We define what we mean by 'atypical' in an axiomatic way as data that
can be encoded with fewer bits in itself rather than using the code for the
typical data. We show that this definition has good theoretical properties. We
then develop an implementation based on universal source coding, and apply this
to a number of real world data sets.
Comment: 40 pages
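The definition of atypicality above compares two code lengths, which can be sketched for binary data. The choices below are assumptions made for illustration, not taken from the abstract: the typical data is modeled as i.i.d. Bernoulli with a known parameter `p_one` strictly between 0 and 1, and encoding a sequence "in itself" is approximated by n·H(p̂) + ½·log2(n) bits plus a fixed `overhead_bits` penalty.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_code_length(bits, p_one):
    """Bits needed under the code designed for typical data,
    assumed here to be i.i.d. Bernoulli(p_one) with 0 < p_one < 1."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p_one) + zeros * math.log2(1 - p_one))

def self_code_length(bits, overhead_bits=0.0):
    """Approximate bits needed to encode the sequence 'in itself':
    empirical entropy plus a parameter cost (an assumed approximation)."""
    n = len(bits)
    p_hat = sum(bits) / n
    return n * binary_entropy(p_hat) + 0.5 * math.log2(n) + overhead_bits

def is_atypical(bits, p_one, overhead_bits=0.0):
    """Atypical if the self-contained code is strictly shorter."""
    return self_code_length(bits, overhead_bits) < typical_code_length(bits, p_one)

print(is_atypical([1] * 20, p_one=0.5))    # → True
print(is_atypical([0, 1] * 10, p_one=0.5))  # → False
```

In the first call the typical code spends 20 bits while the self-contained code needs only about 2.2, so the run of ones is flagged; in the second the balanced sequence already matches the typical model, and the parameter cost makes the self-contained code longer.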
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures
Structural Alignment of RNAs Using Profile-csHMMs and Its Application to RNA Homology Search: Overview and New Results
Systematic research on noncoding RNAs (ncRNAs) has revealed that many ncRNAs are actively involved in various biological networks. Therefore, in order to fully understand the mechanisms of these networks, it is crucial to understand the roles of ncRNAs. Unfortunately, the annotation of ncRNA genes that give rise to functional RNA molecules has begun only recently, and it is far from complete. Considering the huge amount of genome sequence data, we need efficient computational methods for finding ncRNA genes. One effective way of finding ncRNA genes is to look for regions that are similar to known ncRNA genes. As many ncRNAs have well-conserved secondary structures, we need statistical models that can represent such structures for this purpose. In this paper, we propose a new method for representing RNA sequence profiles and finding structural alignments of RNAs based on profile context-sensitive hidden Markov models (profile-csHMMs). Unlike existing models, the proposed approach can handle any kind of RNA secondary structure, including pseudoknots. We show that profile-csHMMs can provide an effective framework for the computational analysis of RNAs and the identification of ncRNA genes.
Features of mammalian microRNA promoters emerge from polymerase II chromatin immunoprecipitation data
Background: MicroRNAs (miRNAs) are short, non-coding RNA regulators of protein coding genes. miRNAs play a very important role in diverse biological processes and various diseases. Many algorithms are able to predict miRNA genes and their targets, but their transcription regulation is still under investigation. It is generally believed that intragenic miRNAs (located in introns or exons of protein coding genes) are co-transcribed with their host genes and that most intergenic miRNAs are transcribed from their own RNA polymerase II (Pol II) promoters. However, the length of the primary transcripts and the promoter organization are currently unknown. Methodology: We performed Pol II chromatin immunoprecipitation (ChIP)-chip using a custom array surrounding regions of known miRNA genes. To identify the true core transcription start sites of the miRNA genes, we developed a new tool (CPPP). We showed that miRNA genes can be transcribed from promoters located several kilobases away and that their promoters share the same general features as those of protein coding genes. Finally, we found evidence that as many as 26% of the intragenic miRNAs may be transcribed from their own unique promoters. Conclusion: miRNA promoters have similar features to those of protein coding genes, but miRNA transcript organization is more complex. © 2009 Corcoran et al.
DNA as a medium for storing digital signals
Motivated by the storage capacity and efficiency of the DNA molecule, in this paper we propose to utilize DNA molecules to store digital signals. We show that hybridization of DNA molecules can be used as a similarity criterion for retrieving digital signals encoded and stored in a DNA database. Since retrieval is achieved through hybridization of query and data-carrying DNA molecules, we present a mathematical model to estimate hybridization efficiency (also known as selectivity annealing). We show that selectivity annealing is inversely proportional to the mean squared error (MSE) of the encoded signal values. In addition, we show that the concentration of the molecules plays the same role as the decision threshold employed in digital signal matching algorithms. Finally, similarly to the digital domain, we define a DNA signal-to-noise ratio (SNR) measure to assess the performance of the DNA-based retrieval scheme. Simulations are presented to validate our arguments.
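The digital-domain counterpart that this abstract uses as its point of comparison can be sketched as follows: stored signals are retrieved when their MSE to the query falls below a decision threshold, and match quality is summarized by a conventional SNR in decibels. This is not the paper's hybridization model or its DNA SNR definition; the function names, the toy database, and the threshold values are assumptions.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def retrieve(query, database, threshold):
    """Return the names of stored signals whose MSE to the query is at
    most the decision threshold (the role the abstract attributes to
    molecule concentration in the DNA setting)."""
    return [name for name, sig in database.items()
            if mse(query, sig) <= threshold]

def snr_db(signal, noisy):
    """Conventional SNR in dB: signal power over error power.
    Undefined (division by zero) when the two signals are identical."""
    signal_power = float(np.mean(np.asarray(signal, dtype=float) ** 2))
    return 10.0 * np.log10(signal_power / mse(signal, noisy))

# Hypothetical toy database of three stored signals.
database = {"a": [1, 2, 3], "b": [1, 2, 4], "c": [9, 9, 9]}
print(retrieve([1, 2, 3], database, threshold=0.5))  # → ['a', 'b']
print(retrieve([1, 2, 3], database, threshold=0.1))  # → ['a']
```

Lowering the threshold makes retrieval more selective, which mirrors the abstract's statement that concentration acts like a decision threshold.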