499 research outputs found

    A Combinatorial Approach for Single-cell Variant Detection via Phylogenetic Inference

    Single-cell sequencing provides a powerful approach for elucidating intratumor heterogeneity by resolving cell-to-cell variability. However, it also poses additional challenges, including elevated error rates, allelic dropout, and non-uniform coverage. A recently introduced single-cell-specific mutation detection algorithm leverages the evolutionary relationship between cells to denoise the data; due to its probabilistic nature, however, this method does not scale well with the number of cells. Here, we develop a novel combinatorial approach for utilizing the genealogical relationship of cells in detecting mutations from noisy single-cell sequencing data. Our method, called scVILP, jointly detects mutations in individual cells and reconstructs a perfect phylogeny among these cells. We employ a novel Integer Linear Programming (ILP) algorithm to solve the joint inference problem deterministically and efficiently. We show that scVILP achieves similar or better accuracy with significantly better runtime than existing methods on simulated data. We also applied scVILP to an empirical human cancer dataset from a high-grade serous ovarian cancer patient.
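As background for the joint inference above: a binary cell-by-mutation matrix admits a perfect phylogeny (with an all-zero root) exactly when no pair of mutation columns exhibits all three gametes (0,1), (1,0), and (1,1). A minimal sketch of that combinatorial condition, with invented toy matrices (this is the classical criterion such formulations enforce, not scVILP's actual ILP):

```python
def admits_perfect_phylogeny(matrix):
    """Three-gamete check on a binary cell-by-mutation matrix.

    A perfect phylogeny (all-zero root) exists iff no pair of columns
    contains all three gametes (0,1), (1,0), and (1,1).
    """
    n_cols = len(matrix[0])
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            gametes = {(row[i], row[j]) for row in matrix}
            if {(0, 1), (1, 0), (1, 1)} <= gametes:
                return False
    return True

# Conflict-free toy matrix: mutation columns are nested or disjoint.
clean = [[1, 1, 0],
         [1, 0, 0],
         [0, 0, 1]]

# Columns 0 and 1 exhibit all three gametes -> no perfect phylogeny.
conflicted = [[1, 1, 0],
              [1, 0, 0],
              [0, 1, 0]]
```

A mutation-detection method in this vein searches for the smallest set of entry flips that makes the noisy matrix pass this check.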

    Methods to study splicing from high-throughput RNA Sequencing data

    The development of novel high-throughput sequencing (HTS) methods for RNA (RNA-Seq) has provided a very powerful means to study splicing under multiple conditions at unprecedented depth. However, the complexity of the information to be analyzed has turned this into a challenging task. In the last few years, a plethora of tools have been developed, allowing researchers to process RNA-Seq data to study the expression of isoforms and splicing events, and their relative changes under different conditions. We provide an overview of the methods available to study splicing from short RNA-Seq data. We group the methods according to the different questions they address: 1) Assignment of the sequencing reads to their likely gene of origin. This is addressed by methods that map reads to the genome and/or to the available gene annotations. 2) Recovering the sequence of splicing events and isoforms. This is addressed by transcript reconstruction and de novo assembly methods. 3) Quantification of events and isoforms. Either after reconstructing transcripts or using an annotation, many methods estimate the expression level or the relative usage of isoforms and/or events. 4) Providing an isoform or event view of differential splicing or expression. These include methods that compare relative event/isoform abundance or isoform expression across two or more conditions. 5) Visualizing splicing regulation. Various tools facilitate the visualization of the RNA-Seq data in the context of alternative splicing. In this review, we do not describe the specific mathematical models behind each method. Our aim is rather to provide an overview that could serve as an entry point for users who need to decide on a suitable tool for a specific analysis. We also attempt to propose a classification of the tools according to the operations they perform, to facilitate the comparison and choice of methods. (Comment: 31 pages, 1 figure, 9 tables. Small corrections added.)
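The "relative usage" of a splicing event in point 3) is commonly summarized as percent spliced-in (PSI): length-normalized inclusion evidence over total evidence. A minimal sketch with invented read counts (real tools differ in how they count junction reads and normalize):

```python
def psi(inclusion_reads, exclusion_reads,
        inclusion_len=1.0, exclusion_len=1.0):
    """Percent spliced-in for a binary splicing event.

    Counts are normalized by the number of positions that can produce
    them, then PSI = inclusion / (inclusion + exclusion).
    """
    inc = inclusion_reads / inclusion_len
    exc = exclusion_reads / exclusion_len
    if inc + exc == 0:
        return float('nan')
    return inc / (inc + exc)

# Toy exon-skipping event: 30 inclusion reads spread over two
# junctions, 10 exclusion (skipping) reads over one junction.
value = psi(30, 10, inclusion_len=2.0, exclusion_len=1.0)
```

Here the normalized evidence is 15 vs. 10, giving PSI = 0.6, i.e. the exon is included in an estimated 60% of transcripts.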

    Designing Efficient Spaced Seeds for SOLiD Read Mapping

    The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies, on the one hand, on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, on a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency.
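A spaced seed is a pattern of required-match positions ('1') interleaved with don't-care positions ('0'); a read "hits" a reference position when the two sequences agree at every required position. A toy nucleotide-space sketch (the sequences and names are illustrative; the paper's designs additionally work in SOLiD color space):

```python
def seed_key(seq, seed, offset):
    """Characters of `seq` at the seed's match positions ('1'),
    starting at `offset`; don't-care positions ('0') are skipped."""
    return ''.join(seq[offset + i]
                   for i, c in enumerate(seed) if c == '1')

def seed_hits(read, reference, seed):
    """Reference offsets where `read` and the reference agree at
    every match position of the spaced seed."""
    key = seed_key(read, seed, 0)
    span = len(seed)
    return [p for p in range(len(reference) - span + 1)
            if seed_key(reference, seed, p) == key]

seed = '110101'   # matches required at positions 0, 1, 3, 5
# The hit at offset 2 tolerates the two mismatches that fall on
# don't-care positions (positions 2 and 4 of the seed).
hits = seed_hits('ACGTAC', 'TTACATGCTT', seed)
```

Because mismatches at don't-care positions are free, a well-designed spaced seed is far more sensitive than a contiguous seed of the same weight.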

    Data-Driven Methods for Demand-Side Flexibility in Energy Systems


    Investigation of charge coupled device correlation techniques

    Analog Charge Transfer Devices (CTDs) offer unique advantages to signal processing systems, which often have large development costs, making it desirable to define those devices which can be developed for general system use. Such devices are best identified and developed early, to give system designers interchangeable subsystem blocks that do not require additional individual development for each new signal processing system. The objective of this work is to describe a discrete analog signal processing device with reasonably broad system use and to implement its design, fabrication, and testing.

    Deconvolution of Expression for Nascent RNA sequencing data (DENR) highlights pre-RNA isoform diversity in human cells.

    MOTIVATION: Quantification of isoform abundance has been extensively studied at the mature-RNA level using RNA-seq but not at the level of precursor RNAs using nascent RNA sequencing. RESULTS: We address this problem with a new computational method called Deconvolution of Expression for Nascent RNA sequencing data (DENR), which models nascent RNA sequencing read counts as a mixture of user-provided isoforms. The baseline algorithm is enhanced by machine-learning predictions of active transcription start sites and an adjustment for the typical "shape profile" of read counts along a transcription unit. We show that DENR outperforms simple read-count-based methods for estimating gene and isoform abundances, and that transcription of multiple pre-RNA isoforms per gene is widespread, with frequent differences between cell types. In addition, we provide evidence that a majority of human isoform diversity derives from primary transcription rather than from post-transcriptional processes. AVAILABILITY: DENR and nascentRNASim are freely available at https://github.com/CshlSiepelLab/DENR (version v1.0.0) and https://github.com/CshlSiepelLab/nascentRNASim (version v0.3.0). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
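The core of modeling read counts as a mixture of isoform profiles can be sketched as a tiny least-squares deconvolution. The profiles and counts below are invented, and DENR's actual model additionally incorporates TSS predictions and a shape-profile adjustment; this only illustrates the mixture idea at toy scale:

```python
def deconvolve_two(profile_a, profile_b, counts):
    """Solve min ||a*profile_a + b*profile_b - counts||^2 for (a, b)
    via the 2x2 normal equations (pure Python, toy scale)."""
    aa = sum(x * x for x in profile_a)
    bb = sum(y * y for y in profile_b)
    ab = sum(x * y for x, y in zip(profile_a, profile_b))
    ac = sum(x * c for x, c in zip(profile_a, counts))
    bc = sum(y * c for y, c in zip(profile_b, counts))
    det = aa * bb - ab * ab
    return ((ac * bb - bc * ab) / det,
            (aa * bc - ab * ac) / det)

# Isoform A spans bins 0-3; isoform B is a shorter pre-RNA isoform
# spanning only bins 0-1.  Observed counts fit a=2 copies of A plus
# b=3 copies of B exactly.
profile_a = [1, 1, 1, 1]
profile_b = [1, 1, 0, 0]
counts    = [5, 5, 2, 2]
abund = deconvolve_two(profile_a, profile_b, counts)
```

With non-negativity constraints and many isoforms per gene, the same objective becomes the deconvolution problem the abstract describes.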

    High-quality, high-throughput measurement of protein-DNA binding using HiTS-FLIP

    In order to understand the behavior of transcription factors (TFs) in more depth and on a genome-wide scale, novel quantitative high-throughput experiments are needed. Recently, HiTS-FLIP (High-Throughput Sequencing-Fluorescent Ligand Interaction Profiling) was invented by the Burge lab at MIT (Nutiu et al. (2011)). Based on an Illumina GA-IIx machine for next-generation sequencing, HiTS-FLIP allows measuring the affinity of fluorescently labeled proteins to millions of DNA clusters at equilibrium in an unbiased and untargeted way, examining the entire sequence space by determining dissociation constants (Kds) for all 12-mer DNA motifs. During my PhD I helped to improve the experimental design of this method to allow measuring protein-DNA binding events at equilibrium without any washing step, by utilizing the TIRF (Total Internal Reflection Fluorescence) based optics of the GA-IIx. In addition, I developed the first versions of XML-based controlling software that automates the measurement procedure. Meeting the needs for processing the vast amount of data produced by each run, I developed a sophisticated, high-performance software pipeline that locates DNA clusters, then normalizes and extracts the fluorescent signals. Moreover, the k-mer motifs contained in clusters are ranked and their DNA binding affinities are quantified with high accuracy. My approach of applying phase correlation to estimate the relative translational offset between the observed tile images and the template images avoids resequencing and thus allows reusing the flow cell for several HiTS-FLIP experiments, which greatly reduces cost and time. Instead of using information from the sequencing images to normalize the cluster intensities, as Nutiu et al. (2011) do, which introduces a nucleotide-specific bias, I estimate the cluster-related normalization factors directly from the protein images, which captures the non-even illumination bias more accurately and leads to an improved correction for each tile image.
My analysis of the ranking algorithm by Nutiu et al. (2011) revealed that it is unable to rank all measured k-mers: discarding all clusters related to previously ranked k-mers has the side effect of also eliminating the clusters on which k-mers sharing submotifs with previously ranked k-mers could be ranked. This shortcoming affects even strongly binding k-mers that are only one mutation away from the top-ranked k-mer. My findings show that omitting the cluster-deletion step in the ranking process overcomes this limitation and allows ranking the full spectrum of all possible k-mers. In addition, this insight drastically reduces the run time of the ranking algorithm, from quadratic to linear. The experimental improvements, combined with the sophisticated processing of the data, have led to a very high accuracy of the HiTS-FLIP dissociation constants (Kds), comparable to the Kds measured by the very sensitive HiP-FA assay (Jung et al. (2015)). Experimentally, however, HiTS-FLIP is a very challenging assay: in total, eight HiTS-FLIP experiments were performed, but only one showed saturation; the others exhibited protein aggregation at the amplified DNA clusters, a biochemical issue that could not be remedied. As example TF for studying the details of HiTS-FLIP, GCN4 was chosen, a dimeric, basic leucine zipper TF that acts as the master regulator of the amino acid starvation response in Saccharomyces cerevisiae (Natarajan et al. (2001)). The fluorescent dye was mOrange. The HiTS-FLIP Kds for GCN4 were validated against the HiP-FA assay, achieving a Pearson correlation coefficient of R=0.99 and a relative error of delta=30.91%.
Thus, a unique and comprehensive data set of utmost quantitative precision was obtained that allowed studying the complex binding behavior of GCN4 in a new way. My downstream analyses reveal that the known 7-mer consensus motif of GCN4, TGACTCA, is modulated by its 2-mer neighboring flanking regions, spanning an affinity range of over two orders of magnitude, from Kd=1.56 nM to Kd=552.51 nM. These results suggest that the common 9-mer PWM (Position Weight Matrix) for GCN4 is insufficient to describe its binding behavior; rather, an additional left and right flanking nucleotide is required, extending the 9-mer to an 11-mer. My analyses of mutations and the related delta delta G values suggest long-range interdependencies between nucleotides of the two dimeric half-sites of GCN4. Consequently, models assuming positional independence, such as a PWM, cannot explain these interdependencies. Instead, the full spectrum of affinity values for all k-mers of appropriate size should be measured and applied in further analyses, as proposed by Nutiu et al. (2011). Another discovery was a set of new binding motifs of GCN4, which can only be detected with a method like HiTS-FLIP that examines the entire sequence space and allows for unbiased, de novo motif discovery. All these new motifs contain GTGT as a submotif, and the collected data suggest that GCN4 binds to these new motifs as a monomer. It might therefore even be possible to detect different binding modes with HiTS-FLIP. My results emphasize the binding complexity of GCN4 and demonstrate the advantage of HiTS-FLIP for investigating the complexity of regulatory processes.
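The deletion-free ranking described above amounts to a single pass over all clusters that aggregates intensities per contained k-mer, which is linear in the total sequence length. A hedged sketch with invented cluster data and scoring by median intensity (the thesis's actual scoring may differ):

```python
from collections import defaultdict
from statistics import median

def rank_kmers(clusters, k):
    """Rank every k-mer by the median intensity of the clusters that
    contain it.  No cluster-deletion step: each cluster contributes to
    all of its k-mers, so every k-mer sharing submotifs with stronger
    binders still gets ranked, and the pass is linear overall.
    """
    intensities = defaultdict(list)
    for seq, intensity in clusters:
        for i in range(len(seq) - k + 1):
            intensities[seq[i:i + k]].append(intensity)
    scores = {kmer: median(vals) for kmer, vals in intensities.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy clusters: two contain the GCN4-like consensus TGACTCA,
# one contains the GTGT-style motif.
clusters = [('TGACTCAA', 9.0), ('ATGACTCA', 8.0), ('GTGTGTGT', 2.0)]
ranking = rank_kmers(clusters, k=7)
```

Note that TGACTCA is scored from both clusters that contain it; under a deletion-based scheme, the cluster consumed by a previously ranked overlapping 7-mer would have been unavailable.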

    Coding for storage and testing

    The problem of reconstructing strings from substring information has found many applications due to its importance in genomic data sequencing and DNA- and polymer-based data storage. Motivated by platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for both unique string reconstruction and correction of multiple mass errors. We first consider the paradigm where the masses of substrings of the input string form the evidence set, and study two approaches. The first approach pertains to asymmetric errors; error correction is achieved by introducing redundancy that scales linearly with the number of errors and logarithmically with the length of the string. The proposed construction allows the string to be uniquely reconstructed based only on its erroneous substring composition multiset. The asymptotic code rate of the scheme is one, and decoding is accomplished via a simplified version of the backtracking algorithm used for the Turnpike problem. For symmetric errors, we use a polynomial characterization of the mass information and adapt polynomial evaluation code constructions to this setting. In the process, we develop new efficient decoding algorithms for a constant number of composition errors. The second part of this dissertation addresses a practical paradigm that requires reconstructing mixtures of strings based on the union of compositions of their prefixes and suffixes, generated by mass spectrometry devices. We describe new coding methods that allow for unique joint reconstruction of subsets of strings selected from a code and provide upper and lower bounds on the asymptotic rate of the underlying codebooks. Our code constructions combine properties of binary B_h and Dyck strings and can be extended to accommodate missing substrings in the pool. In the final chapter of this dissertation, we focus on group testing.
We begin with a review of the gold-standard testing protocol for Covid-19, real-time reverse transcription PCR, and its properties and associated measurement data, such as amplification curves, that can guide the development of appropriate and accurate adaptive group testing protocols. We then proceed to examine various off-the-shelf group testing methods for Covid-19 and identify their strengths and weaknesses for the application at hand. Finally, we present a collection of new analytical results for adaptive semiquantitative group testing with combinatorial priors, including performance bounds, algorithmic solutions, and noisy testing protocols. The worst-case paradigm extends and improves upon prior work on semiquantitative group testing with and without specialized PCR noise models.
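The prefix-suffix evidence used in the second part can be sketched in a few lines: the composition of a binary string records only its multiset of symbols (counts of 0s and 1s), and the mass spectrometer is modeled as reporting the compositions of all prefixes and suffixes, without labels saying which is which. The string and helper names below are illustrative only:

```python
def composition(s):
    """Composition of a binary string: (number of 0s, number of 1s)."""
    return (s.count('0'), s.count('1'))

def prefix_suffix_compositions(s):
    """Multiset (as a sorted list) of compositions of all non-empty
    prefixes and suffixes of `s` -- the unlabeled evidence a tandem
    mass spectrometer is modeled as providing.  The full string is
    counted once (it is both a prefix and a suffix)."""
    evidence = [composition(s[:i]) for i in range(1, len(s) + 1)]
    evidence += [composition(s[-i:]) for i in range(1, len(s))]
    return sorted(evidence)

ev = prefix_suffix_compositions('0110')
```

Reconstruction asks for the string (or a mixture of code strings) consistent with such a multiset; note that '0110' and its reversal produce identical evidence, which is exactly the kind of ambiguity the coding constraints are designed to remove.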