42 research outputs found

    Optimal Haplotype Assembly from High-Throughput Mate-Pair Reads

    Humans have 23 pairs of homologous chromosomes. The homologous pairs are almost identical pairs of chromosomes. For the most part, differences between homologous chromosomes occur at certain documented positions called single nucleotide polymorphisms (SNPs). A haplotype of an individual is the pair of sequences of SNPs on the two homologous chromosomes. In this paper, we study the problem of inferring haplotypes of individuals from mate-pair reads of their genome. We give a simple formula for the coverage needed for haplotype assembly, under a generative model. The analysis here leverages connections of this problem with decoding convolutional codes. Comment: 10 pages, 4 figures, Submitted to ISIT 201

    Partial DNA Assembly: A Rate-Distortion Perspective

    Earlier formulations of the DNA assembly problem were all in the context of perfect assembly; i.e., given a set of reads from a long genome sequence, is it possible to perfectly reconstruct the original sequence? In practice, however, it is very often the case that the read data is not sufficiently rich to permit unambiguous reconstruction of the original sequence. While a natural generalization of the perfect assembly formulation to these cases would be to consider a rate-distortion framework, partial assemblies are usually represented in terms of an assembly graph, making the definition of a distortion measure challenging. In this work, we introduce a distortion function for assembly graphs that can be understood as the logarithm of the number of Eulerian cycles in the assembly graph, each of which corresponds to a candidate assembly that could have generated the observed reads. We also introduce an algorithm for the construction of an assembly graph and analyze its performance on real genomes. Comment: To be published at ISIT-2016. 11 pages, 10 figures

    Explicit MBR All-Symbol Locality Codes

    Node failures are inevitable in distributed storage systems (DSS). To enable efficient repair when faced with such failures, two main techniques are known: Regenerating codes, i.e., codes that minimize the total repair bandwidth; and codes with locality, which minimize the number of nodes participating in the repair process. This paper focuses on regenerating codes with locality, using pre-coding based on Gabidulin codes, and presents constructions that utilize minimum bandwidth regenerating (MBR) local codes. The constructions achieve maximum resilience (i.e., optimal minimum distance) and have maximum capacity (i.e., maximum rate). Finally, the same pre-coding mechanism can be combined with a subclass of fractional-repetition codes to enable maximum resilience and repair-by-transfer simultaneously

    Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts

    Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality. We propose a novel method that compares and clusters cells based on their transcript-compatibility read counts rather than on the transcript or gene quantifications used in standard analysis pipelines. In the reanalysis of two landmark yet disparate single-cell RNA-seq datasets, we show that our method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays

    Subnational mapping of HIV incidence and mortality among individuals aged 15–49 years in sub-Saharan Africa, 2000–18 : a modelling study

    Background: High-resolution estimates of HIV burden across space and time provide an important tool for tracking and monitoring the progress of prevention and control efforts and assist with improving the precision and efficiency of targeting efforts. We aimed to assess HIV incidence and HIV mortality for all second-level administrative units across sub-Saharan Africa. Methods: In this modelling study, we developed a framework that used the geographically specific HIV prevalence data collected in seroprevalence surveys and antenatal care clinics to train a model that estimates HIV incidence and mortality among individuals aged 15–49 years. We used a model-based geostatistical framework to estimate HIV prevalence at the second administrative level in 44 countries in sub-Saharan Africa for 2000–18 and sought data on the number of individuals on antiretroviral therapy (ART) by second-level administrative unit. We then modified the Estimation and Projection Package (EPP) to use these HIV prevalence and treatment estimates to estimate HIV incidence and mortality by second-level administrative unit. Findings: The estimates suggest substantial variation in HIV incidence and mortality rates both between and within countries in sub-Saharan Africa, with 15 countries having a ten-times or greater difference in estimated HIV incidence between the second-level administrative units with the lowest and highest estimated incidence levels. Across all 44 countries in 2018, HIV incidence ranged from 2·8 (95% uncertainty interval 2·1–3·8) in Mauritania to 1585·9 (1369·4–1824·8) cases per 100 000 people in Lesotho and HIV mortality ranged from 0·8 (0·7–0·9) in Mauritania to 676·5 (513·6–888·0) deaths per 100 000 people in Lesotho. Variation in both incidence and mortality was substantially greater at the subnational level than at the national level and the highest estimated rates were accordingly higher. Among second-level administrative units, Guijá District, Gaza Province, Mozambique, had the highest estimated HIV incidence (4661·7 [2544·8–8120·3]) cases per 100 000 people in 2018 and Inhassunge District, Zambezia Province, Mozambique, had the highest estimated HIV mortality rate (1163·0 [679·0–1866·8]) deaths per 100 000 people. Further, the rate of reduction in HIV incidence and mortality from 2000 to 2018, as well as the ratio of new infections to the number of people living with HIV was highly variable. Although most second-level administrative units had declines in the number of new cases (3316 [81·1%] of 4087 units) and number of deaths (3325 [81·4%]), nearly all appeared well short of the targeted 75% reduction in new cases and deaths between 2010 and 2020. Interpretation: Our estimates suggest that most second-level administrative units in sub-Saharan Africa are falling short of the targeted 75% reduction in new cases and deaths by 2020, which is further compounded by substantial within-country variability. These estimates will help decision makers and programme implementers expand access to ART and better target health resources to higher burden subnational areas