    Genetic Sequence Matching Using D4M Big Data Approaches

    Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M), an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Accumulo database to accelerate computations one hundredfold over other methods. Comparisons of the D4M method with the current gold standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST. Comment: 6 pages; to appear in IEEE High Performance Extreme Computing (HPEC) 201
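The abstract above does not spell out the algorithm, so the following is only a minimal sketch of the general idea behind associative-array sequence matching, assuming sequences are broken into overlapping k-mers and matched by counting shared k-mers. The function names (`kmers`, `build_index`, `shared_kmer_counts`), the choice of `k`, and the toy sequences are all hypothetical; this is plain Python, not D4M, MATLAB, or Accumulo code.

```python
from collections import defaultdict
from typing import Dict, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of overlapping length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(seqs: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the ids of the sequences that contain it.

    This plays the role of a sparse (k-mer x sequence) associative array.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for seq_id, seq in seqs.items():
        for km in kmers(seq, k):
            index[km].add(seq_id)
    return index


def shared_kmer_counts(
    reference: Dict[str, str], samples: Dict[str, str], k: int
) -> Dict[str, Dict[str, int]]:
    """For every sample, count the distinct k-mers it shares with each reference.

    This is equivalent to multiplying two sparse 0/1 matrices
    (sample x k-mer) and (k-mer x reference); only nonzero entries are stored.
    """
    index = build_index(reference, k)
    counts: Dict[str, Dict[str, int]] = {}
    for sample_id, seq in samples.items():
        row: Dict[str, int] = defaultdict(int)
        for km in kmers(seq, k):
            for ref_id in index.get(km, ()):
                row[ref_id] += 1
        counts[sample_id] = dict(row)
    return counts


# Toy data (hypothetical): one sample compared against two references.
reference = {"refA": "ACGTACGT", "refB": "TTTTGGGG"}
samples = {"s1": "CGTAC"}
result = shared_kmer_counts(reference, samples, k=4)
```

References with many shared k-mers would then be candidates for a closer alignment; how the published method scores or filters candidates is not stated in the abstract.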

    Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra

    The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing have rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D$^{4}$RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D$^{4}$RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest

    A Linear Algebra Approach to Fast DNA Mixture Analysis Using GPUs

    Analysis of DNA samples is an important step in forensics, and the speed of analysis can impact investigations. Comparison of DNA sequences is based on the analysis of short tandem repeats (STRs), which are short DNA sequences of 2-5 base pairs. Current forensics approaches use 20 STR loci for analysis. The use of single nucleotide polymorphisms (SNPs) has utility for analysis of complex DNA mixtures. The use of tens of thousands of SNP loci for analysis poses significant computational challenges because the forensic analysis scales by the product of the loci count and number of DNA samples to be analyzed. In this paper, we discuss the implementation of a DNA sequence comparison algorithm by re-casting the algorithm in terms of linear algebra primitives. By developing an overloaded matrix multiplication approach to DNA comparisons, we can leverage advances in GPU hardware and algorithms for Dense Generalized Matrix-Multiply (DGEMM) to speed up DNA sample comparisons. We show that it is possible to compare 2048 unknown DNA samples with 20 million known samples in under 6 seconds using an NVIDIA K80 GPU. Comment: Accepted for publication at the 2017 IEEE High Performance Extreme Computing conference
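To illustrate what recasting a profile comparison as a matrix product can look like, here is a minimal CPU-only sketch. It assumes each profile is a 0/1 vector over the same SNP loci (1 = minor allele observed) and that the score of interest is the number of loci where a known profile carries a minor allele the unknown sample lacks. That encoding, the scoring rule, and the name `overloaded_matmul` are assumptions for illustration; the abstract does not give the paper's actual operators or GPU kernels.

```python
from typing import List

Profile = List[int]  # assumed encoding: 1 = minor allele observed at a locus, 0 = not


def overloaded_matmul(known: List[Profile], unknown: List[Profile]) -> List[List[int]]:
    """Matrix-product-shaped comparison of known and unknown profiles.

    Entry [i][j] counts loci where known[i] has a minor allele that
    unknown[j] lacks. The usual multiply is replaced by k AND (NOT u);
    the usual add is kept as integer addition.
    """
    n_loci = len(known[0])
    if any(len(p) != n_loci for p in known + unknown):
        raise ValueError("all profiles must cover the same loci")
    return [
        [sum(k & (1 - u) for k, u in zip(k_row, u_row)) for u_row in unknown]
        for k_row in known
    ]


# Toy data (hypothetical): two known profiles, one unknown sample, four loci.
known = [
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]
unknown = [
    [1, 1, 1, 0],
]
mismatches = overloaded_matmul(known, unknown)
```

With 0/1 inputs, `k & (1 - u)` equals the ordinary product `k * (1 - u)`, so the same result can be obtained by multiplying the known matrix by the transposed complement of the unknown matrix. That is the property that would let a dense matrix-multiply routine do the work.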

    Performance of Bootstrap Embedding for long-range interactions and 2D systems

    Fragment embedding approaches offer the possibility of accurate description of strongly correlated systems with low-scaling computational expense. In particular, wave function embedding approaches have demonstrated the ability to subdivide systems across highly entangled regions, promising wide applicability for a number of challenging systems. In this paper, we focus on the wave function embedding method Bootstrap Embedding, extending it to the Pariser–Parr–Pople and 2D Hubbard models in order to evaluate the behaviour of the method in systems that are less amenable to local fragment embedding. We find that Bootstrap Embedding remains accurate for these systems, and we investigate how fragment size, shape, and choice of matching conditions affect the results. We also evaluate the properties of Bootstrap Embedding that lead to the method's favourable convergence properties. Keywords: Embedding; correlation; Bootstrap; DMET. National Science Foundation (U.S.) (Grant CHE-1464804)

    Construction of an ~700-kb transcript map around the Familial Mediterranean Fever locus on human chromosome 16p13.3

    We used a combination of cDNA selection, exon amplification, and computational prediction from genomic sequence to isolate transcribed sequences from genomic DNA surrounding the familial Mediterranean fever (FMF) locus. Eighty-seven kb of genomic DNA around D16S3370, a marker showing a high degree of linkage disequilibrium with FMF, was sequenced to completion, and the sequence annotated. A transcript map reflecting the minimal number of genes encoded within the ∼700 kb of genomic DNA surrounding the FMF locus was assembled. This map consists of 27 genes with discrete messages detectable on Northerns, in addition to three olfactory-receptor genes, a cluster of 18 tRNA genes, and two putative transcriptional units that have typical intron–exon splice junctions yet do not detect messages on Northerns. Four of the transcripts are identical to genes described previously, seven have been independently identified by the French FMF Consortium, and the others are novel. Six related zinc-finger genes, a cluster of tRNAs, and three olfactory receptors account for the majority of transcribed sequences isolated from a 315-kb FMF central region (between D16S468/D16S3070 and cosmid 377A12). Interspersed among them are several genes that may be important in inflammation. This transcript map not only has permitted the identification of the FMF gene (MEFV), but also has provided us with an opportunity to probe the structural and functional features of this region of chromosome 16. Michael Centola, Xiaoguang Chen, Raman Sood, Zuoming Deng, Ivona Aksentijevich, Trevor Blake, Darrell O. Ricke, Xiang Chen, Geryl Wood, Nurit Zaks, Neil Richards, David Krizman, Elizabeth Mansfield, Sinoula Apostolou, Jingmei Liu, Neta Shafran, Anil Vedula, Melanie Hamon, Andrea Cercek, Tanaz Kahan, Deborah Gumucio, David F. Callen, Robert I. Richards, Robert K. Moyzis, Norman A. Doggett, Francis S. Collins, P. Paul Liu, Nathan Fischel-Ghodsian and Daniel L. Kastne

    Defined Mixtures Set 1

    Defined mixtures of 2 to 5 contributors for 3K and 39K SNP Estonian profile

    Fast P(RMNE) Data

    Data associated with Fast P(RMNE) article on rapid high precision random man not excluded calculation

    11 Million SNP Profiles datasets

    High throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) provides additional applications for DNA forensics including identification, mixture analysis, kinship prediction, and biogeographic ancestry prediction. Public repositories of human genetic data are being rapidly generated and released, but the majority of these samples are de-identified to protect privacy, and have little or no individual metadata such as appearance (photos), ethnicity, relatives, etc. A reference in silico dataset has been generated to enable development and testing of new DNA forensics algorithms. This dataset provides 11 million SNP profiles for individuals with defined ethnicities and family relationships spanning eight generations with admixture for a panel with 39,108 SNPs

    Two Different Antibody-Dependent Enhancement (ADE) Risks for SARS-CoV-2 Antibodies

    COVID-19 (SARS-CoV-2) disease severity and stages vary from asymptomatic, mild flu-like symptoms, moderate, severe, and critical to chronic disease. COVID-19 disease progression includes lymphopenia, elevated proinflammatory cytokines and chemokines, accumulation of macrophages and neutrophils in lungs, immune dysregulation, cytokine storms, acute respiratory distress syndrome (ARDS), etc. Development of vaccines against severe acute respiratory syndrome (SARS), Middle East Respiratory Syndrome coronavirus (MERS-CoV), and other coronaviruses has been difficult due to vaccine-induced enhanced disease responses in animal models. Multiple betacoronaviruses including SARS-CoV-2 and SARS-CoV-1 expand cellular tropism by infecting some phagocytic cells (immature macrophages and dendritic cells) via antibody-bound Fc receptor uptake of virus. Antibody-dependent enhancement (ADE) may be involved in the clinical observation of increased severity of symptoms associated with early high levels of SARS-CoV-2 antibodies in patients. Infants with multisystem inflammatory syndrome in children (MIS-C) associated with COVID-19 may also have ADE caused by maternally acquired SARS-CoV-2 antibodies bound to mast cells. ADE risks associated with SARS-CoV-2 have implications for COVID-19 and MIS-C treatments, B-cell vaccines, SARS-CoV-2 antibody therapy, and convalescent plasma therapy for patients. SARS-CoV-2 antibodies bound to mast cells may be involved in MIS-C and multisystem inflammatory syndrome in adults (MIS-A) following initial COVID-19 infection. SARS-CoV-2 antibodies bound to Fc receptors on macrophages and mast cells may represent two different mechanisms for ADE in patients. These two different ADE risks have possible implications for SARS-CoV-2 B-cell vaccines for subsets of populations based on age, cross-reactive antibodies, variabilities in antibody levels over time, and pregnancy. These models place increased emphasis on the importance of developing safe SARS-CoV-2 T cell vaccines that are not dependent upon antibodies