Reference Based Genome Compression
DNA sequencing technology has advanced to a point where storage is becoming
the central bottleneck in the acquisition and mining of more data. Large
amounts of data are vital for genomics research, and generic compression tools,
while viable, cannot offer the same savings as approaches tuned to inherent
biological properties. We propose an algorithm to compress a target genome
given a known reference genome. The proposed algorithm first generates a
mapping from the reference to the target genome, and then compresses this
mapping with an entropy coder. As an illustration of the performance: applying
our algorithm to James Watson's genome with hg18 as a reference, we are able to
reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it
to 834.8 MB.
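The two-stage pipeline this abstract describes can be illustrated compactly. The following is a minimal sketch, not the authors' implementation: Python's difflib stands in for the reference-to-target mapping stage, zlib stands in for the entropy coder, and the opcode format and function names are invented for illustration.

    import difflib, zlib

    def compress_against_reference(reference: str, target: str) -> bytes:
        """Encode target as an edit script against reference, then squeeze
        the script with a generic coder (stand-in for an entropy coder).
        The ';' separator assumes the sequence alphabet is only A/C/G/T."""
        ops = []
        sm = difflib.SequenceMatcher(a=reference, b=target, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "equal":               # copy a block from the reference
                ops.append(f"C{i1},{i2 - i1}")
            else:                            # literal bases from the target
                ops.append(f"I{target[j1:j2]}")
        return zlib.compress(";".join(ops).encode(), 9)

    def decompress_with_reference(reference: str, blob: bytes) -> str:
        out = []
        for op in zlib.decompress(blob).decode().split(";"):
            if op.startswith("C"):
                start, length = map(int, op[1:].split(","))
                out.append(reference[start:start + length])
            else:
                out.append(op[1:])
        return "".join(out)

    ref = "ACGTACGTTTGACCA"
    tgt = "ACGTACGATTGACCA"                  # one substitution vs. ref
    blob = compress_against_reference(ref, tgt)
    assert decompress_with_reference(ref, blob) == tgt

The payoff of the construction is that for two genomes of the same species the edit script is tiny relative to the raw sequence, which is why the mapping-plus-entropy-coding route can beat generic compressors by orders of magnitude.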
Benchmarking database systems for Genomic Selection implementation
Motivation: With high-throughput genotyping systems now available, it has become feasible to fully integrate genotyping information into breeding programs. Using this information effectively requires DNA extraction and marker production facilities that can deploy the desired set of markers across samples with a turnaround time rapid enough to allow selection before crosses need to be made. In practice, breeders often have a short window in which to make decisions once they have collected their phenotyping data and received the corresponding genotyping data. This presents a challenge for organizing the information and using it in downstream analyses to support breeders' decisions. To implement genomic selection routinely as part of a breeding program, an efficient genotyping data storage system is needed. We selected and benchmarked six popular open-source data storage systems, including relational database management systems and columnar storage systems.
Results: We found that data extract times are strongly influenced by the orientation in which genotype data is stored in a system. HDF5 consistently performed best, in part because it can work efficiently with both orientations of the allele matrix.
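The orientation effect reported above is easy to see with HDF5 itself. Below is a minimal h5py sketch, not the paper's benchmark code; the file name, dataset names, and chunk shapes are illustrative choices showing how the same allele matrix can be stored in both orientations, each chunked to favor a different access pattern.

    import h5py
    import numpy as np

    markers, samples = 10_000, 500
    geno = np.random.randint(0, 3, size=(markers, samples), dtype=np.int8)

    with h5py.File("genotypes.h5", "w") as f:
        # Marker-major layout: fast for slicing all samples at a marker range.
        f.create_dataset("geno_by_marker", data=geno,
                         chunks=(1_000, samples), compression="gzip")
        # Sample-major layout: fast for pulling one sample across all markers.
        f.create_dataset("geno_by_sample", data=geno.T,
                         chunks=(1, markers), compression="gzip")

    with h5py.File("genotypes.h5", "r") as f:
        marker_block = f["geno_by_marker"][2_000:2_010, :]  # chunk-aligned read
        one_sample = f["geno_by_sample"][42, :]             # single-chunk read

Storing the matrix twice doubles disk usage, but it lets each query class read whole chunks sequentially instead of gathering scattered strides, which is one plausible reading of why HDF5 handled both orientations well.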
Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly
Motivation: Eugene Myers, in his string graph paper (Myers, 2005), suggested
that in a string graph, or equivalently a unitig graph, any path spells a valid
assembly. As a string/unitig graph also encodes every valid assembly of the
reads, such a graph, provided that it can be constructed correctly, is in fact
a lossless representation of the reads. In principle, every analysis based on
whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion
(INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context
of resequencing, we developed a de novo assembler, fermi, that assembles
Illumina short reads into unitigs while preserving most of the information in the
input reads. SNPs and INDELs can be called by mapping the unitigs against a
reference genome. By applying the method on 35-fold human resequencing data, we
showed that in comparison to the standard pipeline, our approach yields similar
accuracy for SNP calling and better results for INDEL calling. It has higher
sensitivity than other de novo assembly based methods for variant calling. Our
work suggests that variant calling with de novo assembly may be a beneficial
complement to the standard variant calling pipeline for whole-genome
resequencing. On the methodological side, we propose the FMD-index for
forward-backward extension of DNA sequences, a fast algorithm for finding all
super-maximal exact matches, and one-pass construction of unitigs from an
FMD-index.
Availability: http://github.com/lh3/fermi
Contact: [email protected]
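The central idea, calling variants by mapping unitigs against a reference and reading off the differences, can be shown in toy form. The sketch below is a conceptual illustration only, not fermi's method: it anchors a unitig in the reference by an exact match on its first k bases and reports mismatching columns as candidate SNPs, whereas the real pipeline performs full alignment and builds unitigs from the FMD-index.

    def candidate_snps(reference: str, unitig: str, k: int = 8):
        """Toy SNP caller: exact-anchor the unitig, then column-compare."""
        pos = reference.find(unitig[:k])      # naive seed; real tools align
        if pos < 0:
            return []
        window = reference[pos:pos + len(unitig)]
        return [(pos + i, r, u)               # (ref position, ref base, alt)
                for i, (r, u) in enumerate(zip(window, unitig))
                if r != u]

    ref = "GGATCGATCGATTACAGGCATCGATT"
    unitig = "CGATCGATTACTGGCATCG"            # one mismatch vs. the reference
    print(candidate_snps(ref, unitig))        # -> [(15, 'A', 'T')]

Because a unitig is assembled from many reads, a mismatch supported by a whole unitig carries more evidence than a mismatch in a single read, which is one intuition for the better INDEL results reported above.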
Indexing large genome collections on a PC
Motivation: The availability of thousands of individual genomes of one species
should enable rapid progress in personalized medicine and in understanding the
interaction between genotype and phenotype, to name just a few applications. A
key operation in such analyses is aligning sequencing reads against a
collection of genomes, which is costly with existing algorithms due to their
large memory requirements.
Results: We present MuGI, the Multiple Genome Index, which reports all
occurrences of a given pattern, under both exact and approximate matching,
across a collection of thousands of genomes. Its unique feature is a small
index size that fits in a standard computer with 16-32 GB, or even 8 GB, of
RAM for the 1000GP collection of 1092 diploid human genomes. The solution is
also fast. For example, exact matching queries are handled in an average time
of 39 s, and queries with up to 3 mismatches in 373 s, on the test PC with an
index size of 13.4 GB. For a smaller index, occupying 7.4 GB of memory, the
respective times grow to 76 s and 917 s.
Availability: Software and supplementary material:
http://sun.aei.polsl.pl/mugi
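This abstract does not spell out MuGI's index layout, but the seed-and-verify query logic behind exact and approximate matching can be sketched as follows. This is a toy over a single genome with a plain Python k-mer dictionary; MuGI's contribution is supporting such queries over a thousand-genome collection within a few gigabytes of RAM.

    from collections import defaultdict

    def build_kmer_index(genome: str, k: int = 12):
        index = defaultdict(list)
        for i in range(len(genome) - k + 1):
            index[genome[i:i + k]].append(i)
        return index

    def search(genome, index, pattern, k=12, max_mismatches=0):
        """Seed with the pattern's first k-mer, then verify each candidate.
        Caveat: a mismatch inside the seed itself is missed; real tools use
        multiple seeds to keep approximate matching exhaustive."""
        hits = []
        for pos in index.get(pattern[:k], []):
            window = genome[pos:pos + len(pattern)]
            if len(window) == len(pattern):
                mm = sum(a != b for a, b in zip(window, pattern))
                if mm <= max_mismatches:
                    hits.append((pos, mm))
        return hits

    genome = "ACGTACGTGACCATGACGT"
    idx = build_kmer_index(genome, k=5)
    print(search(genome, idx, "ACGTGACG", k=5, max_mismatches=1))  # [(4, 1)]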
Second-generation PLINK: rising to the challenge of larger and richer datasets
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide
association studies (GWAS) and research in population genetics. However, the
steady accumulation of data from imputation and whole-genome sequencing studies
has exposed a strong need for even faster and more scalable implementations of
key functions. In addition, GWAS and population-genetic data now frequently
contain probabilistic calls, phase information, and/or multiallelic variants,
none of which can be represented by PLINK 1's primary data format.
To address these issues, we are developing a second-generation codebase for
PLINK. The first major release from this codebase, PLINK 1.9, introduces
extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space
Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic
improvements. In combination, these changes accelerate most operations by 1-4
orders of magnitude, and allow the program to handle datasets too large to fit
in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data
format capable of efficiently representing probabilities, phase, and
multiallelic variants, and (b) extensions of many functions to account for the
new types of information.
The second-generation versions of PLINK will offer dramatic improvements in
performance and compatibility. For the first time, users without access to
high-end computing resources can perform several essential analyses of the
feature-rich and very large genetic datasets coming into use.
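The bit-level parallelism mentioned above refers to packing many 2-bit genotype codes into one machine word and replacing per-genotype loops with masks and popcounts. The Python toy below conveys the flavor under a simplified 0/1/2 allele-count coding; this is not PLINK's actual .bed encoding, and PLINK itself does the equivalent in C with vectorized popcount instructions.

    def pack(genotypes):
        # one genotype per 2-bit field: genotype i sits at bits 2i and 2i+1
        word = 0
        for i, g in enumerate(genotypes):
            word |= g << (2 * i)
        return word

    def alt_allele_count(word: int, n: int) -> int:
        mask = int("01" * n, 2)            # low bit of every 2-bit field
        lo = word & mask                   # fields coded 01 (one alt allele)
        hi = (word >> 1) & mask            # fields coded 10 (two alt alleles)
        # coding 0/1/2 is binary 00/01/10, so alt alleles = lo bits + 2*hi bits
        return bin(lo).count("1") + 2 * bin(hi).count("1")

    genos = [0, 1, 2, 2, 1, 0, 2]
    assert alt_allele_count(pack(genos), len(genos)) == sum(genos)

Two popcounts summarize every genotype held in the word at once, which is the kind of word-level saving that, combined with the algorithmic improvements, accounts for the reported speedups.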
Systematizing Genome Privacy Research: A Privacy-Enhancing Technologies Perspective
Rapid advances in human genomics are enabling researchers to gain a better
understanding of the role of the genome in our health and well-being,
stimulating hope for more effective and cost-efficient healthcare. However,
this also prompts a number of security and privacy concerns stemming from the
distinctive characteristics of genomic data. To address them, a new research
community has emerged and produced a large number of publications and
initiatives.
In this paper, we rely on a structured methodology to contextualize and
provide a critical analysis of the current knowledge on privacy-enhancing
technologies used for testing, storing, and sharing genomic data, using a
representative sample of the work published in the past decade. We identify and
discuss limitations, technical challenges, and issues faced by the community,
focusing in particular on those that are inherently tied to the nature of the
problem and are harder for the community alone to address. Finally, we report
on the importance and difficulty of the identified challenges based on an
online survey of genome data privacy experts.
REISCH: incorporating lightweight and reliable algorithms into healthcare applications of WSNs
Healthcare institutions require advanced technology to collect patients' data accurately and continuously. Traditional technologies still suffer from two problems: performance and security efficiency. Existing research has serious drawbacks when using public-key mechanisms such as digital signature algorithms. In this paper, we propose a Reliable and Efficient Integrity Scheme for Data Collection in HWSN (REISCH) to alleviate these problems by using secure and lightweight signature algorithms. The results of the performance analysis indicate that our scheme provides high efficiency in data integrity between sensors and the server (it keeps more than 24% more sensors alive than traditional algorithms do). Additionally, we use Automated Validation of Internet Security Protocols and Applications (AVISPA) to validate the security procedures in our scheme. The security analysis results confirm that REISCH is safe against several well-known attacks.
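The abstract names the design goal, lightweight integrity without costly public-key signatures, but not the exact primitives, so the following is a generic illustration rather than REISCH itself: sensor readings authenticated with a symmetric HMAC, which is far cheaper on constrained nodes than digital signatures. The message layout, field sizes, and key handling are assumptions made for the sketch.

    import hmac, hashlib, os

    KEY = os.urandom(32)  # pre-shared sensor/server key; provisioning not shown

    def tag_reading(sensor_id: str, seq: int, payload: bytes) -> bytes:
        # the sequence number is authenticated to block replay of old readings
        msg = sensor_id.encode() + seq.to_bytes(4, "big") + payload
        return hmac.new(KEY, msg, hashlib.sha256).digest()

    def verify(sensor_id: str, seq: int, payload: bytes, tag: bytes) -> bool:
        return hmac.compare_digest(tag_reading(sensor_id, seq, payload), tag)

    t = tag_reading("node-07", 1, b"\x48\x25")   # e.g. pulse and temperature
    assert verify("node-07", 1, b"\x48\x25", t)

Avoiding per-message public-key operations is also where energy savings that keep sensors alive longer would come from.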
Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes.
We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population-scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing, nor do they take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are observed increasingly often, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, finding that 3.2 Mbp with population support was lost in the transition from GRCh37 to GRCh38, while 13.7 Mbp with support was added. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers, and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes, and discover human T-lymphotropic virus 1 integrations in six samples, in a recognized epidemiological distribution.
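A population BWT scales the standard BWT/FM-index machinery up to a whole read collection. For readers new to that machinery, here is a self-contained toy: quadratic BWT construction by sorting rotations, and backward search for counting pattern occurrences. Production indexes replace both parts with compressed, incrementally built structures; nothing here is specific to the paper's implementation.

    def bwt(text: str) -> str:
        # quadratic construction by sorting rotations; '$' is the sentinel
        text += "$"
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def backward_search(b: str, pattern: str) -> int:
        # count occurrences of pattern via FM-index backward search
        smaller = {c: sum(1 for x in b if x < c) for c in set(b)}
        occ = lambda c, i: b[:i].count(c)   # real indexes answer this in O(1)
        lo, hi = 0, len(b)
        for c in reversed(pattern):
            if c not in smaller:
                return 0
            lo = smaller[c] + occ(c, lo)
            hi = smaller[c] + occ(c, hi)
            if lo >= hi:
                return 0
        return hi - lo

    print(backward_search(bwt("GATTACATTAC"), "TTAC"))   # -> 2

The property the population BWT exploits falls out of this structure: duplicate reads sort together, so runs in the transformed text grow as more identical sequences are added, and run-length compression improves with collection size.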
On optimally partitioning a text to improve its compression
In this paper we investigate the problem of partitioning an input string T in
such a way that compressing its parts individually via a base compressor C
yields a compressed output that is shorter than applying C over the entire T
at once. This problem was introduced in the context of table compression, and
then further elaborated and extended to strings and trees. Unfortunately, the
literature offers poor solutions: namely, we know either a cubic-time
algorithm for computing the optimal partition based on dynamic programming, or
a few heuristics that do not guarantee any bounds on the efficacy of their
computed partition, or algorithms that are efficient but work in some specific
scenarios (such as the Burrows-Wheeler Transform) and achieve compression
performance that might be worse than the optimal partitioning by a
non-constant factor. Therefore, computing the optimal solution efficiently is
still open. In this paper we provide the first algorithm which is guaranteed
to compute in O(n log_{1+ε} n) time a partition of T whose compressed output
is guaranteed to be no more than (1+ε)-worse than the optimal one, where ε may
be any positive constant.
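The cubic-time dynamic program cited above as the prior baseline is short enough to state directly. The sketch below is that baseline, not the paper's O(n log_{1+ε} n) algorithm, with zlib standing in for the base compressor C; dp[j] is the cheapest total compressed size achievable over all partitions of T[:j].

    import zlib

    def optimal_partition(text, compress=lambda b: zlib.compress(b, 9)):
        n = len(text)
        cost = lambda i, j: len(compress(text[i:j]))     # C on one part
        dp = [0] + [None] * n
        cut = [0] * (n + 1)
        for j in range(1, n + 1):                        # O(n^2) pairs, each
            dp[j], cut[j] = min((dp[i] + cost(i, j), i)  # paying a compression
                                for i in range(j))       # call: cubic overall
        parts, j = [], n
        while j > 0:                                     # walk cut points back
            parts.append(text[cut[j]:j])
            j = cut[j]
        return dp[n], parts[::-1]

    data = b"A" * 120 + bytes(range(64)) + b"B" * 120
    size, parts = optimal_partition(data)
    print(size, [len(p) for p in parts])

On inputs like this one, whose statistics shift abruptly, splitting lets each part be modeled on its own, which is exactly the effect the optimal partition is after.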