Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
This paper introduces a high-throughput software framework called sam2bam
that enables users to significantly speed up pre-processing of
next-generation sequencing (NGS) data. The framework is especially efficient on
single-node, multi-core, large-memory systems. It can reduce the runtime of data
pre-processing when marking duplicate reads on a single-node system by 156-186x
compared with de facto standard tools. The framework consists of parallel
software components that can fully utilize multiple processors, available
memory, high-bandwidth storage, and hardware compression accelerators, if
present.
The sam2bam framework provides file format conversion between well-known genome file
formats, from SAM to BAM, as a basic feature. Additional features such as
analyzing, filtering, and converting the input data are provided by
plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at
runtime.
We demonstrated that sam2bam could significantly reduce the runtime of NGS
data pre-processing from about two hours to about one minute for a whole-exome
data set on a 16-core single-node system using up to 130 GB of memory. It
could also reduce the runtime for whole-genome sequencing data from about 20
hours to about nine minutes on the same system using up to 711 GB of memory.
Evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts
RNA virus populations within samples are highly heterogeneous, containing a large number of minority sequence variants which can potentially be transmitted to other susceptible hosts. Consequently, consensus genome sequences provide an incomplete picture of the within- and between-host viral evolutionary dynamics during transmission. Foot-and-mouth disease virus (FMDV) is an RNA virus that can spread from primary sites of replication, via the systemic circulation, to found distinct sites of local infection at epithelial surfaces. Viral evolution in these different tissues occurs independently, each of them potentially providing a source of virus to seed subsequent transmission events. This study employed the Illumina Genome Analyzer platform to sequence 18 FMDV samples collected from a chain of sequentially infected cattle. These data generated snap-shots of the evolving viral population structures within different animals and tissues. Analyses of the mutation spectra revealed polymorphisms at frequencies >0.5% at between 21 and 146 sites across the genome for these samples, while 13 sites acquired mutations in excess of consensus frequency (50%). Analysis of polymorphism frequency revealed that a number of minority variants were transmitted during host-to-host infection events, while the size of the intra-host founder populations appeared to be smaller. These data indicate that viral population complexity is influenced by small intra-host bottlenecks and relatively large inter-host bottlenecks. The dynamics of minority variants are consistent with the actions of genetic drift rather than strong selection. These results provide novel insights into the evolution of FMDV that can be applied to reconstruct both intra- and inter-host transmission routes
Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data
There is a high prevalence of coronary artery disease (CAD) in patients with left bundle branch block (LBBB); however there are many other causes for this electrocardiographic abnormality. Non-invasive assessment of these patients remains difficult, and all commonly used modalities exhibit several drawbacks. This often leads to these patients undergoing invasive coronary angiography which may not have been necessary. In this review, we examine the uses and limitations of commonly performed non-invasive tests for diagnosis of CAD in patients with LBBB
Computer architecture for efficient algorithmic executions in real-time systems: New technology for avionics systems and advanced space vehicles
Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel systems to increase system performance. Research conducted in the development of a specialized computer architecture for the real-time algorithmic execution of an avionics system guidance and control problem is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, task definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.
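As a rough illustration of the critical path analysis mentioned in this abstract, the sketch below computes the longest path through a small task graph. It is not the allocation algorithm from the report: the function name, the task names, and the dictionary layout are assumptions made for illustration only.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest path through a task graph.

    durations: {task: execution time} (hypothetical layout)
    deps: {task: set of tasks that must finish first}
    """
    graph = {task: deps.get(task, ()) for task in durations}
    finish = {}  # earliest finish time of each task
    prev = {}    # predecessor that determines each task's start time
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start, pred = 0, None
        for p in deps.get(task, ()):
            if finish[p] > start:
                start, pred = finish[p], p
        finish[task] = start + durations[task]
        prev[task] = pred
    # Walk back from the task that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]
```

Tasks on the returned path cannot be delayed without delaying the whole computation, which is why an allocator would typically schedule them first.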
Genome-culture coevolution promotes rapid divergence of killer whale ecotypes.
Analysing population genomic data from killer whale ecotypes, which we estimate have globally radiated within less than 250,000 years, we show that genetic structuring including the segregation of potentially functional alleles is associated with socially inherited ecological niche. Reconstruction of ancestral demographic history revealed bottlenecks during founder events, likely promoting ecological divergence and genetic drift resulting in a wide range of genome-wide differentiation between pairs of allopatric and sympatric ecotypes. Functional enrichment analyses provided evidence for regional genomic divergence associated with habitat, dietary preferences and post-zygotic reproductive isolation. Our findings are consistent with expansion of small founder groups into novel niches by an initial plastic behavioural response, perpetuated by social learning imposing an altered natural selection regime. The study constitutes an important step towards an understanding of the complex interaction between demographic history, culture, ecological adaptation and evolution at the genomic level
Demography and the age of rare variants
Large whole-genome sequencing projects have provided access to much of the
rare variation in human populations, which is highly informative about
population structure and recent demography. Here, we show how the age of rare
variants can be estimated from patterns of haplotype sharing and how these ages
can be related to historical relationships between populations. We investigate
the distribution of the age of variants occurring exactly twice (f2 variants)
in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous
variation across populations. The median age of haplotypes carrying f2 variants
is 50 to 160 generations across populations within Europe or Asia, and 170 to
320 generations within Africa. Haplotypes shared between continents are much
older with median ages for haplotypes shared between Europe and Asia ranging
from 320 to 670 generations. The distribution of the ages of f2 haplotypes is
informative about their demography, revealing recent bottlenecks, ancient
splits, and more modern connections between populations. We see the signature
of selection in the observation that functional variants are significantly
younger than nonfunctional variants of the same frequency. This approach is
relatively insensitive to mutation rate and complements other nonparametric
methods for demographic inference.
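To make the definition of f2 variants concrete, the sketch below finds sites where exactly two haplotypes in a sample carry the variant allele. It covers only this identification step, not the age estimation from haplotype sharing; the 0/1 matrix layout and the function name are assumptions for illustration.

```python
def f2_variants(haplotypes):
    """Map each f2 site index to the pair of haplotypes that share it.

    haplotypes: list of equal-length lists of 0/1 alleles, one list per
    haplotype (hypothetical layout; 1 marks the variant allele).
    """
    shared = {}
    n_sites = len(haplotypes[0])
    for site in range(n_sites):
        carriers = [i for i, hap in enumerate(haplotypes) if hap[site] == 1]
        if len(carriers) == 2:  # variant occurs exactly twice in the sample
            shared[site] = tuple(carriers)
    return shared
```

Each returned pair identifies two haplotypes whose shared segment around the site could then be examined for its length.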
A tool of "barcoded viruses" to study influenza virus transmission dynamics
The aim of this study was to establish a novel version of powerful “barcode viruses” as a
tool for studying the replication and transmission dynamics of influenza virus in vitro
and in vivo.
Five barcoded APR8 viruses were first used to investigate infection kinetics (e.g.
single- and multi-hit events, particle clumping and temporal aspects of co-infection) in
vitro. This work demonstrated that the majority of infectious events in cell culture were
single-hit events, but a significant number of infections were initiated by more than one
virus particle (consistent with virus aggregation during release). Reassortment was
found to occur efficiently and ubiquitously when near-isogenic viruses co-infected cells.
The timing of asynchronous co-infection revealed that super-infection was possible if
the second virus encountered the cell within 4 hr of the first virus. The super-infecting
virus showed accelerated replication and enhanced yield, suggesting the second virus
can take advantage of the already initiated replication machinery. Beyond this time
point (coincident with the onset of progeny release from the first virus) the second virus
was blocked by the initial infecting viruses.
Five virus libraries carrying ~2000 individually identifiable variants were then
generated for in vivo study. Amplification of the viral libraries in Madin-Darby canine
kidney (MDCK) cells was achieved without substantial bottlenecking or preferential
selection of specific sequences. Thirdly, two pilot studies in pigs demonstrated that
intranasal inoculation resulted in substantial bottlenecking and a relatively small
proportion of the inoculum gave rise to productive infection. Consequently, distinct viral
populations were found in different nostrils and could persist over the course of the
infection due to anatomical partitioning. Distinct sub-populations could be distinguished
in other tissue sites (e.g. trachea and lung). Super-infection of individual pigs could
occur around 2 days following primary exposure. The identity of the donor pigs could be
determined by the barcode identities. In the first pilot study, around 600 variants were
seen in each donor pig directly inoculated with approximately 6000 variants of the
barcoded viruses. When a pig was co-housed with 3 donors, a typical transmission dose
of 73-151 variants was seen. In a further study of transmission between a single
donor and recipient, a transmission dose of 30-60 variants was observed at 2 days post
contact (d.p.c.) and 20-50 variants at 3 d.p.c.
To conclude, my PhD project has developed a powerful tool with a wide range of
applications in influenza biology, particularly for studying transmission dynamics in a
natural host system.
Ultraplex: A rapid, flexible, all-in-one fastq demultiplexer [version 1; peer review: 1 approved]
BACKGROUND: The first step of virtually all next-generation sequencing analysis involves the splitting of the raw sequencing data into separate files using sample-specific barcodes, a process known as “demultiplexing”. However, we found that existing software for this purpose was either too inflexible or too computationally intensive for fast, streamlined processing of raw, single-end FASTQ files containing combinatorial barcodes. RESULTS: Here, we introduce a fast and uniquely flexible demultiplexer, named Ultraplex, which splits a raw FASTQ file containing barcodes either at a single end or at both 5’ and 3’ ends of reads, trims the sequencing adaptors and low-quality bases, and moves unique molecular identifiers (UMIs) into the read header, allowing subsequent removal of PCR duplicates. Ultraplex is able to perform such single or combinatorial demultiplexing on both single- and paired-end sequencing data, and can process an entire Illumina HiSeq lane, consisting of nearly 500 million reads, in less than 20 minutes. CONCLUSIONS: Ultraplex greatly reduces computational burden and pipeline complexity for the demultiplexing of complex sequencing libraries, such as those produced by various CLIP and ribosome profiling protocols, and is also very user friendly, enabling streamlined, robust data processing. Ultraplex is available on PyPI and Conda and via GitHub.
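The demultiplexing idea described in this abstract can be sketched in a few lines. This is not Ultraplex's implementation or interface: it assumes a fixed 5' barcode followed directly by a fixed-length UMI, and the function name, the `UMI:` header tag, and the `no_match` bin are hypothetical.

```python
from collections import defaultdict


def demultiplex(records, barcodes, umi_len=0):
    """Split reads by 5' barcode and move the UMI into the read header.

    records: iterable of (header, sequence, quality) tuples
    barcodes: {barcode sequence: sample name} (hypothetical layout)
    umi_len: number of bases after the barcode treated as the UMI
    """
    bins = defaultdict(list)
    for header, seq, qual in records:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                start = len(barcode)
                cut = start + umi_len
                umi = seq[start:cut]
                # Keep the UMI in the header so PCR duplicates can be found later.
                new_header = f"{header} UMI:{umi}" if umi else header
                # Trim barcode and UMI from both the sequence and its qualities.
                bins[sample].append((new_header, seq[cut:], qual[cut:]))
                break
        else:
            # No barcode matched this read.
            bins["no_match"].append((header, seq, qual))
    return dict(bins)
```

A real demultiplexer would also tolerate barcode mismatches, handle 3' barcodes, and trim adaptors and low-quality bases, all of which this sketch omits.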
Supervised cross-modal factor analysis for multiple modal data classification
In this paper we study the problem of learning from multiple modal data for
the purpose of document classification. In this problem, each document is composed
of two different modalities of data, i.e., an image and a text. Cross-modal factor
analysis (CFA) has been proposed to project the two different modalities of data to
a shared data space, so that the classification of an image or a text can be
performed directly in this space. A disadvantage of CFA is that it ignores
the supervision information. In this paper, we improve CFA by incorporating the
supervision information to represent and classify both image and text modalities of
documents. We project both image and text data to a shared data space by factor
analysis, and then train a class label predictor in the shared space to use the
class label information. The factor analysis parameter and the predictor
parameter are learned jointly by solving one single objective function. With
this objective function, we minimize both the distance between the projections of the
image and text of the same document and the classification error of the
projection, measured by a hinge loss function. The objective function is optimized
by an alternating optimization strategy in an iterative algorithm. Experiments on
two different multiple modal document data sets show the advantage of the
proposed algorithm over other CFA methods.
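One plausible form of the joint objective described in this abstract is sketched below. The symbols are assumptions, not the paper's notation: $x_i$ and $t_i$ are the image and text features of document $i$, $U_x$ and $U_t$ the projection matrices, $w$ the predictor, $y_i \in \{-1, +1\}$ a binary class label, and $C$ a trade-off weight. The paper's actual formulation may add constraints or regularizers not shown here.

```latex
\min_{U_x,\, U_t,\, w} \;
\sum_{i=1}^{n} \left\| U_x^\top x_i - U_t^\top t_i \right\|_2^2
\;+\; C \sum_{i=1}^{n} \Big[
    \max\!\big(0,\; 1 - y_i\, w^\top U_x^\top x_i \big)
  + \max\!\big(0,\; 1 - y_i\, w^\top U_t^\top t_i \big)
\Big]
```

The first sum pulls the two projections of the same document together, and the second is the hinge loss on each projection. An alternating strategy would fix $w$ while updating $U_x$ and $U_t$, then fix the projections while updating $w$.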
Host-selected mutations converging on a global regulator drive an adaptive leap towards symbiosis in bacteria
Host immune and physical barriers protect against pathogens but also impede the establishment of essential symbiotic partnerships. To reveal mechanisms by which beneficial organisms adapt to circumvent host defenses, we experimentally evolved ecologically distinct bioluminescent Vibrio fischeri by colonization and growth within the light organs of the squid Euprymna scolopes. Serial squid passaging of bacteria produced eight distinct mutations in the binK sensor kinase gene, which conferred an exceptional selective advantage that could be demonstrated through both empirical and theoretical analysis. Squid-adaptive binK alleles promoted colonization and immune evasion that were mediated by cell-associated matrices including symbiotic polysaccharide (Syp) and cellulose. binK variation also altered quorum sensing, raising the threshold for luminescence induction. Preexisting coordinated regulation of symbiosis traits by BinK presented an efficient solution where altered BinK function was the key to unlock multiple colonization barriers. These results identify a genetic basis for microbial adaptability and underscore the importance of hosts as selective agents that shape emergent symbiont populations