Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools
This paper introduces a high-throughput software tool framework called
sam2bam that enables users to significantly speed up pre-processing for
next-generation sequencing (NGS) data. sam2bam is especially efficient on
single-node, multi-core, large-memory systems. It can reduce the runtime of
duplicate-read marking during data pre-processing on a single-node system by
156-186x compared with de facto standard tools. sam2bam consists of parallel
software components that can fully utilize multiple processors, available
memory, high-bandwidth storage, and hardware compression accelerators, if
available.
As a basic feature, sam2bam converts between two well-known genome file
formats, from SAM to BAM. Additional features such as analyzing, filtering,
and converting the input data are provided by plug-in tools, e.g., duplicate
marking, which can be attached to sam2bam at runtime.
We demonstrated that sam2bam could significantly reduce the runtime of NGS
data pre-processing from about two hours to about one minute for a whole-exome
data set on a 16-core single-node system using up to 130 GB of memory. On the
same system, sam2bam reduced the runtime for whole-genome sequencing data from
about 20 hours to about nine minutes using up to 711 GB of memory.
Estimating Joint Probabilities without Combinatory Counting
Estimating joint probabilities plays an important role in many data mining and machine learning tasks. In this paper we introduce two methods, minAB and prodAB, to estimate joint probabilities. Both methods are based on a lightweight structure, partition support. The core idea is to maintain the partition support of itemsets over logically disjoint partitions and then use it to estimate the joint probabilities of itemsets of higher cardinalities. We present extensive mathematical analyses of both methods and compare their performance on synthetic datasets. We also present a case study that uses the estimation methods in the Apriori algorithm for fast association mining. Moreover, we explore the usefulness of the estimation methods in other mining and learning tasks. Experimental results show the effectiveness of the estimation methods.
Estimating joint probabilities from marginal ones
Estimating joint probabilities plays an important role in many data mining and machine learning tasks. In this paper we introduce two methods, minAB and prodAB, to estimate joint probabilities. Both methods are based on a lightweight structure, partition support. The core idea is to maintain the partition support of itemsets over logically disjoint partitions and then use it to estimate the joint probabilities of itemsets of higher cardinalities. We present extensive mathematical analyses of both methods and compare their performance on synthetic datasets. We also present a case study that uses the estimation methods in the Apriori algorithm for fast association mining. Moreover, we explore the usefulness of the estimation methods in other mining and learning tasks [9]. Experimental results show the effectiveness of the estimation methods.
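The abstract names the two estimators but does not give their formulas, so the following Python sketch is an illustration under stated assumptions, not the authors' definition. It assumes that minAB bounds the co-occurrence count in each partition by the smaller of the two per-partition supports, and that prodAB treats the two items as independent within each partition. The function names and the toy transactions are hypothetical.

```python
from typing import FrozenSet, List, Sequence

Transaction = FrozenSet[str]


def partition_support(partitions: Sequence[Sequence[Transaction]], item: str) -> List[int]:
    """Per-partition count of transactions that contain `item`."""
    return [sum(1 for t in part if item in t) for part in partitions]


def min_ab(sup_a: Sequence[int], sup_b: Sequence[int], total: int) -> float:
    """ASSUMED minAB: within each partition, A and B co-occur at most
    min(sup_a_i, sup_b_i) times; sum these bounds and normalize."""
    return sum(min(a, b) for a, b in zip(sup_a, sup_b)) / total


def prod_ab(sup_a: Sequence[int], sup_b: Sequence[int], sizes: Sequence[int]) -> float:
    """ASSUMED prodAB: treat A and B as independent inside each partition,
    so the expected co-occurrence count there is sup_a_i * sup_b_i / n_i."""
    total = sum(sizes)
    return sum(a * b / n for a, b, n in zip(sup_a, sup_b, sizes) if n) / total


# Toy data: two logically disjoint partitions of four transactions each.
partitions = [
    [frozenset("AB"), frozenset("A"), frozenset("B"), frozenset("C")],
    [frozenset("A"), frozenset("C"), frozenset("C"), frozenset("C")],
]
sizes = [len(p) for p in partitions]
sup_a = partition_support(partitions, "A")   # [2, 1]
sup_b = partition_support(partitions, "B")   # [2, 0]

est_min = min_ab(sup_a, sup_b, sum(sizes))   # 0.25
est_prod = prod_ab(sup_a, sup_b, sizes)      # 0.125
true_joint = sum(1 for p in partitions for t in p if {"A", "B"} <= t) / sum(sizes)
print(est_min, est_prod, true_joint)
```

On this toy data only single-item counts per partition are stored, yet both estimators produce a value for the pair {A, B} without counting the pair directly. The min-based estimate is an upper bound (0.25), while the product-based estimate happens to match the true joint frequency (0.125).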
Long-Term Changes and Factors That Influence Changes in Thermal Discharge from Nuclear Power Plants in Daya Bay, China
Thermal discharge (i.e., warm water) from nuclear power plants (NPPs) in Daya Bay, China, was analyzed in this study. To determine temporal and spatial patterns as well as factors affecting thermal discharge, data were acquired by the Landsat series of remote-sensing satellites for the period 1993–2020. First, sea surface temperature (SST) data for waters off NPPs were retrieved from Landsat imagery using a radiative transfer equation in conjunction with a split-window algorithm. Then, retrieved SST data were used to analyze seasonal and interannual changes in areas affected by NPP thermal discharge, as well as the effects of NPP installed capacity, tides, and wind field on the diffusion of thermal discharge. Analysis of interannual changes revealed an increase in SST with an increase in NPP installed capacity, with the area affected by increased drainage outlet temperature increasing to different degrees. Sea surface temperature and NPP installed capacity were significantly linearly related. Both flood tides (peak spring and neap) and ebb tides (peak spring and neap) affected areas of warming zones, with ebb tides having greater effects. The total area of all warming zones in summer was approximately twice that in spring, regardless of whether winds were favorable (i.e., westerly) or adverse (i.e., easterly). The effects of tides on areas of warming zones exceeded those of winds.
Vertical interconnects squeezing in symmetric 3D mesh Network-on-Chip
Three-dimensional (3D) integration and Network-on-Chip (NoC) have both been proposed to tackle on-chip interconnect scaling problems, and extensive research effort has been devoted to the design challenges of combining the two. Through-silicon via (TSV) is considered the most promising technology for 3D integration; however, TSV pads distributed across planar layers occupy significant chip area and cause routing congestion. In addition, the yield of 3D integrated circuits decreases dramatically as the number of TSVs increases. For symmetric 3D mesh NoCs, we observe that TSV utilization is low and that adjacent routers rarely transmit packets via their vertical channels (i.e., TSVs) at the same time. Based on this observation, we propose a novel TSV squeezing scheme that shares TSVs among neighboring routers in a time-division multiplexing mode, which greatly improves TSV utilization. Experimental results show that the proposed method can save significant TSV footprint with negligible performance overhead.
Potential Misidentification of Love-Wave Phase Velocity Based on Three-Component Ambient Seismic Noise
Architecture for sam2bam with analyzer plug-ins.
The alignment database is created when analyzer plug-ins are enabled. Binary alignments produced by SAM parsing are placed in either main memory or external storage so that they can later be used to generate compressed BAM files in the second half of the pipeline. The alignment database holds summarized information on the binary alignments, which the analyzer plug-ins use.