29 research outputs found

    Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools

    Full text link
    This paper introduces sam2bam, a high-throughput software framework that lets users significantly speed up pre-processing of next-generation sequencing (NGS) data. sam2bam is especially efficient on single-node, multi-core, large-memory systems. It can reduce the runtime of duplicate-read marking on a single-node system by 156-186x compared with de facto standard tools. sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and hardware compression accelerators where available. As a basic feature, sam2bam converts between well-known genome file formats, from SAM to BAM. Additional features such as analyzing, filtering, and converting the input data are provided by plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at runtime. We demonstrated that sam2bam could reduce the runtime of NGS data pre-processing from about two hours to about one minute for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. For whole-genome sequencing data, sam2bam could reduce the runtime from about 20 hours to about nine minutes on the same system using up to 711 GB of memory.

    Estimating Joint Probabilities without Combinatory Counting

    No full text
    Estimating joint probabilities plays an important role in many data mining and machine learning tasks. In this paper we introduce two methods, minAB and prodAB, to estimate joint probabilities. Both methods are based on a lightweight structure, partition support. The core idea is to maintain the partition support of itemsets over logically disjoint partitions and then use it to estimate joint probabilities of itemsets of higher cardinalities. We present extensive mathematical analyses of both methods and compare their performance on synthetic datasets. We also present a case study that uses the estimation methods in the Apriori algorithm for fast association mining. Moreover, we explore the usefulness of the estimation methods in other mining/learning tasks. Experimental results show the effectiveness of the estimation methods.
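    The abstract does not define minAB or prodAB, so the following is only a minimal sketch of one plausible reading of the names. It assumes that "partition support" means the per-partition occurrence counts of each itemset, that minAB sums the per-partition minimum of the two counts, and that prodAB assumes independence within each partition. The function names, argument layout, and formulas are assumptions for illustration, not the authors' definitions.

```python
from typing import Sequence


def min_ab(supp_a: Sequence[int], supp_b: Sequence[int], total: int) -> float:
    """Assumed minAB: within partition i, A and B can co-occur at most
    min(a_i, b_i) times, so sum those bounds and divide by the total
    number of transactions."""
    return sum(min(a, b) for a, b in zip(supp_a, supp_b)) / total


def prod_ab(supp_a: Sequence[int], supp_b: Sequence[int],
            sizes: Sequence[int]) -> float:
    """Assumed prodAB: treat A and B as independent inside each partition,
    giving a_i * b_i / n_i expected co-occurrences in partition i."""
    total = sum(sizes)
    expected = sum(a * b / n for a, b, n in zip(supp_a, supp_b, sizes) if n > 0)
    return expected / total


# Hypothetical data: two partitions of 100 transactions each.
sizes = [100, 100]
supp_a = [40, 10]   # occurrences of itemset A per partition
supp_b = [20, 30]   # occurrences of itemset B per partition

est_min = min_ab(supp_a, supp_b, sum(sizes))
est_prod = prod_ab(supp_a, supp_b, sizes)
print(est_min, est_prod)
```

    Under these assumptions the per-partition counts give tighter estimates than the global marginals alone: the global minimum bound here would be 50/200, while the partitioned minimum sums to 30/200.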

    Estimating joint probabilities from marginal ones

    No full text
    Estimating joint probabilities plays an important role in many data mining and machine learning tasks. In this paper we introduce two methods, minAB and prodAB, to estimate joint probabilities. Both methods are based on a lightweight structure, partition support. The core idea is to maintain the partition support of itemsets over logically disjoint partitions and then use it to estimate joint probabilities of itemsets of higher cardinalities. We present extensive mathematical analyses of both methods and compare their performance on synthetic datasets. We also present a case study that uses the estimation methods in the Apriori algorithm for fast association mining. Moreover, we explore the usefulness of the estimation methods in other mining/learning tasks [9]. Experimental results show the effectiveness of the estimation methods.

    Long-Term Changes and Factors That Influence Changes in Thermal Discharge from Nuclear Power Plants in Daya Bay, China

    No full text
    Thermal discharge (i.e., warm water) from nuclear power plants (NPPs) in Daya Bay, China, was analyzed in this study. To determine temporal and spatial patterns as well as factors affecting thermal discharge, data were acquired by the Landsat series of remote-sensing satellites for the period 1993–2020. First, sea surface temperature (SST) data for waters off NPPs were retrieved from Landsat imagery using a radiative transfer equation in conjunction with a split-window algorithm. Then, retrieved SST data were used to analyze seasonal and interannual changes in areas affected by NPP thermal discharge, as well as the effects of NPP installed capacity, tides, and wind field on the diffusion of thermal discharge. Analysis of interannual changes revealed an increase in SST with an increase in NPP installed capacity, with the area affected by increased drainage outlet temperature increasing to different degrees. Sea surface temperature and NPP installed capacity were significantly linearly related. Both flood tides (peak spring and neap) and ebb tides (peak spring and neap) affected areas of warming zones, with ebb tides having greater effects. The total area of all warming zones in summer was approximately twice that in spring, regardless of whether winds were favorable (i.e., westerly) or adverse (i.e., easterly). The effects of tides on areas of warming zones exceeded those of winds.

    Vertical interconnects squeezing in symmetric 3D mesh Network-on-Chip

    No full text
    Three-dimensional (3D) integration and Network-on-Chip (NoC) have both been proposed to tackle on-chip interconnect scaling problems, and extensive research effort has been devoted to the design challenges of combining the two. Through-silicon via (TSV) is considered the most promising technology for 3D integration; however, TSV pads distributed across planar layers occupy significant chip area and cause routing congestion. In addition, the yield of 3D integrated circuits decreases dramatically as the number of TSVs increases. For symmetric 3D mesh NoC, we observe that TSV utilization is quite low and that adjacent routers rarely transmit packets via their vertical channels (i.e., TSVs) at the same time. Based on this observation, we propose a novel TSV squeezing scheme that shares TSVs among neighboring routers in a time-division multiplexing mode, which greatly improves TSV utilization. Experimental results show that the proposed method can save significant TSV footprint with negligible performance overhead.
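    The sharing idea in the abstract above can be illustrated with a toy arbiter. This is a minimal sketch, assuming two neighboring routers share one vertical channel and that simultaneous requests are resolved by alternating grants. The function name `share_tsv`, the request encoding, and the round-robin policy are hypothetical and are not taken from the paper, which may use a different slot assignment.

```python
from typing import List, Optional, Sequence


def share_tsv(requests_a: Sequence[bool],
              requests_b: Sequence[bool]) -> List[Optional[str]]:
    """Grant one shared vertical channel per cycle to router 'A' or 'B'.

    If only one router requests the channel, it is granted immediately.
    If both request in the same cycle, the router that was NOT granted
    most recently wins (round-robin), and the other must retry later.
    """
    grants: List[Optional[str]] = []
    last = "B"  # initial state chosen so that A wins the first conflict
    for a, b in zip(requests_a, requests_b):
        if a and b:
            winner: Optional[str] = "A" if last == "B" else "B"
        elif a:
            winner = "A"
        elif b:
            winner = "B"
        else:
            winner = None  # channel idle this cycle
        if winner is not None:
            last = winner
        grants.append(winner)
    return grants


# Hypothetical per-cycle request traces for two adjacent routers.
req_a = [True, False, True, True, False]
req_b = [False, False, True, True, True]
grants = share_tsv(req_a, req_b)
print(grants)
```

    When the two routers rarely request the channel in the same cycle, as the abstract observes, conflicts are infrequent and the shared channel costs little extra latency.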

    Retention-Aware DRAM Assembly and Repair for Future FGR Memories

    No full text

    Architecture for sam2bam with analyzer plug-ins.

    No full text
    An alignment database is created when analyzer plug-ins are enabled. Binary alignments produced by SAM parsing are placed in either main memory or external storage so that the second half of the pipeline can later use them to generate compressed BAM files. The alignment database holds summarized information on the binary alignments, which the analyzer plug-ins use.