
    Massively-Parallel Break Detection for Satellite Data

    The field of remote sensing is nowadays faced with huge amounts of data. While this offers a variety of exciting research opportunities, it also yields significant challenges regarding both computation time and space requirements. In practice, the sheer data volumes render existing approaches too slow for processing and analyzing all the available data. This work aims at accelerating BFAST, one of the state-of-the-art methods for break detection given satellite image time series. In particular, we propose a massively-parallel implementation for BFAST that can effectively make use of modern parallel compute devices such as GPUs. Our experimental evaluation shows that the proposed GPU implementation is up to four orders of magnitude faster than the existing publicly available implementation and up to ten times faster than a corresponding multi-threaded CPU execution. The dramatic decrease in running time renders the analysis of significantly larger datasets possible in seconds or minutes instead of hours or days. We demonstrate the practical benefits of our implementations given both artificial and real datasets.
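
    As a rough illustration of the per-pixel computation such a monitoring approach parallelises (a hypothetical sketch, not the authors' implementation): fit a season-trend model on a stable history window, then flag later observations whose residuals exceed a noise threshold. The function names and the threshold rule below are assumptions; vectorising over pixels with NumPy is one way the same arithmetic could be pushed onto a GPU, for example by swapping in a drop-in GPU array library.

```python
# Hypothetical sketch of per-pixel, BFAST-style monitoring: fit a season-trend
# model on a stable history window, then flag later observations whose
# residuals exceed a threshold. Vectorised over pixels with NumPy.
import numpy as np

def harmonic_design(t, freq=365.25, order=1):
    """Design matrix: intercept, linear trend, and seasonal harmonics."""
    cols = [np.ones_like(t, dtype=float), t.astype(float)]
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / freq))
        cols.append(np.cos(2 * np.pi * k * t / freq))
    return np.column_stack(cols)

def detect_breaks(t, y, history_end, z=3.0):
    """t: (n,) days; y: (n, p) series for p pixels; returns break index per pixel, -1 = none."""
    hist = t < history_end
    X = harmonic_design(t)
    beta, *_ = np.linalg.lstsq(X[hist], y[hist], rcond=None)   # (k, p) coefficients
    resid_hist = y[hist] - X[hist] @ beta
    sigma = resid_hist.std(axis=0, ddof=X.shape[1])             # per-pixel noise level
    resid_mon = np.abs(y[~hist] - X[~hist] @ beta)              # monitoring-period residuals
    exceeds = resid_mon > z * sigma[None, :]
    return np.where(exceeds.any(axis=0), exceeds.argmax(axis=0), -1)
```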

    Estimating the Creation and Removal Date of Fracking Ponds Using Trend Analysis of Landsat Imagery

    Hydraulic fracturing, or fracking, is a process of introducing liquid at high pressure to create fractures in shale rock formations, thus releasing natural gas. Flowback and produced water from fracking operations are typically stored in temporary open-air earthen impoundments, or frack ponds. Unfortunately, in the United States there is no public record of the location of impoundments or of the dates that impoundments are created or removed. In this study we use a dataset of drilling-related impoundments in Pennsylvania identified through the FrackFinder project led by SkyTruth, an environmental non-profit. For each impoundment location, we compiled all low-cloud Landsat imagery from 2000 to 2016 and created a monthly time series for three bands: red, near-infrared (NIR), and the Normalized Difference Vegetation Index (NDVI). We identified the approximate dates of creation and removal of impoundments from sudden breaks in the time series. To verify our method, we compared the results to date ranges derived from photointerpretation of all available historical imagery on Google Earth for a subset of impoundments. Based on our analysis, we found that the number of impoundments built annually increased rapidly from 2006 to 2010 and then slowed from 2010 to 2013. Since newer impoundments tend to be larger, however, the total impoundment area has continued to increase. The methods described in this study would be appropriate for finding the creation and removal dates of a variety of industrial land use changes at known locations.
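
    A minimal sketch of the break-dating idea, assuming a simple one-break estimator rather than the exact method used in the study: pick the month at which the difference in mean NDVI before and after the split is largest. Function names and the minimum-segment length are hypothetical.

```python
# Hypothetical sketch: locate the most likely break month in a monthly NDVI
# series as the split point that maximises the difference in means between
# the segment before and the segment after (a crude one-break estimator).
import numpy as np

def best_break(ndvi, min_seg=6):
    """ndvi: 1-D monthly series (NaNs allowed); returns (break index, mean shift)."""
    y = np.asarray(ndvi, dtype=float)
    best_i, best_shift = -1, 0.0
    for i in range(min_seg, len(y) - min_seg):
        shift = abs(np.nanmean(y[i:]) - np.nanmean(y[:i]))
        if shift > best_shift:
            best_i, best_shift = i, shift
    return best_i, best_shift

# Pond construction typically appears as an abrupt NDVI drop (vegetation
# replaced by water or bare soil); removal and revegetation as a later rise.
```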

    A comprehensive evaluation of alignment algorithms in the context of RNA-seq.

    Transcriptome sequencing (RNA-Seq) overcomes limitations of previously used RNA quantification methods and provides a single experimental framework for both high-throughput characterization and quantification of transcripts at the nucleotide level. The first step, and a major challenge, in the analysis of such experiments is the mapping of sequencing reads to their transcriptomic origin, including the identification of splicing events. In recent years, a large number of such mapping algorithms have been developed, all of which have in common that they require algorithms for aligning a vast number of reads to genomic or transcriptomic sequences. Although the FM-index based aligner Bowtie has become a de facto standard within mapping pipelines, a much larger set of alignment algorithms is available, including other FM-index based aligners. Accordingly, developers and users of RNA-seq mapping pipelines have to choose among a large number of available alignment algorithms. To provide guidance in this choice, we evaluated the performance of 14 widely used alignment programs from three algorithmic classes: algorithms using hashing of the reference transcriptome, hashing of reads, or a compressed FM-index representation of the genome. Special emphasis was placed on precision and recall as well as on performance for different read lengths and numbers of mismatches and indels in a read. Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners, at precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, essentially making hash-based aligners obsolete.
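
    A hedged sketch of the bookkeeping such a benchmark relies on when reads are simulated and their true origins are known; the position tolerance and dictionary layout below are assumptions, not the paper's protocol.

```python
# Hypothetical sketch of precision/recall scoring for an aligner benchmark on
# simulated reads: a reported alignment counts as correct if it lands on the
# right chromosome within a small tolerance of the read's true position.
def precision_recall(truth, reported, tol=5):
    """truth, reported: {read_id: (chrom, pos)} dictionaries."""
    correct = sum(
        1 for rid, (c, p) in reported.items()
        if rid in truth and truth[rid][0] == c and abs(truth[rid][1] - p) <= tol
    )
    precision = correct / len(reported) if reported else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall
```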

    A Statistical Perspective on Algorithmic Leveraging

    One popular method for dealing with large-scale data sets is sampling. For example, by using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales rows/columns of data matrices to reduce the data size before performing computations on the subproblem. This method has been successful in improving computational efficiency of algorithms for matrix problems such as least-squares approximation, least absolute deviations approximation, and low-rank matrix approximation. Existing work has focused on algorithmic issues such as worst-case running times and numerical issues associated with providing high-quality implementations, but none of it addresses statistical aspects of this method. In this paper, we provide a simple yet effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model with a fixed number of predictors. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results, when compared with uniform sampling. Based on these theoretical results, we propose and analyze two new leveraging algorithms. A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance.
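
    A minimal sketch of leverage-based subsampling for least squares, assuming leverage scores computed from a thin QR factorisation; this illustrates the general recipe rather than the two new algorithms proposed in the paper, and all function names are illustrative.

```python
# Hypothetical sketch of leverage-score sampling for least squares: leverage
# scores are the squared row norms of an orthogonal basis for col(X); rows are
# sampled proportionally to them and rescaled by 1/sqrt(s*p_i) before solving
# the smaller problem.
import numpy as np

def leverage_lstsq(X, y, s, seed=0):
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(X)                    # thin QR: columns span col(X)
    lev = (Q ** 2).sum(axis=1)                # leverage scores, sum to rank(X)
    p = lev / lev.sum()                       # importance-sampling distribution
    idx = rng.choice(len(y), size=s, replace=True, p=p)
    w = 1.0 / np.sqrt(s * p[idx])             # rescaling weights
    beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return beta

# Uniform sampling is the same recipe with p = np.full(len(y), 1/len(y)); the
# paper's point is that neither choice dominates the other in bias/variance terms.
```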

    Compressed Spaced Suffix Arrays

    Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice.
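
    To make the object being compressed concrete, here is a toy sketch of a spaced suffix array next to a normal suffix array; the mask string and helper names are illustrative only and say nothing about the paper's compression scheme.

```python
# Hypothetical sketch of a spaced suffix array (SSA): suffixes are sorted by
# the characters that a seed mask selects, rather than by the full suffix as
# in a normal suffix array (SA).
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def spaced_key(text, i, mask):
    # Characters of text[i:] at the '1' (care) positions of the seed mask.
    return "".join(text[i + j] for j, m in enumerate(mask)
                   if m == "1" and i + j < len(text))

def spaced_suffix_array(text, mask):
    return sorted(range(len(text)), key=lambda i: spaced_key(text, i, mask))

sa  = suffix_array("ACGTACGT$")
ssa = spaced_suffix_array("ACGTACGT$", "1101")   # mask: care, care, don't-care, care
```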

    A One-Sample Test for Normality with Kernel Methods

    We propose a new one-sample test for normality in a Reproducing Kernel Hilbert Space (RKHS). Namely, we test the null hypothesis that the distribution belongs to a given family of Gaussian distributions. Hence our procedure may be applied either to test data for normality or to test parameters (mean and covariance) if the data are assumed Gaussian. Our test is based on the same principle as the MMD (Maximum Mean Discrepancy), which is usually used for two-sample tests such as homogeneity or independence testing. Our method makes use of a special kind of parametric bootstrap (typical of goodness-of-fit tests) which is computationally more efficient than the standard parametric bootstrap. Moreover, an upper bound for the Type II error highlights the dependence on influential quantities. Experiments illustrate the practical improvement allowed by our test in high-dimensional settings where common normality tests are known to fail. We also consider an application to covariance rank selection through a sequential procedure.
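
    A rough sketch of the MMD-with-parametric-bootstrap idea, assuming a plain (not the paper's more efficient) bootstrap, an RBF kernel with a fixed bandwidth, and multivariate data; all function names and defaults are hypothetical.

```python
# Hypothetical sketch of an MMD-style one-sample normality check: compare the
# data against draws from the fitted Gaussian with an RBF kernel, and
# calibrate the statistic with a plain parametric bootstrap.
import numpy as np

def rbf(a, b, gamma):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma):
    return rbf(x, x, gamma).mean() + rbf(y, y, gamma).mean() - 2 * rbf(x, y, gamma).mean()

def normality_test(x, n_boot=200, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    mu, cov = x.mean(0), np.cov(x, rowvar=False)
    stat = mmd2(x, rng.multivariate_normal(mu, cov, size=len(x)), gamma)
    boot = [mmd2(rng.multivariate_normal(mu, cov, size=len(x)),
                 rng.multivariate_normal(mu, cov, size=len(x)), gamma)
            for _ in range(n_boot)]
    return stat, np.mean(np.array(boot) >= stat)   # statistic and bootstrap p-value
```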

    Validation of remotely-sensed evapotranspiration and NDWI using ground measurements at Riverlands, South Africa

    Quantification of the water cycle components is key to managing water resources. Remote sensing techniques and products have recently been developed for the estimation of water balance variables. The objective of this study was to test the reliability of LandSAF (Land Surface Analyses Satellite Applications Facility) evapotranspiration (ET) and SPOT-Vegetation Normalised Difference Water Index (NDWI) by comparison with ground-based measurements. Evapotranspiration (both daily and 30 min) was successfully estimated with LandSAF products in a flat area dominated by fynbos vegetation (Riverlands, Western Cape) that was representative of the satellite image pixel at 3 km resolution. Correlation coefficients were 0.85 and 0.91, and linear regressions produced R2 of 0.72 and 0.75 for 30 min and daily ET, respectively. Ground measurements of soil water content taken with capacitance sensors at 3 depths were related to NDWI obtained from 10-daily maximum value composites of SPOT-Vegetation images at a resolution of 1 km. Multiple regression models showed that NDWI relates well to soil water content after accounting for precipitation (adjusted R2 values were 0.71, 0.59 and 0.54 for 10, 40 and 80 cm soil depth, respectively). Changes in NDWI trends in different land covers were detected in a 14-year time series using the breaks for additive seasonal and trend (BFAST) methodology. Appropriate usage, awareness of limitations and correct interpretation of remote sensing data can facilitate water management and planning operations.
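
    A minimal sketch of the kind of multiple regression reported here: soil water content at one depth regressed on NDWI and antecedent precipitation, with adjusted R2 as the goodness-of-fit measure. Variable names and the two-predictor design are assumptions, not the study's exact model.

```python
# Hypothetical sketch: ordinary least squares of soil water content (swc) on
# NDWI and antecedent precipitation, reporting adjusted R^2.
import numpy as np

def adjusted_r2(y, yhat, n_predictors):
    ss_res = ((y - yhat) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def fit_swc(ndwi, precip, swc):
    X = np.column_stack([np.ones_like(ndwi), ndwi, precip])  # intercept + 2 predictors
    beta, *_ = np.linalg.lstsq(X, swc, rcond=None)
    return beta, adjusted_r2(swc, X @ beta, n_predictors=2)
```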

    BFAST: An Alignment Tool for Large Scale Genome Resequencing

    BACKGROUND: The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation. METHODOLOGY: We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with an arbitrary number of independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. CONCLUSIONS: We compare BFAST to a selection of large-scale alignment tools -- BLAT, MAQ, SHRiMP, and SOAP -- in terms of both speed and accuracy, using simulated and real-world datasets. We show that BFAST can achieve substantially greater alignment sensitivity in the presence of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show that BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
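
    For the final local-alignment step, a textbook Smith-Waterman dynamic program with gaps looks roughly as follows; the scoring values are illustrative and this is a generic sketch, not BFAST's actual implementation.

```python
# Hypothetical sketch of Smith-Waterman local alignment with linear gap
# penalties: the kind of scoring applied to a read against a candidate
# reference region identified via a genome index.
def smith_waterman(read, ref, match=2, mismatch=-2, gap=-3):
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if read[i-1] == ref[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best   # best local alignment score of the read against this region

print(smith_waterman("ACGTTA", "TTACGATTAG"))
```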