Massively-Parallel Break Detection for Satellite Data
The field of remote sensing is nowadays faced with huge amounts of data.
While this offers a variety of exciting research opportunities, it also yields
significant challenges regarding both computation time and space requirements.
In practice, the sheer data volumes render existing approaches too slow for
processing and analyzing all the available data. This work aims at accelerating
BFAST, one of the state-of-the-art methods for break detection given satellite
image time series. In particular, we propose a massively-parallel
implementation for BFAST that can effectively make use of modern parallel
compute devices such as GPUs. Our experimental evaluation shows that the
proposed GPU implementation is up to four orders of magnitude faster than the
existing publicly available implementation and up to ten times faster than a
corresponding multi-threaded CPU execution. The dramatic decrease in running
time renders the analysis of significantly larger datasets possible in seconds
or minutes instead of hours or days. We demonstrate the practical benefits of
our implementations on both artificial and real datasets.
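The break-detection idea above can be illustrated with a toy sketch: scan every candidate breakpoint and keep the one that minimizes the residual error of two piecewise-constant segments. This is only a stand-in for the season-plus-trend models BFAST actually fits; the function and parameter names are illustrative.

```python
import numpy as np

def find_break(y, min_seg=5):
    """Return the index that best splits y into two constant-mean
    segments, by minimizing the total residual sum of squares.
    A toy stand-in for BFAST's season+trend model fits."""
    y = np.asarray(y, dtype=float)
    best_idx, best_rss = None, np.inf
    for k in range(min_seg, len(y) - min_seg):
        left, right = y[:k], y[k:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_idx, best_rss = k, rss
    return best_idx

# Synthetic series with an abrupt level shift at index 50
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 0.1, 50), rng.normal(1.0, 0.1, 50)])
print(find_break(y))  # close to 50
```

The exhaustive scan is quadratic in series length, which is exactly the kind of per-pixel cost that motivates the GPU parallelization described above.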
Estimating the Creation and Removal Date of Fracking Ponds Using Trend Analysis of Landsat Imagery
Hydraulic fracturing, or fracking, is a process of introducing liquid at high pressure to create fractures in shale rock formations, thus releasing natural gas. Flowback and produced water from fracking operations are typically stored in temporary open-air earthen impoundments, or frack ponds. Unfortunately, in the United States there is no public record of the location of impoundments, or the dates that impoundments are created or removed. In this study, we use a dataset of drilling-related impoundments in Pennsylvania identified through the FrackFinder project led by SkyTruth, an environmental non-profit. For each impoundment location, we compiled all low-cloud Landsat imagery from 2000 to 2016 and created a monthly time series for three bands: red, near-infrared (NIR), and the Normalized Difference Vegetation Index (NDVI). We identified the approximate date of creation and removal of impoundments from sudden breaks in the time series. To verify our method, we compared the results to date ranges derived from photointerpretation of all available historical imagery on Google Earth for a subset of impoundments. Based on our analysis, we found that the number of impoundments built annually increased rapidly from 2006 to 2010, and then slowed from 2010 to 2013. Since newer impoundments tend to be larger, however, the total impoundment area has continued to increase. The methods described in this study would be appropriate for finding the creation and removal dates of a variety of industrial land use changes at known locations.
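The NDVI band used in the monthly time series above is a simple normalized ratio of the red and near-infrared reflectances; a minimal sketch (the example reflectance values are illustrative):

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index from NIR and red reflectance.
    Values near +1 indicate dense vegetation; bare soil and open water sit
    near zero or below, which is why newly built ponds produce sudden
    breaks in the series."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Vegetated pixel vs. bare/water-like pixel
print(ndvi([0.5, 0.2], [0.1, 0.18]))  # roughly [0.67, 0.05]
```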
A comprehensive evaluation of alignment algorithms in the context of RNA-seq.
Transcriptome sequencing (RNA-seq) overcomes limitations of previously used RNA quantification methods and provides one experimental framework for both high-throughput characterization and quantification of transcripts at the nucleotide level. The first step and a major challenge in the analysis of such experiments is the mapping of sequencing reads to a transcriptomic origin, including the identification of splicing events. In recent years, a large number of such mapping algorithms have been developed, all of which have in common that they require algorithms for aligning a vast number of reads to genomic or transcriptomic sequences. Although the FM-index based aligner Bowtie has become a de facto standard within mapping pipelines, a much larger number of possible alignment algorithms have been developed, including other variants of FM-index based aligners. Accordingly, developers and users of RNA-seq mapping pipelines have the choice among a large number of available alignment algorithms. To provide guidance in the choice of alignment algorithms for these purposes, we evaluated the performance of 14 widely used alignment programs from three different algorithmic classes: algorithms using either hashing of the reference transcriptome, hashing of reads, or a compressed FM-index representation of the genome. Here, special emphasis was placed on both precision and recall and the performance for different read lengths and numbers of mismatches and indels in a read. Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners at a precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, thus essentially making hash-based aligners obsolete.
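The precision and recall emphasized in the evaluation above can be computed for simulated reads, where the true origin of each read is known. A toy sketch (the dictionary representation of alignments and the exact-position criterion are illustrative assumptions, not the paper's evaluation harness):

```python
def precision_recall(predicted, truth):
    """Precision/recall for read mapping: `predicted` and `truth` map
    read IDs to alignment positions; a prediction counts as correct
    only when it matches the simulated true origin."""
    tp = sum(1 for r, pos in predicted.items() if truth.get(r) == pos)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

truth     = {"r1": 100, "r2": 250, "r3": 300, "r4": 420}
predicted = {"r1": 100, "r2": 250, "r3": 305}   # r3 misplaced, r4 unmapped
print(precision_recall(predicted, truth))  # (0.666..., 0.5)
```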
A Statistical Perspective on Algorithmic Leveraging
One popular method for dealing with large-scale data sets is sampling. For
example, by using the empirical statistical leverage scores as an importance
sampling distribution, the method of algorithmic leveraging samples and
rescales rows/columns of data matrices to reduce the data size before
performing computations on the subproblem. This method has been successful in
improving computational efficiency of algorithms for matrix problems such as
least-squares approximation, least absolute deviations approximation, and
low-rank matrix approximation. Existing work has focused on algorithmic issues
such as worst-case running times and numerical issues associated with providing
high-quality implementations, but none of it addresses statistical aspects of
this method.
In this paper, we provide a simple yet effective framework to evaluate the
statistical properties of algorithmic leveraging in the context of estimating
parameters in a linear regression model with a fixed number of predictors. We
show that from the statistical perspective of bias and variance, neither
leverage-based sampling nor uniform sampling dominates the other. This result
is particularly striking, given the well-known result that, from the
algorithmic perspective of worst-case analysis, leverage-based sampling
provides uniformly superior worst-case algorithmic results, when compared with
uniform sampling. Based on these theoretical results, we propose and analyze
two new leveraging algorithms. A detailed empirical evaluation of existing
leverage-based methods as well as these two new methods is carried out on both
synthetic and real data sets. The empirical results indicate that our theory is
a good predictor of practical performance of existing and new leverage-based
algorithms, and that the new algorithms achieve improved performance.
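The sample-rescale-solve pattern of algorithmic leveraging for least squares can be sketched as follows. This is a minimal NumPy illustration; the helper name, sample size, and problem dimensions are our own choices, not the paper's experimental setup.

```python
import numpy as np

def leverage_sample_ls(X, y, r, rng=None):
    """Algorithmic leveraging for least squares: sample r rows with
    probability proportional to their leverage scores, rescale them,
    and solve the smaller subproblem."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    Q, _ = np.linalg.qr(X)          # thin QR; leverage scores are squared row norms of Q
    lev = (Q ** 2).sum(axis=1)
    p = lev / lev.sum()             # importance-sampling distribution
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])   # rescaling keeps the subproblem consistent
    Xs, ys = X[idx] * w[:, None], y[idx] * w
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true + rng.normal(scale=0.1, size=10_000)
beta_hat = leverage_sample_ls(X, y, r=500, rng=2)
print(np.round(beta_hat, 2))  # close to [1, 2, 3, 4, 5]
```

Replacing `p` with the uniform distribution `np.full(n, 1 / n)` gives the uniform-sampling baseline that the paper compares against.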
Compressed Spaced Suffix Arrays
Spaced seeds are important tools for similarity search in bioinformatics, and
using several seeds together often significantly improves their performance.
With existing approaches, however, for each seed we keep a separate linear-size
data structure, either a hash table or a spaced suffix array (SSA). In this
paper we show how to compress SSAs relative to normal suffix arrays (SAs) and
still support fast random access to them. We first prove a theoretical upper
bound on the space needed to store an SSA when we already have the SA. We then
present experiments indicating that our approach works even better in practice.
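A spaced seed can be pictured as a binary mask over seed positions: a hit only requires equality at the '1' positions, which is what an SSA indexes. A toy sketch (mask and sequences are illustrative):

```python
def spaced_keys(text, mask):
    """Extract the spaced-seed key at every position of `text`.
    A '1' in the mask is a match position; '0' is a don't-care.
    Two sequences share a seed hit wherever their keys collide."""
    on = [i for i, m in enumerate(mask) if m == "1"]
    span = len(mask)
    return ["".join(text[i + j] for j in on)
            for i in range(len(text) - span + 1)]

# The mismatch at the '0' (don't-care) position does not break the hit:
a = spaced_keys("ACGTACGT", "1101")
b = spaced_keys("ACCTACGT", "1101")
print(sorted(set(a) & set(b)))  # shared keys despite the G/C mismatch
```

A spaced suffix array sorts exactly these extracted keys, which is why it can be stored compactly relative to the normal suffix array over the same text.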
A One-Sample Test for Normality with Kernel Methods
We propose a new one-sample test for normality in a Reproducing Kernel
Hilbert Space (RKHS). Namely, we test the null hypothesis of belonging to a
given family of Gaussian distributions. Hence our procedure may be applied
either to test data for normality or to test parameters (mean and covariance)
if data are assumed Gaussian. Our test is based on the same principle as the
MMD (Maximum Mean Discrepancy) which is usually used for two-sample tests such
as homogeneity or independence testing. Our method makes use of a special kind
of parametric bootstrap (typical of goodness-of-fit tests) which is
computationally more efficient than standard parametric bootstrap. Moreover, an
upper bound for the Type-II error highlights the dependence on influential
quantities. Experiments illustrate the practical improvement allowed by our
test in high-dimensional settings where common normality tests are known to
fail. We also consider an application to covariance rank selection through a
sequential procedure.
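The MMD statistic underlying the test above can be sketched with a plain biased estimator and a Gaussian RBF kernel. The bandwidth and sample sizes here are arbitrary illustrations; the actual test calibrates the statistic against the fitted Gaussian family via the parametric bootstrap described above.

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased estimator of squared MMD with a Gaussian RBF kernel.
    Large values suggest the two samples come from different
    distributions."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
gauss = rng.normal(size=(300, 2))               # data that really is Gaussian
ref   = rng.normal(size=(300, 2))               # draws from the null family
heavy = rng.standard_t(df=1, size=(300, 2))     # clearly non-Gaussian data

print(mmd2(gauss, ref))   # small: same distribution
print(mmd2(heavy, ref))   # larger: distributions differ
```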
Validation of remotely-sensed evapotranspiration and NDWI using ground measurements at Riverlands, South Africa
Quantification of the water cycle components is key to managing water resources. Remote sensing techniques and products have recently been developed for the estimation of water balance variables. The objective of this study was to test the reliability of LandSAF (Land Surface Analyses Satellite Applications Facility) evapotranspiration (ET) and SPOT-Vegetation Normalised Difference Water Index (NDWI) by comparison with ground-based measurements. Evapotranspiration (both daily and 30 min) was successfully estimated with LandSAF products in a flat area dominated by fynbos vegetation (Riverlands, Western Cape) that was representative of the satellite image pixel at 3 km resolution. Correlation coefficients were 0.85 and 0.91, and linear regressions produced R2 of 0.72 and 0.75 for 30 min and daily ET, respectively. Ground measurements of soil water content taken with capacitance sensors at 3 depths were related to NDWI obtained from 10-daily maximum-value composites of SPOT-Vegetation images at a resolution of 1 km. Multiple regression models showed that NDWI relates well to soil water content after accounting for precipitation (adjusted R2 were 0.71, 0.59 and 0.54 for 10, 40 and 80 cm soil depth, respectively). Changes in NDWI trends in different land covers were detected in 14-year time series using the breaks for additive seasonal and trend (BFAST) methodology. Appropriate usage, awareness of limitations and correct interpretation of remote sensing data can facilitate water management and planning operations.
Authors: Nebo Jovanovic (Natural Resources and Environment, South Africa); César Luis García (Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Argentina; Universidad Católica de Córdoba); Richard D. H. Bugan (Natural Resources and Environment, South Africa); Ingrid Teich (CONICET, Argentina; Universidad Nacional de Córdoba, Facultad de Ciencias Agropecuarias, Área de Estadística y Biometría); Carlos Marcelo Garcia Rodriguez (CONICET, Argentina; Universidad Nacional de Córdoba)
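For reference, the NDWI used with SPOT-Vegetation data is commonly computed from the near-infrared and shortwave-infrared bands; a minimal sketch (the band values are illustrative, and the exact band pairing is an assumption about this study's product):

```python
import numpy as np

def ndwi(nir, swir, eps=1e-9):
    """Normalised Difference Water Index from NIR and SWIR reflectance.
    Higher values indicate more water in the vegetation/soil surface,
    which is why it tracks soil water content in the regressions above."""
    nir = np.asarray(nir, dtype=float)
    swir = np.asarray(swir, dtype=float)
    return (nir - swir) / (nir + swir + eps)

print(ndwi(0.4, 0.2))  # moist, vegetated surface -> positive index
```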
BFAST: An Alignment Tool for Large Scale Genome Resequencing
BACKGROUND: The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation. METHODOLOGY: We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired-end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole-genome indexes to rapidly map reads to candidate alignment locations, with arbitrarily many independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. CONCLUSIONS: We compare BFAST to a selection of large-scale alignment tools -- BLAT, MAQ, SHRiMP, and SOAP -- in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net.
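The final-step scoring mentioned above, Smith-Waterman local alignment with gaps, can be sketched in a few lines. The scoring parameters here are toy choices; production aligners use banded, vectorized, or hardware-accelerated variants.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Classic Smith-Waterman local alignment score with linear gap
    penalties: fill the DP matrix, clamping at zero so an alignment
    can start anywhere, and return the best cell."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

# A read with one small deletion still aligns well locally:
print(smith_waterman("ACGTTTGC", "ACGTTGC"))
```

The zero-clamping is what distinguishes local from global (Needleman-Wunsch) alignment, and the gap term is what lets BFAST-style pipelines detect small indels at candidate locations.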