Hierarchical Bin Buffering: Online Local Moments for Dynamic External Memory Arrays
Local moments are used for local regression, to compute statistical measures
such as sums, averages, and standard deviations, and to approximate probability
distributions. We consider the case where the data source is a very large I/O
array of size n and we want to compute the first N local moments, for some
constant N. Without precomputation, this requires O(n) time. We develop a
sequence of algorithms of increasing sophistication that use precomputation and
additional buffer space to speed up queries. The simpler algorithms partition
the I/O array into consecutive ranges called bins, and they are applicable not
only to local-moment queries, but also to algebraic queries (MAX, AVERAGE, SUM,
etc.). With N buffers of size √n, time complexity drops to O(√n). A
more sophisticated approach uses hierarchical buffering and has a logarithmic
time complexity (O(b log_b n)), when using N hierarchical buffers of size n/b.
Using Overlapped Bin Buffering, we show that only a single buffer is needed, as
with wavelet-based algorithms, but using much less storage. Applications exist
in multidimensional and statistical databases over massive data sets,
interactive image processing, and visualization.
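As a concrete illustration of the simpler (non-hierarchical) scheme, the sketch below buffers one SUM aggregate per bin of width about √n, so a range query touches at most two partial bins plus O(√n) buffered bins. The class and method names are ours, not the paper's.

```python
import math

class BinBufferedSum:
    """Bin-buffered SUM queries: O(sqrt(n)) per query after O(n) precomputation.
    A sketch of the simple bin-partitioning scheme, applied to SUM."""

    def __init__(self, data):
        self.data = list(data)
        n = len(self.data)
        self.b = max(1, math.isqrt(n))  # bin width ~ sqrt(n)
        # One buffered aggregate per bin.
        self.bins = [sum(self.data[k:k + self.b]) for k in range(0, n, self.b)]

    def range_sum(self, lo, hi):
        """Sum of data[lo:hi], touching O(sqrt(n)) array cells and bins."""
        total = 0
        # Partial bin at the front.
        while lo < hi and lo % self.b != 0:
            total += self.data[lo]
            lo += 1
        # Whole bins in the middle, read from the precomputed buffer.
        while hi - lo >= self.b:
            total += self.bins[lo // self.b]
            lo += self.b
        # Partial bin at the back.
        while lo < hi:
            total += self.data[lo]
            lo += 1
        return total
```

The same partitioning works for any algebraic aggregate (MAX, AVERAGE, and so on) by swapping the combine operation; the hierarchical O(b log_b n) variant replaces the flat bin array with a b-ary tree of buffers.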
When Random Sampling Preserves Privacy
Abstract. Many organizations such as the U.S. Census publicly release samples of data that they collect about private citizens. These datasets are first anonymized using various techniques and then a small sample is released so as to enable "do-it-yourself" calculations. This paper investigates the privacy of the second step of this process: sampling. We observe that rare values (values that occur with low frequency in the table) can be problematic from a privacy perspective. To our knowledge, this is the first work that quantitatively examines the relationship between the number of rare values in a table and the privacy in a released random sample. If we require ε-privacy (where the larger ε is, the worse the privacy guarantee) with probability at least 1 − δ, we say that a value is rare if it occurs in at most Õ(1/ε) rows of the table (ignoring log factors). If there are no rare values, then we establish a direct connection between the sample size that is safe to release and privacy. Specifically, if we select each row of the table with probability at most ε, then the sample is O(ε)-private with high probability. In the case that there are t rare values, the sample is Õ(εδ/t)-private with probability at least 1 − δ.
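The release step being analyzed is plain Bernoulli sampling: each row is kept independently with probability ε. A minimal sketch, with a helper for spotting rare values; the threshold parameter stands in for the paper's Õ(1/ε) bound, and the privacy analysis itself lives in the paper, not in this code.

```python
import random
from collections import Counter

def release_sample(table, eps, seed=0):
    """Bernoulli-sample each row independently with probability eps.
    This is the sampling step whose privacy the paper studies."""
    rng = random.Random(seed)
    return [row for row in table if rng.random() < eps]

def rare_values(table, threshold):
    """Values occurring in at most `threshold` rows; the paper takes
    threshold ~ O(1/eps) up to log factors, left to the caller here."""
    counts = Counter(table)
    return {v for v, c in counts.items() if c <= threshold}
```

Intuitively, a value appearing in only a handful of rows either survives sampling (and then strongly identifies its row) or vanishes; the paper quantifies how many such values, t, degrade the guarantee to Õ(εδ/t).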
X-Stream: Edge-centric Graph Processing using Streaming Partitions
X-Stream is a system for processing both in-memory and out-of-core graphs on a single shared-memory machine. While retaining the scatter-gather programming model with state stored in the vertices, X-Stream is novel in (i) using an edge-centric rather than a vertex-centric implementation of this model, and (ii) streaming completely unordered edge lists rather than performing random access. This design is motivated by the fact that sequential bandwidth for all storage media (main memory, SSD, and magnetic disk) is substantially larger than random-access bandwidth. We demonstrate that a large number of graph algorithms can be expressed using the edge-centric scatter-gather model. The resulting implementations scale well in terms of number of cores, in terms of number of I/O devices, and across different storage media. X-Stream competes favorably with existing systems for graph processing. Besides sequential access, we identify the fact that X-Stream does not need to sort edge lists during pre-processing as one of the main contributors to its better performance.
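The edge-centric idea can be sketched with connected components: each iteration streams the unordered edge list sequentially and updates vertex state, instead of walking each vertex's adjacency list. This is a toy single-threaded illustration of the model, not X-Stream's actual streaming-partition implementation.

```python
def edge_centric_components(num_vertices, edges, max_iters=100):
    """Connected components via edge-centric scatter-gather: every
    iteration is one sequential pass over an *unordered* edge stream.
    Edges are treated as undirected."""
    label = list(range(num_vertices))  # vertex state: smallest id seen so far
    for _ in range(max_iters):
        changed = False
        # Scatter and gather fused: read both endpoint states, propagate
        # the smaller label across the edge.
        for u, v in edges:
            if label[u] < label[v]:
                label[v] = label[u]
                changed = True
            elif label[v] < label[u]:
                label[u] = label[v]
                changed = True
        if not changed:
            break
    return label
```

Note that the edge list is never sorted or indexed; correctness only requires enough passes for labels to propagate, which is what makes purely sequential I/O over edges viable.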
Lavoisier: A Low Altitude Balloon Network for Probing the Deep Atmosphere and Surface of Venus
The in-situ exploration of the low atmosphere and surface of Venus is clearly the next step of Venus exploration. Understanding the geochemistry of the low atmosphere, interacting with rocks, and the way the integrated Venus system evolved, under the combined effects of inner-planet cooling and an intense atmospheric greenhouse, is a major challenge of modern planetology. Due to the dense atmosphere (95 bars at the surface), balloon platforms offer an interesting means to transport and land in-situ measurement instruments. Due to the large Archimedes (buoyancy) force, a 2 cubic meter He-pressurized balloon floating at 10 km altitude may carry up to 60 kg of payload. LAVOISIER is a project submitted to ESA in 2000, in the follow-up and spirit of the balloon deployed at cloud level by the Russian Vega mission in 1986. It is composed of a descent probe, for detailed noble gas and atmosphere composition analysis, and of a network of 3 balloons for geochemical and geophysical investigations at local, regional and global scales.
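A rough ideal-gas buoyancy check makes the 60 kg figure plausible. The pressure and temperature at 10 km used below are our own approximate assumptions (roughly 47 bar and 660 K for a CO2 atmosphere), not values from the mission proposal.

```python
# Back-of-envelope buoyancy for a 2 m^3 He balloon at ~10 km on Venus.
# All atmospheric inputs are assumed round numbers, not mission data.
R = 8.314                  # J/(mol*K), universal gas constant
P = 47e5                   # Pa, assumed pressure at 10 km altitude
T = 660.0                  # K, assumed temperature at 10 km altitude
M_CO2, M_HE = 0.044, 0.004 # kg/mol, molar masses of CO2 and He

rho_atm = P * M_CO2 / (R * T)  # CO2 density, ~38 kg/m^3
rho_he = P * M_HE / (R * T)    # He density at the same P, T, ~3.4 kg/m^3

volume = 2.0                   # m^3, balloon volume from the text
gross_lift = (rho_atm - rho_he) * volume
print(round(gross_lift, 1))    # ~68.5 kg gross lift under these assumptions
```

Subtracting several kilograms of envelope and gondola structure from roughly 68 kg of gross lift leaves a payload on the order of 60 kg, consistent with the abstract.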
Fading histograms in detecting distribution and concept changes
The remarkable number of real applications under
dynamic scenarios is driving a novel ability to generate and
gatherinformation.Nowadays,amassiveamountofinforma-
tion is generated at a high-speed rate, known as data streams.
Moreover, data are collected under evolving environments.
Due to memory restrictions, data must be promptly processed
and discarded immediately. Therefore, dealing with evolving
data streams raises two main questions: (i) how to remember
discarded data? and (ii) how to forget outdated data? To main-
tain an updated representation of the time-evolving data, this
paper proposes fading histograms. Regarding the dynamics
of nature, changes in data are detected through a windowing
scheme that compares data distributions computed by the
fading histograms: the adaptive cumulative windows model
(ACWM). The online monitoring of the distance between
data distributions is evaluated using a dissimilarity measure
based on the asymmetry of the KullbackâLeibler divergence.The experimental results support the ability of fading his-
tograms in providing an updated representation of data. Such
property works in favor of detecting distribution changes
with smaller detection delay time when compared with stan-
dard histograms. With respect to the detection of concept
changes, the ACWM is compared with 3 known algorithms
taken from the literature, using artificial data and using pub-
lic data sets, presenting better results. Furthermore, we the
proposed method was extended for multidimensional and the
experiments performed show the ability of the ACWM for
detecting distribution changes in these settings
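The core building blocks can be sketched as follows: a histogram whose bin counts decay geometrically on every update (so outdated data fades out), plus a KL-based dissimilarity between two histograms' distributions. The bin count, decay factor, and the symmetrised form of the divergence below are our illustrative choices; the paper's ACWM windowing scheme and its exact asymmetry-based measure are not reproduced here.

```python
import math

class FadingHistogram:
    """Equal-width histogram whose counts decay by `alpha` per update,
    so recent data dominates and old data gradually fades out."""

    def __init__(self, lo, hi, bins=10, alpha=0.99):
        self.lo, self.hi = lo, hi
        self.alpha = alpha                 # fading factor in (0, 1)
        self.counts = [0.0] * bins

    def update(self, x):
        # Fade every bin, then credit the bin containing x.
        self.counts = [c * self.alpha for c in self.counts]
        i = int((x - self.lo) / (self.hi - self.lo) * len(self.counts))
        i = min(len(self.counts) - 1, max(0, i))
        self.counts[i] += 1.0

    def pmf(self, smooth=1e-9):
        """Normalised (and lightly smoothed) distribution over the bins."""
        total = sum(self.counts) + smooth * len(self.counts)
        return [(c + smooth) / total for c in self.counts]

def symmetric_kl(p, q):
    """Symmetrised Kullback-Leibler divergence between two pmfs."""
    kl = lambda a, b: sum(x * math.log(x / y) for x, y in zip(a, b))
    return kl(p, q) + kl(q, p)
```

A change detector then maintains two such histograms over different windows of the stream and signals a distribution change when the divergence between their pmfs exceeds a threshold.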
- …