Better bitmap performance with Roaring bitmaps
Bitmap indexes are commonly used in databases and search engines. By
exploiting bit-level parallelism, they can significantly accelerate queries.
However, they can use much memory, and thus we might prefer compressed bitmap
indexes. Following Oracle's lead, bitmaps are often compressed using run-length
encoding (RLE). Building on prior work, we introduce the Roaring compressed
bitmap format: it uses packed arrays for compression instead of RLE. We compare
it to two high-performance RLE-based bitmap encoding techniques: WAH (Word
Aligned Hybrid compression scheme) and Concise (Compressed 'n' Composable
Integer Set). On synthetic and real data, we find that Roaring bitmaps (1)
often compress significantly better (e.g., 2 times) and (2) are faster than the
compressed alternatives (up to 900 times faster for intersections). Our results
challenge the view that RLE-based bitmap compression is best.
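To make the "packed arrays instead of RLE" idea concrete, here is a minimal Python sketch of a two-level layout in the spirit of Roaring. It is not the authors' implementation. The class name `TinyRoaring`, the 16-bit chunking, and the 4096-element threshold are assumptions chosen for illustration; the abstract states none of them.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # assumed threshold for switching from array to bitmap


class TinyRoaring:
    """Illustrative set of 32-bit integers, split by their high 16 bits."""

    def __init__(self):
        # high 16 bits -> sorted list of low 16 bits (sparse chunk)
        #              or a Python int used as a 65536-bit bitmap (dense chunk)
        self.chunks = {}

    def add(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        chunk = self.chunks.get(hi)
        if chunk is None:
            self.chunks[hi] = [lo]
        elif isinstance(chunk, list):
            i = bisect_left(chunk, lo)
            if i == len(chunk) or chunk[i] != lo:
                chunk.insert(i, lo)
                if len(chunk) > ARRAY_LIMIT:
                    # The chunk became dense: convert the array to a bitmap.
                    bits = 0
                    for v in chunk:
                        bits |= 1 << v
                    self.chunks[hi] = bits
        else:
            self.chunks[hi] = chunk | (1 << lo)

    def __contains__(self, x):
        hi, lo = x >> 16, x & 0xFFFF
        chunk = self.chunks.get(hi)
        if chunk is None:
            return False
        if isinstance(chunk, list):
            i = bisect_left(chunk, lo)
            return i < len(chunk) and chunk[i] == lo
        return (chunk >> lo) & 1 == 1
```

Sparse chunks stay small because they store only the 16-bit values that are present, while dense chunks fall back to a fixed-size bitmap. No run-length encoding is involved in either case.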
Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element
The running maximum-minimum (max-min) filter computes the maxima and minima
over running windows of size w. This filter has numerous applications in signal
processing and time series analysis. We present an easy-to-implement online
algorithm requiring no more than 3 comparisons per element, in the worst case.
Comparatively, no algorithm is known to compute the running maximum (or
minimum) filter in 1.5 comparisons per element, in the worst case. Our
algorithm has reduced latency and memory usage. Comment: to appear in Nordic Journal of Computin
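As an illustration of what a streaming max-min filter computes, the following Python sketch uses two monotonic deques of indices. This is the general monotonic-queue technique, offered under the assumption that it is close in spirit to the paper's approach; it is not claimed to reproduce the authors' exact algorithm or its bound of 3 comparisons per element. The function name `running_max_min` is hypothetical.

```python
from collections import deque


def running_max_min(values, w):
    """Return (max, min) for every full window of size w over a list."""
    maxq, minq = deque(), deque()  # indices; fronts hold the current max/min
    out = []
    for i, v in enumerate(values):
        # Drop candidates that can no longer be the window maximum.
        while maxq and values[maxq[-1]] <= v:
            maxq.pop()
        maxq.append(i)
        # Drop candidates that can no longer be the window minimum.
        while minq and values[minq[-1]] >= v:
            minq.pop()
        minq.append(i)
        # Expire indices that fell out of the window [i - w + 1, i].
        if maxq[0] <= i - w:
            maxq.popleft()
        if minq[0] <= i - w:
            minq.popleft()
        if i >= w - 1:
            out.append((values[maxq[0]], values[minq[0]]))
    return out
```

Each index is appended and removed at most once per deque, so the total work is linear in the input length regardless of w.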
A Better Alternative to Piecewise Linear Time Series Segmentation
Time series are difficult to monitor, summarize and predict. Segmentation
organizes time series into few intervals having uniform characteristics
(flatness, linearity, modality, monotonicity and so on). For scalability, we
require fast linear time algorithms. The popular piecewise linear model can
determine where the data goes up or down and at what rate. Unfortunately, when
the data does not follow a linear model, the computation of the local slope
creates overfitting. We propose an adaptive time series model where the
polynomial degree of each interval varies (constant, linear and so on). Given a
number of regressors, the cost of each interval is its polynomial degree:
constant intervals cost 1 regressor, linear intervals cost 2 regressors, and so
on. Our goal is to minimize the Euclidean (l_2) error for a given model
complexity. Experimentally, we investigate the model where intervals can be
either constant or linear. Over synthetic random walks, historical stock market
prices, and electrocardiograms, the adaptive model provides a more accurate
segmentation than the piecewise linear model without increasing the
cross-validation error or the running time, while providing a richer vocabulary
to applications. Implementation issues, such as numerical stability and
real-world performance, are discussed. Comment: to appear in SIAM Data Mining 200
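The trade-off between constant and linear intervals can be sketched in Python as follows. This only compares the squared error of the best constant fit and the best least-squares line on one interval, and counts regressors as the abstract describes (1 per constant interval, 2 per linear interval). It does not implement the authors' segmentation algorithm, and all function names are hypothetical.

```python
def constant_fit_error(ys):
    """Squared error of the best constant (the mean) on one interval."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def linear_fit_error(ys):
    """Squared error of the least-squares line, with x = 0, 1, ..., n-1."""
    n = len(ys)
    if n < 2:
        return 0.0
    mx = (n - 1) / 2
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in range(n))
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(ys))
    slope = sxy / sxx
    return sum((y - (my + slope * (x - mx))) ** 2 for x, y in enumerate(ys))


def regressor_cost(degrees):
    """Model complexity: a degree-d interval costs d + 1 regressors."""
    return sum(d + 1 for d in degrees)
```

On a flat interval both fits have zero error, so the constant model is preferable because it spends one regressor fewer; that budget can then be used to split the series elsewhere. The overall l_2 error of a segmentation is the square root of the sum of the per-interval squared errors.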
Is Advertising to Teenagers Ethical? Media’s Influence on Body Image and Behavior
An examination of the ethics involved in advertising to adolescents. Specifically, a content analysis and survey research were conducted on how television commercials and magazine advertisements targeted toward males ultimately affect female body image and behavior. The content analysis covered Axe Body Spray advertisements as well as Sports Illustrated: Swimsuit Edition. Findings of the survey research include increased body monitoring as a result of exposure to advertisements. Implications and future opportunities are discussed.
Scale And Translation Invariant Collaborative Filtering Systems
Collaborative filtering systems are prediction algorithms over sparse data sets of user preferences. We modify a wide range of state-of-the-art collaborative filtering systems to make them scale and translation invariant and generally improve their accuracy without increasing their computational cost. Using the EachMovie and the Jester data sets, we show that learning-free constant time scale and translation invariant schemes outperform other learning-free constant time schemes by at least 3% and perform as well as expensive memory-based schemes (within 4%). Over the Jester data set, we show that a scale and translation invariant Eigentaste algorithm outperforms Eigentaste 2.0 by 20%. These results suggest that scale and translation invariance is a desirable property.
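To illustrate what scale and translation invariance means for a predictor, here is a small Python sketch. It standardizes each user's ratings (subtract the mean, divide by the standard deviation), averages the standardized ratings of other users for the target item, and maps the result back to the target user's own scale. This is one simple way to obtain the property and is an assumption made for illustration; it is not claimed to be any of the schemes evaluated in the paper. The names `normalize` and `predict` are hypothetical.

```python
from statistics import mean, pstdev


def normalize(ratings):
    """Standardize a user's ratings; returns (z-scores, mean, deviation)."""
    m = mean(ratings.values())
    s = pstdev(ratings.values()) or 1.0  # avoid division by zero
    return {item: (r - m) / s for item, r in ratings.items()}, m, s


def predict(user, others, item):
    """Predict the user's rating of item from other users' ratings."""
    _, m, s = normalize(user)
    zs = [normalize(o)[0][item] for o in others if item in o]
    if not zs:
        return m  # nobody rated the item: fall back to the user's mean
    return m + s * mean(zs)
```

If a user rescales all ratings by a positive factor a and shifts them by b, the prediction for that user is rescaled and shifted the same way, and rescaling another user's ratings leaves the prediction unchanged.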
Membership?
For graduate students and young faculty in the field of Canadian history, membership in the CHA is a rite of passage, a token of their commitment to a chosen career and one of the important means of establishing professional ties within the wider academic community in Canada. For their colleagues in other areas of history this commitment is not so frequently made. Yet the advantages of CHA membership are many, advantages that are important to all historians working in Canada.
Extracting, Transforming and Archiving Scientific Data
It is becoming common to archive research datasets that are not only large
but also numerous. In addition, their corresponding metadata and the software
required to analyse or display them need to be archived. Yet the manual
curation of research data can be difficult and expensive, particularly in very
large digital repositories, hence the importance of models and tools for
automating digital curation tasks. The automation of these tasks faces three
major challenges: (1) research data and data sources are highly heterogeneous,
(2) future research needs are difficult to anticipate, (3) data is hard to
index. To address these problems, we propose the Extract, Transform and Archive
(ETA) model for managing and mechanizing the curation of research data.
Specifically, we propose a scalable strategy for addressing the research-data
problem, ranging from the extraction of legacy data to its long-term storage.
We review some existing solutions and propose novel avenues of research. Comment: 8 pages, Fourth Workshop on Very Large Digital Libraries, 201
Faster Base64 Encoding and Decoding Using AVX2 Instructions
Web developers use base64 formats to include images, fonts, sounds and other
resources directly inside HTML, JavaScript, JSON and XML files. We estimate
that billions of base64 messages are decoded every day. We are motivated to
improve the efficiency of base64 encoding and decoding. Compared to
state-of-the-art implementations, we multiply the speeds of both the encoding
(~10x) and the decoding (~7x). We achieve these good results by using the
single-instruction-multiple-data (SIMD) instructions available on recent Intel
processors (AVX2). Our accelerated software abides by the specification and
reports errors when encountering characters outside of the base64 set. It is
available online as free software under a liberal license. Comment: software at https://github.com/lemire/fastbase6
Strongly universal string hashing is fast
We present fast strongly universal string hashing families: they can process
data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that
these families---though they require a large buffer of random numbers---are
often faster than popular hash functions with weaker theoretical guarantees.
Moreover, conventional wisdom is that hash functions with fewer multiplications
are faster. Yet we find that they may fail to be faster due to operation
pipelining. We present experimental results on several processors including
low-powered processors. Our tests include hash functions designed for
processors with the Carry-Less Multiplication (CLMUL) instruction set. We also
prove, using accessible proofs, the strong universality of our families. Comment: Software is available at
http://code.google.com/p/variablelengthstringhashing/ and
https://github.com/lemire/StronglyUniversalStringHashin
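The "large buffer of random numbers" mentioned in this abstract can be illustrated with a multilinear hash over 32-bit characters and 64-bit random keys, which is a standard strongly universal construction for fixed-length strings. The Python sketch below assumes this construction for illustration; the abstract does not name the families it studies, and the function names `make_keys` and `multilinear_hash` are hypothetical.

```python
import random

MASK64 = (1 << 64) - 1


def make_keys(n, seed=None):
    """Draw n + 1 random 64-bit keys for strings of up to n characters."""
    rng = random.Random(seed)
    return [rng.getrandbits(64) for _ in range(n + 1)]


def multilinear_hash(keys, chars):
    """Hash a sequence of 32-bit characters to a 32-bit value.

    Computes (keys[0] + sum(keys[i + 1] * chars[i])) mod 2**64 and keeps
    the 32 most significant bits.
    """
    if len(chars) + 1 > len(keys):
        raise ValueError("not enough random keys for this string length")
    acc = keys[0]
    for k, c in zip(keys[1:], chars):
        acc = (acc + k * (c & 0xFFFFFFFF)) & MASK64
    return acc >> 32
```

The loop performs one multiplication and one addition per character, and the key buffer must be at least as long as the string. Variable-length strings need extra care (for example, appending a nonzero terminator character), which this sketch leaves out.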