14,059 research outputs found
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
A 2D based Partition Strategy for Solving Ranking under Team Context (RTP)
In this paper, we propose a 2D based partition method for solving the problem
of Ranking under Team Context(RTC) on datasets without a priori. We first map
the data into 2D space using its minimum and maximum value among all
dimensions. Then we construct window queries with consideration of current team
context. Besides, during the query mapping procedure, we can pre-prune some
tuples which are not top ranked ones. This pre-classified step will defer
processing those tuples and can save cost while providing solutions for the
problem. Experiments show that our algorithm performs well especially on large
datasets with correctness
Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics
Continued reliance on human operators for managing data centers is a major
impediment for them from ever reaching extreme dimensions. Large computer
systems in general, and data centers in particular, will ultimately be managed
using predictive computational and executable models obtained through
data-science tools, and at that point, the intervention of humans will be
limited to setting high-level goals and policies rather than performing
low-level operations. Data-driven autonomics, where management and control are
based on holistic predictive models that are built and updated using live data,
opens one possible path towards limiting the role of operators in data centers.
In this paper, we present a data-science study of a public Google dataset
collected in a 12K-node cluster with the goal of building and evaluating
predictive models for node failures. Our results support the practicality of a
data-driven approach by showing the effectiveness of predictive models based on
data found in typical data center logs. We use BigQuery, the big data SQL
platform from the Google Cloud suite, to process massive amounts of data and
generate a rich feature set characterizing node state over time. We describe
how an ensemble classifier can be built out of many Random Forest classifiers
each trained on these features, to predict if nodes will fail in a future
24-hour window. Our evaluation reveals that if we limit false positive rates to
5%, we can achieve true positive rates between 27% and 88% with precision
varying between 50% and 72%.This level of performance allows us to recover
large fraction of jobs' executions (by redirecting them to other nodes when a
failure of the present node is predicted) that would otherwise have been wasted
due to failures. [...
Analysis and Forecasting of Trending Topics in Online Media Streams
Among the vast information available on the web, social media streams capture
what people currently pay attention to and how they feel about certain topics.
Awareness of such trending topics plays a crucial role in multimedia systems
such as trend aware recommendation and automatic vocabulary selection for video
concept detection systems.
Correctly utilizing trending topics requires a better understanding of their
various characteristics in different social media streams. To this end, we
present the first comprehensive study across three major online and social
media streams, Twitter, Google, and Wikipedia, covering thousands of trending
topics during an observation period of an entire year. Our results indicate
that depending on one's requirements one does not necessarily have to turn to
Twitter for information about current events and that some media streams
strongly emphasize content of specific categories. As our second key
contribution, we further present a novel approach for the challenging task of
forecasting the life cycle of trending topics in the very moment they emerge.
Our fully automated approach is based on a nearest neighbor forecasting
technique exploiting our assumption that semantically similar topics exhibit
similar behavior.
We demonstrate on a large-scale dataset of Wikipedia page view statistics
that forecasts by the proposed approach are about 9-48k views closer to the
actual viewing statistics compared to baseline methods and achieve a mean
average percentage error of 45-19% for time periods of up to 14 days.Comment: ACM Multimedia 201
- …