7,365 research outputs found
Scalable Probabilistic Similarity Ranking in Uncertain Databases (Technical Report)
This paper introduces a scalable approach for probabilistic top-k similarity
ranking on uncertain vector data. Each uncertain object is represented by a set
of vector instances that are assumed to be mutually-exclusive. The objective is
to rank the uncertain data according to their distance to a reference object.
We propose a framework that incrementally computes for each object instance and
ranking position, the probability of the object falling at that ranking
position. The resulting rank probability distribution can serve as input for
several state-of-the-art probabilistic ranking models. Existing approaches
compute this probability distribution by applying a dynamic programming
approach of quadratic complexity. In this paper we theoretically as well as
experimentally show that our framework reduces this to a linear-time complexity
while having the same memory requirements, facilitated by incremental accessing
of the uncertain vector instances in increasing order of their distance to the
reference object. Furthermore, we show how the output of our method can be used
to apply probabilistic top-k ranking for the objects, according to different
state-of-the-art definitions. We conduct an experimental evaluation on
synthetic and real data, which demonstrates the efficiency of our approach
Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning--entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we obtained so far have shown clear promise of entity-aware search, in its usefulness, effectiveness, efficiency and scalability
Oblivious Bounds on the Probability of Boolean Functions
This paper develops upper and lower bounds for the probability of Boolean
functions by treating multiple occurrences of variables as independent and
assigning them new individual probabilities. We call this approach dissociation
and give an exact characterization of optimal oblivious bounds, i.e. when the
new probabilities are chosen independent of the probabilities of all other
variables. Our motivation comes from the weighted model counting problem (or,
equivalently, the problem of computing the probability of a Boolean function),
which is #P-hard in general. By performing several dissociations, one can
transform a Boolean formula whose probability is difficult to compute, into one
whose probability is easy to compute, and which is guaranteed to provide an
upper or lower bound on the probability of the original formula by choosing
appropriate probabilities for the dissociated variables. Our new bounds shed
light on the connection between previous relaxation-based and model-based
approximations and unify them as concrete choices in a larger design space. We
also show how our theory allows a standard relational database management
system (DBMS) to both upper and lower bound hard probabilistic queries in
guaranteed polynomial time.Comment: 34 pages, 14 figures, supersedes: http://arxiv.org/abs/1105.281
- …