Efficient clustering and document retrieval by query keywords
User preferences are indicated by a set of keywords. A central server monitors the document stream and continuously reports to each user the top-k documents that are most relevant to her keywords. Our objective is to support large numbers of users and high stream rates, while refreshing the top-k results almost instantly. Our solution abandons the traditional frequency-ordered indexing approach. Instead, it follows an identifier-ordering paradigm that better suits the nature of the problem. When complemented with a new, locally adaptive technique, our method offers proven optimality in the number of considered queries per stream event, and an order of magnitude shorter response time than the current state of the art.
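The monitoring setting described above can be illustrated with a minimal sketch. It is a naive baseline, not the identifier-ordering method of the abstract: it keeps one size-k min-heap per user and uses a simple keyword-to-users index to skip users who share no keyword with an incoming document. The class name `TopKMonitor`, the term-count relevance score, and the tie-breaking by document id are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


class TopKMonitor:
    """Naive continuous top-k monitor: one min-heap of size k per user."""

    def __init__(self, k):
        self.k = k
        self.queries = {}                      # user -> set of query keywords
        self.keyword_index = defaultdict(set)  # keyword -> users who registered it
        self.results = defaultdict(list)       # user -> min-heap of (score, doc_id)

    def register(self, user, keywords):
        self.queries[user] = set(keywords)
        for kw in keywords:
            self.keyword_index[kw].add(user)

    def process(self, doc_id, text):
        """Handle one stream event; return the users whose top-k changed."""
        terms = text.lower().split()
        # Only users sharing at least one keyword with the document are considered.
        candidates = set()
        for t in set(terms):
            candidates |= self.keyword_index.get(t, set())
        changed = []
        for user in sorted(candidates):
            # Assumed relevance score: number of term occurrences matching the query.
            score = sum(1 for t in terms if t in self.queries[user])
            heap = self.results[user]
            if len(heap) < self.k:
                heapq.heappush(heap, (score, doc_id))
                changed.append(user)
            elif (score, doc_id) > heap[0]:
                # The new document beats the weakest current result.
                heapq.heapreplace(heap, (score, doc_id))
                changed.append(user)
        return changed

    def top_k(self, user):
        """Current results for a user, best first."""
        return [doc for _, doc in sorted(self.results[user], reverse=True)]
```

In this sketch every candidate user is scored on every matching stream event. Reducing the number of queries considered per event is the cost that the abstract's method targets.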
Building and Maintaining Halls of Fame over a Database
Halls of Fame are fascinating constructs. They represent the elite of an
often very large number of entities---persons, companies, products, countries
etc. Beyond their practical use as static rankings, changes to them are
particularly interesting---for decision making processes, as input to common
media or novel narrative science applications, or simply consumed by users. In
this work, we aim at detecting events that can be characterized by changes to a
Hall of Fame ranking in an automated way. We describe how the schema and data
of a database can be used to generate Halls of Fame. In this database scenario,
by Hall of Fame we refer to distinguished tuples: entities whose
characteristics set them apart from the majority. We define every Hall of Fame
as one specific instance of an SQL query, such that a change in its result is
considered a noteworthy event. Identified changes (i.e., events) are ranked
using lexicographic tradeoffs over event and query properties and presented to
users or fed into higher-level applications. We have implemented a full-fledged
prototype system that uses either database triggers or a Java-based middleware
for event identification. We report on an experimental evaluation using a
real-world dataset of basketball statistics.
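The idea of a Hall of Fame as one instance of an SQL query whose result change counts as an event can be sketched as follows. This is an illustration only: it compares two snapshots of the query result in Python with the standard `sqlite3` module, rather than using database triggers or a Java-based middleware, and it does not rank the events. The `stats` table, its columns, the top-3 query, and the sample rows are hypothetical and not taken from the paper's dataset.

```python
import sqlite3

# Hypothetical Hall of Fame: the three players with the most total points.
HALL_OF_FAME_SQL = """
    SELECT player FROM stats
    GROUP BY player
    ORDER BY SUM(points) DESC, player ASC
    LIMIT 3
"""


def hall_of_fame(conn):
    """Evaluate the Hall of Fame query and return its members in rank order."""
    return [row[0] for row in conn.execute(HALL_OF_FAME_SQL)]


def detect_events(before, after):
    """Compare two snapshots of a Hall of Fame and list the changes."""
    events = []
    for player in after:
        if player not in before:
            events.append(("entered", player))
    for player in before:
        if player not in after:
            events.append(("left", player))
    for player in after:
        if player in before and after.index(player) < before.index(player):
            events.append(("moved_up", player))
    return events


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (player TEXT, game INTEGER, points INTEGER)")
conn.executemany(
    "INSERT INTO stats VALUES (?, ?, ?)",
    [("Ann", 1, 30), ("Bob", 1, 25), ("Cid", 1, 20), ("Dee", 1, 10)],
)
before = hall_of_fame(conn)

# A new game result arrives and may change the ranking.
conn.execute("INSERT INTO stats VALUES ('Dee', 2, 40)")
after = hall_of_fame(conn)

events = detect_events(before, after)
print(events)
```

Re-running the query after each update is the simplest possible design. A trigger-based variant would perform the same comparison inside the database whenever the underlying table changes.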
Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs
My thesis focuses on analyzing the estimation of node similarity in streaming bipartite
graphs. As an important model in many data mining applications, the bipartite
graph represents the relationships between two sets of non-interconnected nodes, e.g., customers
and the products/services they buy, users and the events/groups they get involved
in, individuals and the diseases that they are subject to, etc. In most of these cases, data
naturally streams in over time.
Node similarity in my thesis mainly refers to neighborhood-based similarity,
i.e., the Common Neighbors (CN) measure. We analyze the distributional properties of CN
in terms of the CN score, its dense ranks (in which equal-weight objects receive the same
rank and ranks are consecutive), and its fraction in the full projection graph, which is also
called the similarity graph. We find that, in real-world datasets, the pairs of nodes with a large
CN value constitute only a relatively small fraction. With this property, real-world
streaming bipartite graphs provide an opportunity for space saving by weighted sampling,
which can preferentially select high-weight edges.
Therefore, in this thesis, we propose a new one-pass scheme for sampling the projection
graphs of streaming bipartite graphs in fixed storage and providing unbiased estimates of
the CN similarity weights.
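The quantities named in this abstract, the CN score and its dense ranks, can be made concrete with a small sketch. It computes exact values over a fully stored edge list, so it shows what a sampling scheme would estimate, not the one-pass fixed-storage scheme itself. The function names and the toy edge list are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def common_neighbors(edges):
    """Exact CN counts for all pairs of left-side nodes.

    edges: iterable of (left_node, right_node) pairs, e.g. (customer, product).
    Returns a dict mapping each pair (u, v) with u < v to its number of
    shared right-side neighbors; pairs with no shared neighbor are omitted.
    """
    attached = defaultdict(set)  # right node -> left nodes linked to it
    for left, right in edges:
        attached[right].add(left)
    cn = defaultdict(int)
    # Each right node contributes 1 to every pair of left nodes it connects,
    # which is exactly an edge weight in the projection (similarity) graph.
    for lefts in attached.values():
        for u, v in combinations(sorted(lefts), 2):
            cn[(u, v)] += 1
    return dict(cn)


def dense_ranks(scores):
    """Dense ranking: equal scores share a rank and ranks are consecutive."""
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {score: i + 1 for i, score in enumerate(distinct)}
    return {pair: rank_of[score] for pair, score in scores.items()}


edges = [("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("c", "y"), ("c", "z")]
scores = common_neighbors(edges)
ranks = dense_ranks(scores)
```

Storing every right node's full neighbor set is what makes this exact version expensive on a stream, which is the storage cost that motivates sampling.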
Continuous Top-k Queries over Real-Time Web Streams
The Web has become a large-scale real-time information system, forcing us to revise both how to effectively assess the relevance of information for a user and how to efficiently implement information retrieval and dissemination functionality. To increase information relevance, real-time Web applications such as Twitter and Facebook extend content and social-graph relevance scores with "real-time" user-generated events (e.g., re-tweets, replies, likes). To accommodate high arrival rates of information items and user events, we explore a publish/subscribe paradigm in which we index queries and update their results on the fly each time a new item or relevant event arrives. In this setting, we need to process continuous top-k text queries combining both static and dynamic scores. To the best of our knowledge, this is the first work addressing how non-predictable, dynamic scores can be handled in a continuous top-k query setting.