5 research outputs found

    Efficient clustering and document retrival by query keywords

    Get PDF
    User penchants are shown by a set of keywords. A central server monitors the document stream and continuously reports to each user the top-k documents that are most relevant to her keywords. Our unprejudiced is to backing large numbers of users and high stream rates, while energizing the top-k results almost instantly. Our clarification walks out on the customary frequency-ordered indexing approach. As an alternative, it trails an identifier-ordering paradigm that ensembles better the nature of the problem. When supplemented with a new, locally adaptive method, our method offers confirmed optimality the number of well-thought-out queries per stream event, and direction of extent shorter retort time than the contemporary state-of-the-art

    Building and Maintaining Halls of Fame over a Database

    Full text link
    Halls of Fame are fascinating constructs. They represent the elite of an often very large amount of entities---persons, companies, products, countries etc. Beyond their practical use as static rankings, changes to them are particularly interesting---for decision making processes, as input to common media or novel narrative science applications, or simply consumed by users. In this work, we aim at detecting events that can be characterized by changes to a Hall of Fame ranking in an automated way. We describe how the schema and data of a database can be used to generate Halls of Fame. In this database scenario, by Hall of Fame we refer to distinguished tuples; entities, whose characteristics set them apart from the majority. We define every Hall of Fame as one specific instance of an SQL query, such that a change in its result is considered a noteworthy event. Identified changes (i.e., events) are ranked using lexicographic tradeoffs over event and query properties and presented to users or fed in higher-level applications. We have implemented a full-fledged prototype system that uses either database triggers or a Java based middleware for event identification. We report on an experimental evaluation using a real-world dataset of basketball statistics

    Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs

    Get PDF
    My thesis would focus on analyzing the estimation of node similarity in streaming bipartite graph. As an important model in many applications of data mining, the bipartite graph represents the relationships between two sets of non-interconnected nodes, e.g. customers and the products/services they buy, users and the events/groups they get involved in, individuals and the diseases that they are subject to, etc. In most of these cases, data is naturally streaming over time. The node similarity in my thesis is mainly referred to neighborhood-based similarity, i.e., Common Neighbors (CN) measure. We analyze the distributional properties of CN in terms of the CN score, its dense ranks, in which equal weight objects receive the same rank and ranks are consecutive, and its fraction in full projection graph, which is also called similarity graph. We find that, in real-world dataset, the pairs of nodes with large value of CN only constitute a relatively quite small fraction. With this property, real-world streaming bipartite graph provide an opportunity for space saving by weighted sampling, which can preferentially select high weighted edges. Therefore, in this thesis, we propose a new one pass scheme for sampling the projection graphs of streaming bipartite graph in fixed storage and providing unbiased estimates of the CN similarity weights

    Continuous Top-k Queries over Real-Time Web Streams

    Get PDF
    The Web has become a large-scale real-time information system forcing us to revise both how to effectively assess relevance of information for a user and how to efficiently implement information retrieval and dissemination functionality. To increase information relevance, Real-time Web applications such as Twitter and Facebook, extend content and social-graph relevance scores with " real-time " user generated events (e.g. re-tweets, replies, likes). To accommodate high arrival rates of information items and user events we explore a pub-lish/subscribe paradigm in which we index queries and update on the fly their results each time a new item and relevant events arrive. In this setting, we need to process continuous top-k text queries combining both static and dynamic scores. To the best of our knowledge, this is the first work addressing how non-predictable, dynamic scores can be handled in a continuous top-k query setting

    Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs

    Get PDF
    My thesis would focus on analyzing the estimation of node similarity in streaming bipartite graph. As an important model in many applications of data mining, the bipartite graph represents the relationships between two sets of non-interconnected nodes, e.g. customers and the products/services they buy, users and the events/groups they get involved in, individuals and the diseases that they are subject to, etc. In most of these cases, data is naturally streaming over time. The node similarity in my thesis is mainly referred to neighborhood-based similarity, i.e., Common Neighbors (CN) measure. We analyze the distributional properties of CN in terms of the CN score, its dense ranks, in which equal weight objects receive the same rank and ranks are consecutive, and its fraction in full projection graph, which is also called similarity graph. We find that, in real-world dataset, the pairs of nodes with large value of CN only constitute a relatively quite small fraction. With this property, real-world streaming bipartite graph provide an opportunity for space saving by weighted sampling, which can preferentially select high weighted edges. Therefore, in this thesis, we propose a new one pass scheme for sampling the projection graphs of streaming bipartite graph in fixed storage and providing unbiased estimates of the CN similarity weights
    corecore