4,637 research outputs found

    Approximative filtering of XML documents in a publish/subscribe system

    Get PDF
    Publish/subscribe systems filter published documents and inform their subscribers about documents matching their interests. Recent systems have focussed on documents or messages sent in XML format. Subscribers have to be familiar with the underlying XML format to create meaningful subscriptions. A service might support several providers with slightly differing formats, e.g., several publishers of books. This makes the definition of a successful subscription almost impossible. This paper proposes the use of an approximative language for subscriptions. We introduce the design of our ApproXFilter algorithm for approximative filtering in a publish/subscribe system. We present the results of our performance analysis of a prototypical implementation

    A survey on tree matching and XML retrieval

    Get PDF
    International audienceWith the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether in an implicit or explicit way. Although retrieving XML documents can be considered as a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed by the graph theory. In this paper, we aim at studying the theoretical approaches proposed in the literature for tree matching and at seeing how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study will allow us to highlight theoretical aspects of graph theory that have not been yet explored in XML retrieval

    When Things Matter: A Data-Centric View of the Internet of Things

    Full text link
    With the recent advances in radio-frequency identification (RFID), low-cost wireless sensor devices, and Web technologies, the Internet of Things (IoT) approach has gained momentum in connecting everyday objects to the Internet and facilitating machine-to-human and machine-to-machine communication with the physical world. While IoT offers the capability to connect and integrate both digital and physical entities, enabling a whole new class of applications and services, several significant challenges need to be addressed before these applications and services can be fully realized. A fundamental challenge centers around managing IoT data, typically produced in dynamic and volatile environments, which is not only extremely large in scale and volume, but also noisy, and continuous. This article surveys the main techniques and state-of-the-art research efforts in IoT from data-centric perspectives, including data stream processing, data storage models, complex event processing, and searching in IoT. Open research issues for IoT data management are also discussed

    Optimal algorithms for selecting top-k combinations of attributes : theory and applications

    Get PDF
    Traditional top-k algorithms, e.g., TA and NRA, have been successfully applied in many areas such as information retrieval, data mining and databases. They are designed to discover k objects, e.g., top-k restaurants, with highest overall scores aggregated from different attributes, e.g., price and location. However, new emerging applications like query recommendation require providing the best combinations of attributes, instead of objects. The straightforward extension based on the existing top-k algorithms is prohibitively expensive to answer top-k combinations because they need to enumerate all the possible combinations, which is exponential to the number of attributes. In this article, we formalize a novel type of top-k query, called top-k, m, which aims to find top-k combinations of attributes based on the overall scores of the top-m objects within each combination, where m is the number of objects forming a combination. We propose a family of efficient top-k, m algorithms with different data access methods, i.e., sorted accesses and random accesses and different query certainties, i.e., exact query processing and approximate query processing. Theoretically, we prove that our algorithms are instance optimal and analyze the bound of the depth of accesses. We further develop optimizations for efficient query evaluation to reduce the computational and the memory costs and the number of accesses. We provide a case study on the real applications of top-k, m queries for an online biomedical search engine. Finally, we perform comprehensive experiments to demonstrate the scalability and efficiency of top-k, m algorithms on multiple real-life datasets.Peer reviewe

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Searching digital music libraries

    Get PDF
    There has been a recent explosion of interest in digital music libraries. In particular, interactive melody retrieval is a striking example of a search paradigm that differs radically from the standard full-text search. Many different techniques have been proposed for melody matching, but the area lacks standard databases that allow them to be compared on common grounds––and copyright issues have stymied attempts to develop such a corpus. This paper focuses on methods for evaluating different symbolic music matching strategies, and describes a series of experiments that compare and contrast results obtained using three dominant paradigms. Combining two of these paradigms yields a hybrid approach which is shown to have the best overall combination of efficiency and effectiveness

    Toward Entity-Aware Search

    Get PDF
    As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning--entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we obtained so far have shown clear promise of entity-aware search, in its usefulness, effectiveness, efficiency and scalability
    corecore