12 research outputs found

    Sixth Biennial Report : August 2001 - May 2003

    No full text

    Seventh Biennial Report : June 2003 - March 2005

    No full text

    Eighth Biennial Report : April 2005 - March 2007

    No full text

    Advanced methods for query routing in peer-to-peer information retrieval

    Get PDF
    One of the most challenging problems in peer-to-peer networks is query routing: effectively and efficiently identifying peers that can return high-quality local results for a given query. Existing methods from the areas of distributed information retrieval and metasearch engines do not adequately address the peculiarities of a peer-to-peer network. The main contributions of this thesis are as follows: 1. Methods for query routing that take into account the mutual overlap of different peers' collections, 2. Methods for query routing that take into account the correlations between multiple terms, 3. Comparative evaluation of different query routing methods. Our experiments confirm the superiority of our novel query routing methods over the prior state-of-the-art, in particular in the context of peer-to-peer Web search.
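    The abstract does not spell out the routing algorithms, so the following is only a minimal sketch of the general overlap-aware idea: rank peers not just by expected result quality but by the novelty their collection adds over peers already chosen. All names, scores, and the overlap estimates below are illustrative assumptions (in practice such estimates might come from per-peer statistics or Bloom-filter sketches).

    # Minimal sketch of overlap-aware peer selection for P2P query routing.
    # Peers are ranked by expected quality discounted by the estimated
    # collection overlap with peers already selected for this query.

    def route_query(peers, quality, overlap, k):
        """Greedily pick k peers, discounting each candidate's quality
        by its estimated collection overlap with already-chosen peers."""
        selected = []
        candidates = set(peers)
        while candidates and len(selected) < k:
            def marginal_gain(p):
                redundancy = max((overlap[p][s] for s in selected), default=0.0)
                return quality[p] * (1.0 - redundancy)
            best = max(candidates, key=marginal_gain)
            selected.append(best)
            candidates.remove(best)
        return selected

    # Example: peer B is high quality but nearly a copy of A, so C is chosen.
    quality = {"A": 0.9, "B": 0.85, "C": 0.6}
    overlap = {"A": {"B": 0.95, "C": 0.1}, "B": {"A": 0.95, "C": 0.1},
               "C": {"A": 0.1, "B": 0.1}}
    print(route_query(["A", "B", "C"], quality, overlap, k=2))  # ['A', 'C']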

    Indexing collections of XML documents with arbitrary links

    Get PDF
    In recent years, the popularity of XML has increased significantly. XML is the extensible markup language of the World Wide Web Consortium (W3C). XML is used to represent data in many areas, such as traditional database management systems, e-business environments, and the World Wide Web. XML data, unlike relational and object-oriented data, has no fixed schema that is known in advance and stored separately from the data. XML data is self-describing and can model heterogeneity more naturally than relational or object-oriented data models. Moreover, XML data usually contains XLinks or XPointers to data in other documents (global links). In addition to XLink or XPointer links, the XML standard allows internal links between different elements within the same XML document via ID/IDREF attributes. The rise in popularity of XML has generated much interest in query processing over graph-structured data. To facilitate efficient evaluation of path expressions, structured indexes have been proposed. However, most variants of structured indexes ignore global or internal document references. They assume a tree-like structure of XML documents that does not contain such global and internal links. Extending these indexes to work with large XML graphs containing global or internal document links firstly requires a lot of computing power for the creation process; secondly, it requires a great deal of space to store the indexes. As this thesis demonstrates, the efficient evaluation of ancestor-descendant queries over arbitrary graphs with long paths is indeed a complex issue. This thesis proposes the HID index (2-Hop cover path Index based on DAG), which builds on the concept of a two-hop cover for a directed graph. The algorithms proposed for HID index creation, in effect, scale down the original graph size substantially, producing a directed acyclic graph (DAG) with fewer nodes and edges. This reduces the number of computing steps required to build the index, and thereby its computing time and space requirements as well. The index also permits efficient evaluation of ancestor-descendant relationships. Moreover, the proposed index has an advantage over other comparable indexes: it is optimized for descendants-or-self queries on arbitrary graphs with link relationships, a task that would stress any index structure. Our experiments with real-life XML data show that the HID index provides better performance than other indexes.
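    To make the two-hop-cover concept concrete, here is a toy sketch of the labeling idea such indexes build on: each node u stores a set L_out(u) of nodes reachable from u and a set L_in(u) of nodes that reach u, and u reaches v exactly when the two label sets intersect. The naive label construction below materializes the full transitive closure for clarity; a real index like HID instead chooses a small cover (and first condenses cycles into a DAG), which is precisely where its space savings come from.

    # Toy sketch of two-hop-cover reachability labels over a DAG.
    from collections import defaultdict

    def build_two_hop_labels(edges):
        succ = defaultdict(set)
        for u, v in edges:
            succ[u].add(v)
        nodes = {u for e in edges for u in e}

        def reachable_from(u):          # DFS transitive closure; assumes a DAG
            seen, stack = set(), [u]    # (e.g. after condensing SCCs)
            while stack:
                x = stack.pop()
                for y in succ[x]:
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            return seen

        l_out = {u: reachable_from(u) | {u} for u in nodes}
        l_in = {u: {v for v in nodes if u in l_out[v]} for u in nodes}
        return l_out, l_in

    def is_descendant(l_out, l_in, anc, desc):
        """anc reaches desc iff some hop node lies in both labels."""
        return bool(l_out[anc] & l_in[desc])

    edges = [("a", "b"), ("b", "c"), ("a", "d")]
    l_out, l_in = build_two_hop_labels(edges)
    print(is_descendant(l_out, l_in, "a", "c"))  # True
    print(is_descendant(l_out, l_in, "d", "c"))  # False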

    Predictive and prescriptive monitoring of business process outcomes

    Get PDF
    Recent years have witnessed a growing adoption of machine learning techniques for business improvement across various fields. Among other emerging applications, organizations are exploiting opportunities to improve the performance of their business processes by using predictive models for runtime monitoring. Such predictive process monitoring techniques take an event log (a set of completed business process execution traces) as input and use machine learning techniques to train predictive models. At runtime, these techniques predict either the next event, the remaining time, or the final outcome of an ongoing case, given its incomplete execution trace consisting of the events performed up to the present moment in the given case. In particular, a family of techniques called outcome-oriented predictive process monitoring focuses on predicting whether a case will end with a desired or an undesired outcome. The user of the system can use the predictions to decide whether or not to intervene, with the purpose of preventing an undesired outcome or mitigating its negative effects. Prescriptive process monitoring systems go beyond purely predictive ones by not only generating predictions but also advising the user if and how to intervene in a running case in order to optimize a given utility function. This thesis addresses the question of how to train, evaluate, and use predictive models for predictive and prescriptive monitoring of business process outcomes. The thesis proposes a taxonomy and performs a comparative experimental evaluation of existing techniques in the field. Moreover, we propose a framework for incorporating textual data into predictive monitoring systems. We introduce the notion of temporal stability to evaluate these systems and propose a prescriptive process monitoring framework for advising users if and how to act upon the predictions. The results suggest that the proposed solutions complement the existing techniques and can be useful for practitioners in implementing predictive process monitoring systems in real life.
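    The generic setup the abstract describes can be sketched in a few lines: cut each completed trace into prefixes, encode every prefix as a fixed-length feature vector, and train a classifier on the known outcomes. The tiny log, activity vocabulary, and count-based encoding below are illustrative assumptions, not the thesis's specific pipeline.

    # Minimal sketch of outcome-oriented predictive process monitoring:
    # trace prefixes from an event log are encoded as activity counts and
    # a classifier learns to predict the final case outcome.
    from sklearn.ensemble import RandomForestClassifier

    ACTIVITIES = ["register", "check", "approve", "reject", "notify"]

    def encode_prefix(prefix):
        """Frequency (count) encoding of a trace prefix."""
        return [prefix.count(a) for a in ACTIVITIES]

    # (trace, outcome) pairs: 1 = desired outcome, 0 = undesired.
    log = [
        (["register", "check", "approve", "notify"], 1),
        (["register", "check", "reject"], 0),
        (["register", "check", "check", "approve", "notify"], 1),
        (["register", "check", "check", "reject"], 0),
    ]

    X, y = [], []
    for trace, outcome in log:
        for k in range(1, len(trace) + 1):      # every prefix of each trace
            X.append(encode_prefix(trace[:k]))
            y.append(outcome)

    model = RandomForestClassifier(random_state=0).fit(X, y)

    # At runtime: estimate the outcome probability for an ongoing case.
    running = ["register", "check", "check"]
    print(model.predict_proba([encode_prefix(running)])[0])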

    Social impact retrieval: measuring author influence on information retrieval

    Get PDF
    The increased presence of technologies collectively referred to as Web 2.0 means that the entire process of new media production and dissemination has moved away from an author-centric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly. In this thesis we propose a new way of looking at a user's contributions to the network in which they are present, using these interactions to provide a measure of authority and centrality for the user. This measure is then used to attribute a query-independent interest score to each of the contributions the author makes, enabling us to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two algorithms: AuthorRank and MessageRank. We present two real-world user experiments focused on multimedia annotation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing as well as free-text annotation. Using these systems as examples of real-world applications of our approaches, we then look at a larger-scale experiment based on the author and citation networks of a ten-year period of the ACM SIGIR conference on information retrieval, 1997-2007. We use the citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of incorporating user-generated content, or annotations, to improve information retrieval.
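    The abstract does not define AuthorRank or MessageRank, so the sketch below shows only the generic PageRank-style mechanism such author-authority measures build on: iterate rank flow over an author interaction graph, then use an author's score as a query-independent prior on their contributions. The graph, damping factor, and iteration count are illustrative assumptions.

    # Hedged sketch of a PageRank-style authority score over an author
    # interaction network (citations, replies, annotations, ...).
    def author_rank(links, damping=0.85, iters=50):
        """links[a] = authors that a cites/replies to/annotates."""
        nodes = set(links) | {t for ts in links.values() for t in ts}
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iters):
            new = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for a, targets in links.items():
                if targets:
                    share = damping * rank[a] / len(targets)
                    for t in targets:
                        new[t] += share
                else:  # dangling author: spread rank uniformly
                    for n in nodes:
                        new[n] += damping * rank[a] / len(nodes)
            rank = new
        return rank

    links = {"alice": ["bob"], "bob": ["carol"], "carol": ["bob"], "dave": []}
    scores = author_rank(links)
    print(sorted(scores, key=scores.get, reverse=True))  # most authoritative first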

    Cost- and workload-driven data management in the cloud

    Get PDF
    This thesis deals with the challenge of finding the right balance between consistency, availability, latency and costs, captured by the CAP/PACELC trade-offs, in the context of distributed data management in the Cloud. At the core of this work, cost- and workload-driven data management protocols, called CCQ protocols, are developed. First, this includes the development of C3, an adaptive consistency protocol that is able to adjust consistency at runtime by considering consistency and inconsistency costs. Second, the development of Cumulus, an adaptive data partitioning protocol that can adapt partitions by considering the application workload, so that expensive distributed transactions are minimized or avoided. And third, the development of QuAD, a quorum-based replication protocol that constructs quorums in such a way that, given a set of constraints, the best possible performance is achieved. The behavior of each CCQ protocol is steered by a cost model, which aims at reducing the costs and overhead of providing the desired data management guarantees. The CCQ protocols are able to continuously assess their behavior and, if necessary, adapt it at runtime based on the application workload and the cost model. This property is crucial for applications deployed in the Cloud, as they are characterized by a highly dynamic workload and high scalability and availability demands. The dynamic adaptation of behavior at runtime does not come for free and may generate considerable overhead that outweighs the gain of adaptation. The CCQ cost models therefore incorporate a control mechanism that aims at avoiding expensive and unnecessary adaptations which do not provide any benefit to applications. Adaptation is a distributed activity that requires coordination between the sites of a distributed database system. The CCQ protocols implement safe online adaptation approaches, which exploit the properties of 2PC and 2PL to ensure that all sites behave in accordance with the cost model, even in the presence of arbitrary failures. It is crucial to guarantee a globally consistent view of the behavior, as otherwise the effects of the cost models are nullified. The presented protocols are implemented as part of a prototypical database system. Their modular architecture allows for a seamless extension of the optimization capabilities at any level of their implementation. Finally, the protocols are quantitatively evaluated in a series of experiments executed in a real Cloud environment. The results show their feasibility and their ability to reduce application costs and to dynamically adjust behavior at runtime without violating correctness.
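    The cost-model-plus-control-mechanism idea can be illustrated with a toy decision rule, sketched below under stated assumptions: compare the coordination cost of strong consistency against the expected repair cost of conflicts under weak consistency, and switch modes only when the saving clears a threshold, so that cheap, transient workload shifts do not trigger expensive adaptations. The parameter names and numbers are illustrative, not C3's actual model.

    # Hedged sketch of a cost-driven consistency choice with a control
    # mechanism that suppresses unnecessary (thrashing) adaptations.
    def choose_consistency(update_rate, conflict_prob, coord_cost,
                           repair_cost, current_mode, switch_threshold=0.2):
        """Return 'strong' or 'weak' for the next adaptation window."""
        strong_cost = update_rate * coord_cost
        weak_cost = update_rate * conflict_prob * repair_cost
        cheaper = "strong" if strong_cost < weak_cost else "weak"
        if cheaper == current_mode:
            return current_mode
        # Only adapt if the relative saving justifies the switch overhead.
        saving = abs(strong_cost - weak_cost) / max(strong_cost, weak_cost)
        return cheaper if saving > switch_threshold else current_mode

    # Hot, conflict-prone item: coordinating every update pays off.
    print(choose_consistency(update_rate=100, conflict_prob=0.4,
                             coord_cost=1.0, repair_cost=5.0,
                             current_mode="weak"))  # strong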

    A Probabilistic Framework for Information Modelling and Retrieval Based on User Annotations on Digital Objects

    Get PDF
    Annotations are a means to make critical remarks, to explain and comment on things, to add notes and give opinions, and to relate objects. Nowadays, they can be found in digital libraries and collaboratories, for example as a building block for scientific discussion on the one hand or as private notes on the other. We further find them in product reviews, scientific databases and many "Web 2.0" applications; even well-established concepts like emails can be regarded as annotations in a certain sense. Digital annotations can be (textual) comments, markings (i.e. highlighted parts) and references to other documents or document parts. Since annotations convey information which is potentially important to satisfy a user's information need, this thesis tries to answer the question of how to exploit annotations for information retrieval. It gives a first answer to the question of whether retrieval effectiveness can be improved with annotations. A survey of the "annotation universe" reveals some facets of annotations; for example, they can be content-level annotations (extending the content of the annotated object) or meta-level ones (saying something about the annotated object). Besides the annotations themselves, other objects created during the process of annotation can be interesting for retrieval, namely the annotated fragments. These objects are integrated into an object-oriented model comprising digital objects such as structured documents and annotations as well as fragments. In this model, the different relationships among the various objects are reflected. From this model, the basic data structure for annotation-based retrieval, the structured annotation hypertext, is derived. In order to thoroughly exploit the information contained in structured annotation hypertexts, a probabilistic, object-oriented logical framework called POLAR is introduced. In POLAR, structured annotation hypertexts can be modelled by means of probabilistic propositions and four-valued logics. POLAR allows for specifying several relationships among annotations and annotated (sub)parts or fragments. Queries can be posed to extract the knowledge contained in structured annotation hypertexts. POLAR supports annotation-based retrieval, i.e. document and discussion search, by applying an augmentation strategy (knowledge augmentation, propagating propositions from subcontexts like annotations, or relevance augmentation, where retrieval status values are propagated) in conjunction with probabilistic inference, where P(d -> q), the probability that a document d implies a query q, is estimated. POLAR's semantics is based on possible worlds and accessibility relations. It is implemented on top of four-valued probabilistic Datalog. POLAR's core retrieval functionality, knowledge augmentation with probabilistic inference, is evaluated for discussion and document search. The experiments show that all relevant POLAR objects (merged annotation targets, fragments and content annotations) are able to increase retrieval effectiveness when used as a context for discussion or document search. Additional experiments reveal that we can determine the polarity of annotations with an accuracy of around 80%.
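    As a rough intuition for the augmentation strategy, the toy sketch below propagates term evidence from a document's annotations into the document's own representation with a discount factor before scoring against a query. This bag-of-words stand-in only illustrates the propagation idea; it is not POLAR's four-valued probabilistic Datalog semantics, and the discount, smoothing, and example data are assumptions.

    # Hedged sketch of knowledge augmentation: annotation terms contribute
    # to the document's context with reduced (discounted) weight.
    from collections import Counter

    def augmented_score(doc_terms, annotation_terms_list, query,
                        discount=0.5):
        """Score a document by query-term weights, where annotation
        terms contribute with discounted weight."""
        weights = Counter(doc_terms)
        for ann_terms in annotation_terms_list:
            for t, c in Counter(ann_terms).items():
                weights[t] += discount * c
        total = sum(weights.values())
        # Product of smoothed per-term probabilities over query terms,
        # a crude stand-in for estimating P(d -> q).
        score = 1.0
        for t in query:
            score *= (weights[t] + 0.1) / (total + 0.1 * len(weights))
        return score

    doc = ["xml", "index", "query"]
    annotations = [["evaluation", "benchmark"], ["query", "routing"]]
    print(augmented_score(doc, annotations, query=["benchmark", "query"]))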