16 research outputs found

    Scalable DB+IR technology: processing Probabilistic Datalog with HySpirit

    Get PDF
    Probabilistic Datalog (PDatalog, proposed in 1995) is a probabilistic variant of Datalog and a nice conceptual idea to model Information Retrieval in a logical, rule-based programming paradigm. Making PDatalog work in real-world applications requires more than probabilistic facts and rules, and the semantics associated with the evaluation of the programs. We report in this paper some of the key features of the HySpirit system required to scale the execution of PDatalog programs. Firstly, there is the requirement to express probability estimation in PDatalog. Secondly, fuzzy-like predicates are required to model vague predicates (e.g. vague match of attributes such as age or price). Thirdly, to handle large data sets there are scalability issues to be addressed, and therefore, HySpirit provides probabilistic relational indexes and parallel and distributed processing. The main contribution of this paper is a consolidated view on the methods of the HySpirit system to make PDatalog applicable in real-scale applications that involve a wide range of requirements typical for data (information) management and analysis

    A Probabilistic Framework for Information Modelling and Retrieval Based on User Annotations on Digital Objects

    Get PDF
    Annotations are a means to make critical remarks, to explain and comment things, to add notes and give opinions, and to relate objects. Nowadays, they can be found in digital libraries and collaboratories, for example as a building block for scientific discussion on the one hand or as private notes on the other. We further find them in product reviews, scientific databases and many "Web 2.0" applications; even well-established concepts like emails can be regarded as annotations in a certain sense. Digital annotations can be (textual) comments, markings (i.e. highlighted parts) and references to other documents or document parts. Since annotations convey information which is potentially important to satisfy a user's information need, this thesis tries to answer the question of how to exploit annotations for information retrieval. It gives a first answer to the question if retrieval effectiveness can be improved with annotations. A survey of the "annotation universe" reveals some facets of annotations; for example, they can be content level annotations (extending the content of the annotation object) or meta level ones (saying something about the annotated object). Besides the annotations themselves, other objects created during the process of annotation can be interesting for retrieval, these being the annotated fragments. These objects are integrated into an object-oriented model comprising digital objects such as structured documents and annotations as well as fragments. In this model, the different relationships among the various objects are reflected. From this model, the basic data structure for annotation-based retrieval, the structured annotation hypertext, is derived. In order to thoroughly exploit the information contained in structured annotation hypertexts, a probabilistic, object-oriented logical framework called POLAR is introduced. In POLAR, structured annotation hypertexts can be modelled by means of probabilistic propositions and four-valued logics. POLAR allows for specifying several relationships among annotations and annotated (sub)parts or fragments. Queries can be posed to extract the knowledge contained in structured annotation hypertexts. POLAR supports annotation-based retrieval, i.e. document and discussion search, by applying an augmentation strategy (knowledge augmentation, propagating propositions from subcontexts like annotations, or relevance augmentation, where retrieval status values are propagated) in conjunction with probabilistic inference, where P(d -> q), the probability that a document d implies a query q, is estimated. POLAR's semantics is based on possible worlds and accessibility relations. It is implemented on top of four-valued probabilistic Datalog. POLAR's core retrieval functionality, knowledge augmentation with probabilistic inference, is evaluated for discussion and document search. The experiments show that all relevant POLAR objects, merged annotation targets, fragments and content annotations, are able to increase retrieval effectiveness when used as a context for discussion or document search. Additional experiments reveal that we can determine the polarity of annotations with an accuracy of around 80%

    Techniques for organizational memory information systems

    Get PDF
    The KnowMore project aims at providing active support to humans working on knowledge-intensive tasks. To this end the knowledge available in the modeled business processes or their incarnations in specific workflows shall be used to improve information handling. We present a representation formalism for knowledge-intensive tasks and the specification of its object-oriented realization. An operational semantics is sketched by specifying the basic functionality of the Knowledge Agent which works on the knowledge intensive task representation. The Knowledge Agent uses a meta-level description of all information sources available in the Organizational Memory. We discuss the main dimensions that such a description scheme must be designed along, namely information content, structure, and context. On top of relational database management systems, we basically realize deductive object- oriented modeling with a comfortable annotation facility. The concrete knowledge descriptions are obtained by configuring the generic formalism with ontologies which describe the required modeling dimensions. To support the access to documents, data, and formal knowledge in an Organizational Memory an integrated domain ontology and thesaurus is proposed which can be constructed semi-automatically by combining document-analysis and knowledge engineering methods. Thereby the costs for up-front knowledge engineering and the need to consult domain experts can be considerably reduced. We present an automatic thesaurus generation tool and show how it can be applied to build and enhance an integrated ontology /thesaurus. A first evaluation shows that the proposed method does indeed facilitate knowledge acquisition and maintenance of an organizational memory

    Techniques for improving efficiency and scalability for the integration of information retrieval and databases

    Get PDF
    PhDThis thesis is on the topic of integration of Information Retrieval (IR) and Databases (DB), with particular focuses on improving efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data. Our specific interest in this thesis is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrievals for text and data are to be expressed in probabilistic logical programs such as probabilistic relational algebra or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we proposed three optimization techniques that focus on aspects covered logical and physical layers, which include: scoring-driven query optimization using scoring expression, query processing with top-k incorporated pipeline, and indexing with relational inverted index. Specifically, scoring expressions are proposed for expressing the scoring or probabilistic semantics of implied scoring functions of PRA expressions, so that efficient query execution plan can be generated by rule-based scoring-driven optimizer. Secondly, to balance efficiency and effectiveness so that to improve query response time, we studied methods for incorporating topk algorithms into pipelined query execution engine for IR+DB systems. Thirdly, the proposed relational inverted index integrates IR-style inverted index and DB-style tuple-based index, which can be used to support efficient probability estimation and aggregation as well as conventional relational operations. Experiments were carried out to investigate the performances of proposed techniques. Experimental results showed that the efficiency and scalability of an IR+DB prototype have been improved, while the system can handle queries efficiently on considerable large data sets for a number of IR tasks

    Language Models and Smoothing Methods for Information Retrieval

    Get PDF
    Language Models and Smoothing Methods for Information Retrieval (Sprachmodelle und Glättungsmethoden für Information Retrieval) Najeeb A. Abdulmutalib Kurzfassung der Dissertation Retrievalmodelle bilden die theoretische Grundlage für effektive Information-Retrieval-Methoden. Statistische Sprachmodelle stellen eine neue Art von Retrievalmodellen dar, die seit etwa zehn Jahren in der Forschung betrachtet werde. Im Unterschied zu anderen Modellen können sie leichter an spezifische Aufgabenstellungen angepasst werden und liefern häufig bessere Retrievalergebnisse. In dieser Dissertation wird zunächst ein neues statistisches Sprachmodell vorgestellt, das explizit Dokumentlängen berücksichtigt. Aufgrund der spärlichen Beobachtungsdaten spielen Glättungsmethoden bei Sprachmodellen eine wichtige Rolle. Auch hierfür stellen wir eine neue Methode namens 'exponentieller Glättung' vor. Der experimentelle Vergleich mit konkurrierenden Ansätzen zeigt, dass unsere neuen Methoden insbesondere bei Kollektionen mit stark variierenden Dokumentlängen überlegene Ergebnisse liefert. In einem zweiten Schritt erweitern wir unseren Ansatz auf XML-Retrieval, wo hierarchisch strukturierte Dokumente betrachtet werden und beim fokussierten Retrieval möglichst kleine Dokumentteile gefunden werden sollen, die die Anfrage vollständig beantworten. Auch hier demonstriert der experimentelle Vergleich mit anderen Ansätzen die Qualität unserer neu entwickelten Methoden. Der dritte Teil der Arbeit beschäftigt sich mit dem Vergleich von Sprachmodellen und der klassischen tf*idf-Gewichtung. Neben einem besseren Verständnis für die existierenden Glättungsmethoden führt uns dieser Ansatz zur Entwicklung des Verfahrens der 'empirischen Glättung'. Die damit durchgeführten Retrievalerexperimente zeigen Verbesserungen gegenüber anderen Glättungsverfahren.Language Models and Smoothing Methods for Information Retrieval Najeeb A. Abdulmutalib Abstract of the Dissertation Designing an effective retrieval model that can rank documents accurately for a given query has been a central problem in information retrieval for several decades. An optimal retrieval model that is both effective and efficient and that can learn from feedback information over time is needed. Language models are new generation of retrieval models and have been applied since the last ten years to solve many different information retrieval problems. Compared with the traditional models such as the vector space model, they can be more easily adapted to model non traditional and complex retrieval problems and empirically they tend to achieve comparable or better performance than the traditional models. Developing new language models is currently an active research area in information retrieval. In the first stage of this thesis we present a new language model based on an odds formula, which explicitly incorporates document length as a parameter. To address the problem of data sparsity where there is rarely enough data to accurately estimate the parameters of a language model, smoothing gives a way to combine less specific, more accurate information with more specific, but noisier data. We introduce a new smoothing method called exponential smoothing, which can be combined with most language models. We present experimental results for various language models and smoothing methods on a collection with large document length variation, and show that our new methods compare favourably with the best approaches known so far. We discuss the collection effect on the retrieval function, where we investigate the performance of well known models and compare the results conducted using two variant collections. In the second stage we extend the current model from flat text retrieval to XML retrieval since there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search and retrieve information from XML document collections. Compared to traditional information retrieval, where whole documents are usually indexed and retrieved as single complete units, information retrieval from XML documents creates additional retrieval challenges. By exploiting the logical document structure, XML allows for more focussed retrieval that identifies elements rather than documents as answers to user queries. Finally we show how smoothing plays a role very similar to that of the idf function: beside the obvious role of smoothing, it also improves the accuracy of the estimated language model. The within document frequency and the collection frequency of a term actually influence the probability of relevance, which led us to a new class of smoothing function based on numeric prediction, which we call empirical smoothing. Its retrieval quality outperforms that of other smoothing methods

    A study of the kinematics of probabilities in information retrieval

    Get PDF
    In Information Retrieval (IR), probabilistic modelling is related to the use of a model that ranks documents in decreasing order of their estimated probability of relevance to a user's information need expressed by a query. In an IR system based on a probabilistic model, the user is guided to examine first the documents that are the most likely to be relevant to his need. If the system performed well, these documents should be at the top of the retrieved list. In mathematical terms the problem consists of estimating the probability P(R | q,d), that is the probability of relevance given a query q and a document d. This estimate should be performed for every document in the collection, and documents should then be ranked according to this measure. For this evaluation the system should make use of all the information available in the indexing term space. This thesis contains a study of the kinematics of probabilities in probabilistic IR. The aim is to get a better insight of the behaviour of the probabilistic models of IR currently in use and to propose new and more effective models by exploiting different kinematics of probabilities. The study is performed both from a theoretical and an experimental point of view. Theoretically, the thesis explores the use of the probability of a conditional, namely P(d → q), to estimate the conditional probability P(R | q,d). This is achieved by interpreting the term space in the context of the "possible worlds semantics". Previous approaches in this direction had as their basic assumption the consideration that "a document is a possible world". In this thesis a different approach is adopted, based on the assumption that "a term is a possible world". This approach enables the exploitation of term-term semantic relationships in the term space, estimated using an information theoretic measure. This form of information is rarely used in IR at retrieval time. Two new models of IR are proposed, based on two different way of estimating P(d → q) using a logical technique called Imaging. The first model is called Retrieval by Logical Imaging; the second is called Retrieval by General Logical Imaging, being a generalisation of the first model. The probability kinematics of these two models is compared with that of two other proposed models: the Retrieval by Joint Probability model and the Retrieval by Conditional Probability model. These last two models mimic the probability kinematics of the Vector Space model and of the Probabilistic Retrieval model. Experimentally, the retrieval effectiveness of the above four models is analysed and compared using five test collections of different sizes and characteristics. The results of this experimentation depend heavily on the choice of term weight and term similarity measures adopted. The most important conclusion of this thesis is that theoretically a probability transfer that takes into account the semantic similarity between the probability-donor and the probability-recipient is more effective than a probability transfer that does not take that into account. In the context of IR this is equivalent to saying that models that exploit the semantic similarity between terms in the term space at retrieval time are more effective that models that do not do that. Unfortunately, while the experimental investigation carried out using small test collections provide evidence supporting this conclusion, experiments performed using larger test collections do not provide as much supporting evidence (although they do not provide contrasting evidence either). The peculiar characteristics of the term space of different collections play an important role in shaping the effects that different probability kinematics have on the effectiveness of the retrieval process. The above result suggests the necessity and the usefulness of further investigations into more complex and optimised models of probabilistic IR, where probability kinematics follows non-classical approaches. The models proposed in this thesis are just two such approaches; other ones can be developed using recent results achieved in other fields, such as non-classical logics and belief revision theory
    corecore