50 research outputs found

    Index compression for information retrieval systems

    [Abstract] Given the increasing amount of information that is available today, there is a clear need for Information Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient processing means minimising the amount of time and space required to process data, whereas effective processing means identifying accurately which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to efficiency is usually harmful to effectiveness, and vice versa), so the challenge of IR systems is to find a compromise between efficient and effective data processing. This thesis investigates the efficiency of IR systems. It suggests several novel strategies that can render IR systems more efficient by reducing their index size, a process referred to as index compression. The index is the data structure that stores the information handled in the retrieval process. Two different approaches are proposed for index compression, namely document reordering and static index pruning. Both of these approaches exploit document collection characteristics in order to reduce the size of indexes, either by reassigning the identifiers of the documents in the collection, or by pruning the index, that is, selectively discarding information that is less relevant to the retrieval process. The index compression strategies proposed in this thesis can be grouped into two categories: (i) Strategies which extend the state of the art in the field of efficiency methods in novel ways. (ii) Strategies which are derived from properties pertaining to the effectiveness of IR systems; these are novel strategies, because they are derived from effectiveness as opposed to efficiency principles, and also because they show that efficiency and effectiveness can be successfully combined for retrieval. 
The main contributions of this work are in indicating principled extensions of the state of the art in index compression, and also in suggesting novel theoretically-driven index compression techniques which are derived from principles of IR effectiveness. All these techniques are evaluated extensively, in thorough experiments involving established datasets and baselines, which allow for a straightforward comparison with the state of the art. Moreover, the optimality of the proposed approaches is addressed from a theoretical perspective.

    Improving Efficiency, Expressiveness and Security of Searchable Encryption

    A large part of our personal data, ranging from medical and financial records to our social activity, is stored online in cloud servers. Frequent data breaches threaten to expose these data to malicious third parties, often with catastrophic consequences (estimated at several billion US dollars annually). In this thesis, we use, extend and improve Searchable Encryption (SE) in order to build the next generation of encrypted databases and systems that will prevent such undesirable situations. Our goal is to build systems that are both practical and provably secure, while allowing expressive search and computation on encrypted data. Towards this goal, we have proposed new SE schemes that achieve the following: (i) have better search/computation time, (ii) allow expressive queries such as range, join, group-by, as well as dynamic query workloads, and (iii) provide new adjustable security-efficiency trade-offs, leading to robust and efficient schemes even against very powerful adversaries.

    Software similarity and classification

    This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities.

    Error processes in the integration of digital cartographic data in geographic information systems.

    Errors within a Geographic Information System (GIS) arise from several factors. First, receiving data from a variety of different sources results in a degree of incompatibility between such information. Secondly, the very processes used to bring the information into the GIS may in fact degrade the quality of the data. If geometric overlay (the very raison d'être of many GISs) is to be performed, such inconsistencies need to be carefully examined and dealt with. A variety of techniques exist for the user to eliminate such problems, but all of these tend to rely on the geometry of the information, rather than on its meaning or nature. This thesis explores the introduction of error into GISs and the consequences this has for any subsequent data analysis. Techniques for error removal at the overlay stage are also examined and improved solutions are offered. Furthermore, the thesis looks at the role of the data model and the potential detrimental effects it can have in forcing the data to be organised into a pre-defined structure.