331 research outputs found

    Index compression for information retrieval systems

    [Abstract] Given the increasing amount of information that is available today, there is a clear need for Information Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient processing means minimising the amount of time and space required to process data, whereas effective processing means accurately identifying which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to efficiency is usually harmful to effectiveness, and vice versa), so the challenge for IR systems is to find a compromise between efficient and effective data processing. This thesis investigates the efficiency of IR systems. It suggests several novel strategies that can render IR systems more efficient by reducing their index size, an approach referred to as index compression. The index is the data structure that stores the information handled in the retrieval process. Two different approaches are proposed for index compression, namely document reordering and static index pruning. Both approaches exploit document collection characteristics in order to reduce the size of indexes, either by reassigning the document identifiers of the collection in the index, or by pruning the index to selectively discard information that is less relevant to the retrieval process. The index compression strategies proposed in this thesis can be grouped into two categories: (i) strategies which extend the state of the art in efficiency methods in novel ways, and (ii) strategies which are derived from properties pertaining to the effectiveness of IR systems; the latter are novel both because they are derived from effectiveness rather than efficiency principles, and because they show that efficiency and effectiveness can be successfully combined for retrieval. The main contributions of this work are in indicating principled extensions of the state of the art in index compression, and in suggesting novel, theoretically driven index compression techniques derived from principles of IR effectiveness. All these techniques are evaluated extensively, in thorough experiments involving established datasets and baselines, which allow for a straightforward comparison with the state of the art. Moreover, the optimality of the proposed approaches is addressed from a theoretical perspective.
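
    To make the reordering idea concrete, the following is a minimal sketch (not taken from the thesis) of why reassigning document identifiers shrinks an index: postings lists are stored as docid gaps, and variable-byte coding spends fewer bytes on small gaps, so clustering similar documents under nearby identifiers reduces the encoded size. The function names and the toy postings lists are illustrative assumptions.

```python
# Illustrative sketch (not the thesis's code): postings lists store docid
# gaps, and variable-byte coding needs fewer bytes for small gaps, so
# document reordering that clusters similar documents shrinks the index.

def vbyte_encode(numbers):
    """Variable-byte encode a list of positive integers."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)  # high bit marks the final byte of each number
    return bytes(out)

def postings_size(docids):
    """Bytes needed to store a sorted postings list as vbyte-coded gaps."""
    gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
    return len(vbyte_encode(gaps))

scattered = [5, 400, 100_000, 2_000_000]   # hypothetical docids before reordering
clustered = [5, 6, 7, 8]                   # the same postings after reassignment
print(postings_size(scattered), postings_size(clustered))  # 9 bytes vs 4 bytes
```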

    PERICLES Deliverable 4.3: Content Semantics and Use Context Analysis Techniques

    The current deliverable summarises the work conducted within task T4.3 of WP4, focusing on the extraction and the subsequent analysis of semantic information from digital content, which is imperative for its preservability. More specifically, the deliverable defines content semantic information from a visual and textual perspective, explains how this information can be exploited in long-term digital preservation, and proposes novel approaches for extracting this information in a scalable manner. Additionally, the deliverable discusses novel techniques for retrieving and analysing the context of use of digital objects. Although this topic has not been extensively studied in the existing literature, we believe use context is vital in augmenting the semantic information and maintaining the usability and preservability of the digital objects, as well as their ability to be accurately interpreted as initially intended.

    Spike based neural codes: towards a novel bio-inspired still image coding schema

    We asked whether rank order coding could be used to define an efficient compression scheme for still images. The main hypothesis underlying this work is that the mammalian retina generates a compressed neural code for visual stimuli. The main novelty of our approach is to show how this neural code can be exploited in the context of image compression. Our coding scheme is a combination of a simplified spiking retina model and well-known data compression techniques, and consists of three main stages. The first stage is the bio-inspired retina model proposed by Thorpe et al. This model transforms a stimulus into a wave of electrical impulses called spikes. The major property of this retina model is that spikes are ordered in time as a function of the cells' activation: this yields the so-called rank order code (ROC). ROC states that the first wave of spikes gives a good estimate of the input signal. In the second stage, we show how this wave of spikes can be expressed using a 4-ary dictionary alphabet: the stack run coding. The third stage consists in applying a first-order arithmetic coder to the stack run code. We then compare our results to the JPEG standards and show that our model offers a similar rate/quality trade-off down to 0.07 bpp, at a lower computational cost. In addition, our model offers interesting properties of scalability and robustness to noise.
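
    The following is a minimal sketch, under stated assumptions, of the rank order code idea described above: strongly activated cells fire first, only the firing order is kept, and the decoder assigns each rank a fixed decreasing weight. The filter dictionary, the 1/(rank+1) weighting, and all names are illustrative assumptions, not the authors' retina model or stack run coder.

```python
# Sketch of the rank order code idea (assumptions, not the authors' code):
# only the ORDER in which filters fire is transmitted, and the decoder
# assigns each rank a fixed, decreasing weight.
import numpy as np

def rank_order_encode(activations):
    """Return filter indices sorted by decreasing activation (firing order)."""
    return np.argsort(-np.abs(activations))

def rank_order_decode(order, dictionary, num_ranks):
    """Rebuild a signal from the first `num_ranks` spikes only."""
    rebuilt = np.zeros(dictionary.shape[1])
    for rank, idx in enumerate(order[:num_ranks]):
        weight = 1.0 / (rank + 1)          # assumed decreasing rank weight
        rebuilt += weight * dictionary[idx]
    return rebuilt

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(64, 256))    # 64 hypothetical retina-like filters
signal = rng.normal(size=256)
order = rank_order_encode(dictionary @ signal)
approx = rank_order_decode(order, dictionary, num_ranks=16)
```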

    GraPE: fast and scalable Graph Processing and Embedding

    Graph Representation Learning methods have enabled a wide range of learning problems to be addressed for data that can be represented in graph form. Nevertheless, several real-world problems in economics, biology, medicine and other fields raise relevant scaling problems for existing methods and their software implementations, due to the size of real-world graphs, characterized by millions of nodes and billions of edges. We present GraPE, a software resource for graph processing and random-walk-based embedding that can scale to large and high-degree graphs and significantly speed up computation. GraPE comprises specialized data structures, algorithms, and a fast parallel implementation that displays several orders of magnitude improvement in empirical space and time complexity compared to state-of-the-art software resources, with a corresponding boost in the performance of machine learning methods for edge and node label prediction and for the unsupervised analysis of graphs. GraPE is designed to run on laptop and desktop computers, as well as on high-performance computing clusters.
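
    As an illustration of the two ingredients mentioned in the abstract, a compact adjacency structure and random walks over it, a minimal sketch follows; it is not GraPE's implementation, and the CSR layout, toy graph, and walk routine are generic assumptions.

```python
# Generic sketch (not GraPE's code): a CSR adjacency structure (one offsets
# array plus one neighbours array) and random walks over it, the input to
# walk-based embedding methods such as DeepWalk/node2vec.
import numpy as np

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)]   # toy directed graph
num_nodes = 4

# Build CSR: neighbours of node v are neighbours[offsets[v]:offsets[v+1]]
order = sorted(edges)
neighbours = np.array([dst for _, dst in order], dtype=np.int64)
counts = np.bincount([src for src, _ in order], minlength=num_nodes)
offsets = np.concatenate(([0], np.cumsum(counts)))

def random_walk(start, length, rng):
    walk = [start]
    for _ in range(length):
        v = walk[-1]
        lo, hi = offsets[v], offsets[v + 1]
        if lo == hi:                      # dead end: stop the walk early
            break
        walk.append(int(neighbours[rng.integers(lo, hi)]))
    return walk

rng = np.random.default_rng(42)
print(random_walk(0, 10, rng))
```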

    VIABILITY OF TIME-MEMORY TRADE-OFFS IN LARGE DATA SETS

    The central question of this paper is whether compression performance, in both hardware and software, is at, approaching, or will ever reach a point where real-time compression of cached data in large data sets becomes viable for improving hit ratios and overall throughput. The problem identified is that storage access is unable to keep up with application and user demands, while cache (RAM) is too small to contain full data sets. A literature review of several existing techniques discusses how storage IO is reduced or optimized to maximize the available performance of the storage medium; however, none of the techniques discovered preclude, or are mutually exclusive with, the hypothesis proposed herein. The methodology includes benchmarking three popular compressors which meet the criteria for viability: zlib, lz4, and zstd. Common storage devices are also benchmarked to establish costs for both IO and compression operations, which helps build charts and discover break-even points under various circumstances. The results indicate that modern CISC processors and compressors are already approaching trade-off viability, and that FPGAs and ASICs could potentially reduce all overhead by pipelining compression, nearly eliminating the cost portion of the trade-off and leaving mostly benefit.
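
    A rough sketch of the break-even measurement described above might look like the following; it uses only the standard-library zlib compressor (lz4 and zstd would be third-party drop-ins), and the sample data, compression level, and timing method are assumptions rather than the thesis's benchmark setup.

```python
# Rough sketch of the trade-off measurement: how much time does compressing
# and decompressing a cached object cost, and how much cache space does it
# save? (Assumed sample data and settings, not the thesis's benchmark.)
import time
import zlib

data = b"some fairly repetitive cached record " * 4096   # ~150 KiB sample

t0 = time.perf_counter()
compressed = zlib.compress(data, 1)        # fast level, as a cache might use
t1 = time.perf_counter()
zlib.decompress(compressed)
t2 = time.perf_counter()

ratio = len(data) / len(compressed)
print(f"ratio {ratio:.1f}x, compress {(t1 - t0) * 1e3:.2f} ms, "
      f"decompress {(t2 - t1) * 1e3:.2f} ms")
# The trade-off pays off when the extra cache hits enabled by the smaller
# footprint save more storage-read time than the decompression time added.
```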

    Content-aware compression for big textual data analysis

    A substantial amount of information on the Internet is present in the form of text. The value of this semi-structured and unstructured data has been widely acknowledged, with consequent scientific and commercial exploitation. The ever-increasing data production, however, pushes data analytic platforms to their limits. This thesis proposes techniques for more efficient textual big data analysis suitable for the Hadoop analytic platform. This research explores the direct processing of compressed textual data. The focus is on developing novel compression methods with a number of desirable properties to support text-based big data analysis in distributed environments. The novel contributions of this work include the following. Firstly, a Content-aware Partial Compression (CaPC) scheme is developed. CaPC makes a distinction between informational and functional content, and only the informational content is compressed. Thus, the compressed data is made transparent to existing software libraries, which often rely on functional content to work. Secondly, a context-free bit-oriented compression scheme (Approximated Huffman Compression) based on the Huffman algorithm is developed. This uses a hybrid data structure that allows pattern searching in compressed data in linear time. Thirdly, several modern compression schemes have been extended so that the compressed data can be safely split with respect to logical data records in distributed file systems. Furthermore, an innovative two-layer compression architecture is used, in which each compression layer is appropriate for the corresponding stage of data processing. Peripheral libraries are developed that seamlessly link the proposed compression schemes to existing analytic platforms and computational frameworks, and also make the use of the compressed data transparent to developers. The compression schemes have been evaluated for a number of standard MapReduce analysis tasks using a collection of real-world datasets. In comparison with existing solutions, they have shown substantial improvement in performance and significant reduction in system resource requirements.
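
    To illustrate the general idea of partial compression that leaves functional content untouched, a minimal sketch follows; the dictionary substitution, the tab/newline record format, and all identifiers are illustrative assumptions, not the CaPC scheme itself.

```python
# Illustrative sketch of content-aware partial compression (not the thesis's
# CaPC scheme): only informational tokens are shortened via a dictionary,
# while the functional tab/newline record structure is left untouched, so
# line/record oriented tools can still split and parse the data.
FORWARD = {"information": "\x01", "retrieval": "\x02", "compression": "\x03"}
REVERSE = {v: k for k, v in FORWARD.items()}

def partial_compress(record: str) -> str:
    # Keep tabs as-is; substitute only whole word tokens within each field.
    return "\t".join(
        " ".join(FORWARD.get(tok, tok) for tok in field.split(" "))
        for field in record.split("\t")
    )

def partial_decompress(record: str) -> str:
    return "\t".join(
        " ".join(REVERSE.get(tok, tok) for tok in field.split(" "))
        for field in record.split("\t")
    )

row = "doc42\tindex compression for information retrieval"
packed = partial_compress(row)
assert partial_decompress(packed) == row
```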

    Flexi-WVSNP-DASH: A Wireless Video Sensor Network Platform for the Internet of Things

    Video capture, storage, and distribution in wireless video sensor networks (WVSNs) critically depend on the resources of the nodes forming the sensor networks. In the era of big data, the Internet of Things (IoT), and distributed demand and solutions, there is a need for multi-dimensional data to be part of the sensor network data that is easily accessible and consumable by humans as well as machines. Images and video are expected to become as ubiquitous as scalar data in traditional sensor networks. The inception of video streaming over the Internet heralded relentless research into effective ways of distributing video in a scalable and cost-effective way. There have been novel implementation attempts across several network layers. Due to the inherent complications of backward compatibility and the need for standardization across network layers, attention has refocused on addressing most of the video distribution at the application layer. As a result, a few video streaming solutions over the Hypertext Transfer Protocol (HTTP) have been proposed, most notably Apple's HTTP Live Streaming (HLS) and the Moving Picture Experts Group's Dynamic Adaptive Streaming over HTTP (MPEG-DASH). These frameworks do not address typical and future WVSN use cases. A highly flexible Wireless Video Sensor Network Platform and compatible DASH (WVSNP-DASH) are introduced. The platform's goal is to usher in video as a data element that can be integrated into traditional and non-Internet networks. A low-cost, scalable node is built from the ground up to be fully compatible with the Internet of Things machine-to-machine (M2M) concept, as well as to be easily re-targeted to new applications in a short time. The Flexi-WVSNP design includes a multi-radio node, middleware for sensor operation and communication, a cross-platform, client-facing data retriever/player framework, scalable security, and a cohesive but decoupled hardware and software design.
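
    For context on how HTTP adaptive streaming frameworks such as MPEG-DASH adapt playback, a generic sketch of throughput-based representation selection follows; the bitrate ladder, safety margin, and function names are assumptions, and this is not the WVSNP-DASH implementation.

```python
# Generic sketch of the adaptation step underlying HTTP adaptive streaming
# (MPEG-DASH / HLS style), not the WVSNP-DASH platform: the client measures
# recent segment throughput and requests the highest-bitrate representation
# it can sustain with a safety margin.
REPRESENTATIONS_KBPS = [250, 500, 1000, 2500, 5000]   # hypothetical bitrate ladder

def pick_representation(measured_kbps: float, safety: float = 0.8) -> int:
    """Return the highest bitrate not exceeding safety * measured throughput."""
    budget = measured_kbps * safety
    affordable = [r for r in REPRESENTATIONS_KBPS if r <= budget]
    return affordable[-1] if affordable else REPRESENTATIONS_KBPS[0]

print(pick_representation(1800))   # -> 1000 kbps under a 0.8 safety margin
```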

    Parallel architectural design space exploration for real-time image compression

    Embedded block coding with optimized truncation (EBCOT) is a coding algorithm used in JPEG2000. EBCOT operates on the wavelet-transformed data to generate a highly scalable compressed bit stream. Sub-band samples obtained from the wavelet transform are partitioned into smaller blocks called code-blocks. EBCOT encoding is performed block by block to avoid error propagation across the bands and to increase robustness. Block-wise encoding also provides flexibility for parallel hardware implementation of EBCOT. The encoding process in JPEG2000 is divided into two phases: Tier 1 coding (entropy encoding) and Tier 2 coding (tag tree coding). This thesis deals with the design space exploration and implementation of a parallel hardware architecture for the Tier 1 encoder used in JPEG2000. The parallel capabilities of the Tier 1 encoder motivate the exploration of a high-performance, real-time image compression architecture in hardware. The design space covers the effect of block size in terms of resources, speed, and compression performance, as well as computational performance. The key computational performance parameters targeted by the architecture are significant speedup compared to a sequential implementation, minimum processing latency, and minimum logic resource utilization. The proposed architecture is developed for an embedded application system, coded in VHDL, and synthesized for implementation on a Xilinx FPGA system.
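
    A minimal sketch of the partitioning step that makes Tier 1 coding parallelisable, splitting a wavelet sub-band into independent code-blocks, follows; the block size, array shape, and names are illustrative, and the thesis's actual design is in VHDL rather than Python.

```python
# Illustrative sketch (not the thesis's VHDL design) of the partitioning the
# abstract describes: wavelet sub-band samples are split into independent
# code-blocks, which is what makes EBCOT Tier 1 coding parallelisable.
import numpy as np

def partition_into_code_blocks(subband, block_h=32, block_w=32):
    """Yield (row, col, block) tiles covering a 2-D sub-band array."""
    rows, cols = subband.shape
    for r in range(0, rows, block_h):
        for c in range(0, cols, block_w):
            yield r, c, subband[r:r + block_h, c:c + block_w]

subband = np.random.default_rng(1).normal(size=(128, 96))   # toy sub-band
blocks = list(partition_into_code_blocks(subband))
print(len(blocks))   # 4 * 3 = 12 independent blocks that could be coded in parallel
# Each block would then go through Tier 1 bit-plane/arithmetic coding; larger
# blocks mean fewer, longer-running coders, smaller blocks mean more parallelism.
```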

    Recent Advances in Embedded Computing, Intelligence and Applications

    The latest proliferation of Internet of Things deployments and edge computing, combined with artificial intelligence, has led to exciting new application scenarios in which embedded digital devices are essential enablers. Moreover, new powerful and efficient devices are appearing to cope with workloads formerly reserved for the cloud, such as deep learning. These devices allow processing close to where data are generated, avoiding bottlenecks due to communication limitations. The efficient integration of hardware, software and artificial intelligence capabilities deployed in real sensing contexts empowers the edge intelligence paradigm, which will ultimately contribute to fostering the offloading of processing functionalities to the edge. In this Special Issue, researchers have contributed nine peer-reviewed papers covering a wide range of topics in the area of edge intelligence. Among them are hardware-accelerated implementations of deep neural networks, IoT platforms for extreme edge computing, neuro-evolvable and neuromorphic machine learning, and embedded recommender systems.