
    Brotli: A General-Purpose Data Compressor

    Brotli is an open-source general-purpose data compressor introduced by Google in late 2013 and now adopted in most known browsers and Web servers. It is publicly available on GitHub and its data format was published as RFC 7932 in July 2016. Brotli is based on the Lempel-Ziv compression scheme and intended as a generic replacement for Gzip and ZLib. The main goal in its design was to compress data on the Internet, which meant optimizing the resources used at decoding time while achieving maximal compression density. This article is intended to provide the first thorough, systematic description of the Brotli format, as well as a detailed computational and experimental analysis of the main algorithmic blocks underlying the current encoder implementation, together with a comparison against compressors of different families that constitute the state of the art either in practice or in theory. This treatment allows us to raise a set of new algorithmic and software engineering problems that deserve further attention from the scientific community.
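
    As a quick, hedged illustration of Brotli's positioning as a Gzip/ZLib replacement, the sketch below compresses the same buffer with all three codecs at their maximum-density settings; it assumes the third-party brotli Python package (pip install Brotli) alongside the standard zlib and gzip modules.

        import gzip
        import zlib

        import brotli  # third-party binding to Google's Brotli library

        data = (b"Brotli is based on the Lempel-Ziv compression scheme "
                b"and intended as a generic replacement for Gzip and ZLib. ") * 100

        # Maximum-density settings; Brotli's quality parameter ranges 0-11.
        sizes = {
            "zlib":   len(zlib.compress(data, 9)),
            "gzip":   len(gzip.compress(data, compresslevel=9)),
            "brotli": len(brotli.compress(data, quality=11)),
        }
        for name, size in sizes.items():
            print(f"{name:6s} {len(data):7d} -> {size:5d} bytes")

        # Round-trip check: decompression must restore the original bytes.
        assert brotli.decompress(brotli.compress(data)) == data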

    Hierarchical Relative Lempel-Ziv Compression

    Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string S is compressed relative to a second string R (called the reference) by parsing S into a sequence of substrings that occur in R. RLZ is particularly effective at compressing sets of strings that have a high degree of similarity to the reference string, such as a set of genomes of individuals from the same species. With the cost of DNA sequencing now low, such datasets have become extremely abundant and are rapidly growing. In this paper, instead of using a single reference string for the entire collection, we investigate the use of different reference strings for subsets of the collection, with the aim of improving compression. In particular, we propose a new compression scheme, hierarchical relative Lempel-Ziv (HRLZ), which forms a rooted tree (or hierarchy) on the strings and then compresses each string using RLZ with its parent as the reference, storing only the root of the tree in plain text. To decompress, we traverse the tree in BFS order starting at the root, decompressing each child with respect to its parent. We show that this approach leads to a twofold improvement in compression on bacterial genome datasets, with negligible effect on decompression time compared to the standard single-reference approach. We show that an effective hierarchy for a given set of strings can be constructed by computing the optimal arborescence of a complete weighted digraph over the strings, with edge weights given by the number of phrases in the RLZ parsing between the source and destination vertices. We further show that, instead of computing the complete graph, a sparse graph derived using locality-sensitive hashing can significantly reduce the cost of computing a good hierarchy, without adversely affecting compression performance.
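
    For intuition, here is a deliberately naive sketch of the RLZ parse underlying HRLZ: it greedily emits the longest prefix of the remaining input that occurs somewhere in the reference (production encoders index R with a suffix array or FM-index rather than repeated substring search). Building the hierarchy itself would then amount to computing a minimum-weight arborescence over the phrase-count digraph, e.g. with networkx's minimum_spanning_arborescence.

        def rlz_parse(s, r):
            """Greedy Relative Lempel-Ziv parse of s against reference r.

            Returns a list of phrases: (pos, length) pairs pointing into r,
            or ('lit', ch) for characters absent from r. Naive and slow;
            for illustration only.
            """
            phrases, i = [], 0
            while i < len(s):
                best_len, best_pos = 0, -1
                length = 1
                while i + length <= len(s):
                    pos = r.find(s[i:i + length])
                    if pos == -1:
                        break
                    best_len, best_pos = length, pos
                    length += 1
                if best_len == 0:
                    phrases.append(('lit', s[i]))  # character not in r
                    i += 1
                else:
                    phrases.append((best_pos, best_len))
                    i += best_len
            return phrases

        def rlz_decode(phrases, r):
            """Invert rlz_parse by copying phrases out of the reference."""
            parts = []
            for p in phrases:
                if p[0] == 'lit':
                    parts.append(p[1])
                else:
                    pos, length = p
                    parts.append(r[pos:pos + length])
            return ''.join(parts)

        # A string similar to the reference parses into very few phrases.
        ref = "ACGTACGTTTACGGACGTACGA"
        s = "ACGTACGTTTACGCACGTACGA"  # one substitution relative to ref
        parsed = rlz_parse(s, ref)
        assert rlz_decode(parsed, ref) == s
        print(len(parsed), "phrases")  # few phrases -> good compression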

    Learning Semantic Information from Multimodal Data using Deep Neural Networks

    During the last decades, most collective information has been digitized to form an immense database distributed across the Internet. This can also be referred to as Big data, a collection of data that is vast in volume and still growing with time. Nowadays, Big data is everywhere. We might not even realize how much it affects our daily life, as it is applied in many ways, ranging from online shopping, music streaming, TV streaming, travel and transportation, energy, and crime fighting to health care. Many organizations and companies have been collecting and analyzing large volumes of data to solve domain-specific problems or to make business decisions. One of the powerful tools that can be used to extract value from Big data is Deep learning, a family of machine learning algorithms based on artificial neural networks, structures inspired by the human brain that learn from large amounts of data. Deep learning has been widely applied in many research fields such as natural language processing, IoT applications, and computer vision. In this thesis, we introduce three deep neural networks used to learn semantic information from different types of data, together with a design guideline for accelerating a neural network layer on general-purpose computing platforms.

First, we focus on text data. We propose a new feature extraction technique to preprocess the dataset and optimize the original Restricted Boltzmann Machine (RBM) model to generate more meaningful topics that better represent the given documents. Our proposed method improves the generated topic accuracy by up to 12.99% on the Open Movie, Reuters, and 20NewsGroup datasets.

Moving from text to image data, and using additional click locations, we propose a human-in-the-loop automatic image labeling framework focusing on aerial images with fewer features for detection. The proposed model consists of two main parts: a prediction model and an adjustment model. The user first provides click locations to the prediction model to generate a bounding box for a specific object. The bounding box is then fine-tuned by the adjustment model for more accurate size and location. A feedback-and-retrain mechanism allows users to manually adjust the generated bounding box and provide feedback to incrementally train the adjustment network at runtime. This unique online learning feature enables users to generalize the existing model to target classes not initially present in the training set, and gradually improves the specificity of the model to those new targets during online learning.

Combining text and image data, we propose a Multi-region Attention-assisted Grounding network (MAGNet) framework that utilizes spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. Our framework is independent of external proposal generation systems, and without additional information it can develop an understanding of the query phrase in relation to the image, achieving respectable results on Flickr30k Entities and a 12% improvement over the state of the art on the ReferIt Game dataset. Additionally, our model is capable of grounding multiple regions for a query phrase, which is more suitable for real-life applications.
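
As background for the RBM-based text contribution above, here is a minimal numpy sketch of a standard binary RBM trained with one-step contrastive divergence (CD-1); this is the generic base model, not the thesis's optimized topic-modeling variant, and all names are illustrative.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class RBM:
        """Plain binary RBM with one-step contrastive divergence."""

        def __init__(self, n_visible, n_hidden, lr=0.05, seed=0):
            self.rng = np.random.default_rng(seed)
            self.W = self.rng.normal(0, 0.01, (n_visible, n_hidden))
            self.a = np.zeros(n_visible)  # visible bias
            self.b = np.zeros(n_hidden)   # hidden bias
            self.lr = lr

        def cd1(self, v0):
            # Positive phase: sample hidden units given the data batch.
            ph0 = sigmoid(v0 @ self.W + self.b)
            h0 = (self.rng.random(ph0.shape) < ph0).astype(float)
            # Negative phase: one Gibbs step back to the visible layer.
            pv1 = sigmoid(h0 @ self.W.T + self.a)
            ph1 = sigmoid(pv1 @ self.W + self.b)
            # Approximate gradient and update the parameters.
            self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
            self.a += self.lr * (v0 - pv1).mean(axis=0)
            self.b += self.lr * (ph0 - ph1).mean(axis=0)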
Although deep neural networks (DNNs) have become a powerful tool, they are highly expensive in both computation time and storage cost. To optimize and improve the performance of a network while maintaining its accuracy, the block-circulant matrix-based (BCM) algorithm has been introduced. It has been proven to be highly effective when implemented using customized hardware, such as FPGAs. However, its performance suffers on general-purpose computing platforms. In certain cases, using the BCM does not improve the total computation time of the networks at all. To address this problem, we propose a parallel implementation of the BCM layer and provide guidelines that generally lead to better implementation practice. The guidelines cover popular implementation languages and packages, including Python, numpy, intel-numpy, tensorflow, and nGraph.
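
The appeal of BCM layers is that a b x b circulant block is defined by a single length-b vector and multiplies a vector in O(b log b) time via the FFT, instead of O(b^2) for a dense block. Below is a minimal numpy sketch of the forward matvec; the block layout and names are illustrative, not the thesis's implementation.

    import numpy as np

    def circulant_matvec(c, x):
        # Multiply the circulant matrix whose first column is c by x,
        # using the FFT identity: Cx = circular convolution of c and x.
        return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

    def bcm_layer(blocks, x, b):
        """Forward pass of a block-circulant fully connected layer.

        blocks[i][j] is the defining (first-column) vector of the
        circulant block in block-row i, block-column j; x has length
        cols*b and the output has length rows*b.
        """
        rows, cols = len(blocks), len(blocks[0])
        xb = x.reshape(cols, b)
        y = np.zeros(rows * b)
        for i in range(rows):
            acc = np.zeros(b)
            for j in range(cols):
                acc += circulant_matvec(blocks[i][j], xb[j])
            y[i * b:(i + 1) * b] = acc
        return y

    # Quick check against an explicit dense circulant block.
    b = 4
    rng = np.random.default_rng(0)
    c, x = rng.standard_normal(b), rng.standard_normal(b)
    C = np.array([[c[(i - j) % b] for j in range(b)] for i in range(b)])
    assert np.allclose(C @ x, circulant_matvec(c, x))

    blocks = [[rng.standard_normal(b) for _ in range(3)] for _ in range(2)]
    assert bcm_layer(blocks, rng.standard_normal(3 * b), b).shape == (2 * b,)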

    Managing tail latency in large scale information retrieval systems

    As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem: how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency is the high-percentile latency observed from a system; in the case of search, it typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience, such as "how many queries take more than 200 ms to return answers?" or "what is the worst-case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher-quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
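
    To make the shift from median-style to tail-style questions concrete, here is a small sketch over a synthetic latency sample; it is illustrative only, and a real study would use measured query processing times instead.

        import numpy as np

        # Synthetic stand-in for per-query processing times (milliseconds).
        rng = np.random.default_rng(0)
        latencies = rng.lognormal(mean=3.5, sigma=0.6, size=100_000)

        # Median-style question vs. tail-style questions.
        p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
        over_200 = np.mean(latencies > 200.0)
        print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
        print(f"queries slower than 200 ms: {over_200:.2%}")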

    Index compression for information retrieval systems

    Given the increasing amount of information that is available today, there is a clear need for Information Retrieval (IR) systems that can process this information in an efficient and effective way. Efficient processing means minimising the amount of time and space required to process data, whereas effective processing means identifying accurately which information is relevant to the user and which is not. Traditionally, efficiency and effectiveness are at opposite ends (what is beneficial to efficiency is usually harmful to effectiveness, and vice versa), so the challenge of IR systems is to find a compromise between efficient and effective data processing. This thesis investigates the efficiency of IR systems. It suggests several novel strategies that can render IR systems more efficient by reducing their index size, an approach referred to as index compression. The index is the data structure that stores the information handled in the retrieval process. Two different approaches are proposed for index compression, namely document reordering and static index pruning. Both of these approaches exploit document collection characteristics in order to reduce the size of indexes, either by reassigning the document identifiers of the collection in the index, or by selectively discarding information that is less relevant to the retrieval process by pruning the index. The index compression strategies proposed in this thesis can be grouped into two categories: (i) strategies which extend the state of the art in the field of efficiency methods in novel ways, and (ii) strategies which are derived from properties pertaining to the effectiveness of IR systems; the latter are novel both because they are derived from effectiveness as opposed to efficiency principles, and because they show that efficiency and effectiveness can be successfully combined for retrieval. The main contributions of this work lie in indicating principled extensions of the state of the art in index compression, and in suggesting novel, theoretically-driven index compression techniques derived from principles of IR effectiveness. All these techniques are evaluated extensively, in thorough experiments involving established datasets and baselines, which allow for a straightforward comparison with the state of the art. Moreover, the optimality of the proposed approaches is addressed from a theoretical perspective.
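
    For intuition on why document reordering helps, recall that inverted indexes store ascending document identifiers as gaps, and common codes such as variable-byte spend fewer bytes on small gaps. The sketch below is illustrative, not one of the thesis's proposed strategies; it shows the effect of clustering similar documents onto nearby identifiers.

        def dgaps(postings):
            """Turn an ascending docid list into first-id + gaps."""
            return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

        def vbyte_encode(numbers):
            """Variable-byte encode non-negative integers: 7 payload bits
            per byte, with the high bit marking each number's final byte."""
            out = bytearray()
            for n in numbers:
                while n >= 128:
                    out.append(n & 0x7F)
                    n >>= 7
                out.append(n | 0x80)
            return bytes(out)

        # The same posting list under two docid assignments: clustering
        # similar documents shrinks the gaps, and hence the index.
        scattered = [7, 1200, 35000, 70100, 140000]
        clustered = [7, 8, 9, 10, 11]
        print(len(vbyte_encode(dgaps(scattered))))  # 12 bytes
        print(len(vbyte_encode(dgaps(clustered))))  # 5 bytes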

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
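
    A minimal sketch of the basic SCR cascade the survey describes, transcribing speech and then applying text IR to the transcripts; it assumes two third-party packages, openai-whisper for recognition and rank-bm25 for ranking, and both the model choice and the file names are illustrative.

        import whisper                   # pip install openai-whisper
        from rank_bm25 import BM25Okapi  # pip install rank-bm25

        # 1. Speech processing: transcribe each recording to text.
        model = whisper.load_model("base")
        files = ["talk1.mp3", "talk2.mp3"]
        docs = {f: model.transcribe(f)["text"] for f in files}

        # 2. Information retrieval: BM25 ranking over the transcripts.
        names = list(docs)
        corpus = [docs[n].lower().split() for n in names]
        bm25 = BM25Okapi(corpus)

        query = "neural speech recognition".split()
        scores = bm25.get_scores(query)
        for name, score in sorted(zip(names, scores), key=lambda p: -p[1]):
            print(f"{score:6.2f}  {name}")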