10,847 research outputs found

    Universal Indexes for Highly Repetitive Document Collections

    Get PDF
    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
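    The key idea above is to compress the differential (gap) lists themselves rather than merely gap-encoding them. Below is a minimal sketch of why this pays off on versioned collections, using run-length coding as the simplest of the three compressors mentioned; the helper names and the toy posting list are illustrative, not the paper's implementation.
```python
# Minimal sketch (not the paper's code): differential (gap) encoding of a
# posting list, followed by run-length compression of the gap sequence.
# In versioned collections, near-copy documents produce long runs of
# identical gaps, which run-length coding captures while plain
# gap-encoding does not.

def gaps(postings):
    """Turn sorted document ids into differences between consecutive ids."""
    prev = 0
    out = []
    for doc_id in postings:
        out.append(doc_id - prev)
        prev = doc_id
    return out

def rle_encode(seq):
    """Run-length encode a sequence as (value, run_length) pairs."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return [tuple(r) for r in runs]

# Hypothetical posting list of a term that occurs in several consecutive
# versions of two documents, so its gaps contain long runs of 1s.
postings = [5, 6, 7, 8, 9, 10, 42, 43, 44, 45]
print(gaps(postings))               # [5, 1, 1, 1, 1, 1, 32, 1, 1, 1]
print(rle_encode(gaps(postings)))   # [(5, 1), (1, 5), (32, 1), (1, 3)]
```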

    Re-Pair Compression of Inverted Lists

    Full text link
    Compression of inverted lists with methods that support fast intersection operations is an active research topic. Most compression schemes rely on encoding differences between consecutive positions with techniques that favor small numbers. In this paper we explore a completely different alternative: we use Re-Pair compression of those differences. While Re-Pair by itself offers fast decompression at arbitrary positions in main and secondary memory, we introduce variants that in addition speed up the operations required for inverted list intersection. We compare the resulting data structures with several recent proposals under various list intersection algorithms, and conclude that our Re-Pair variants offer an interesting time/space tradeoff for this problem, although further improvements are required before they can surpass the state of the art.
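    Re-Pair builds a grammar by repeatedly replacing the most frequent pair of adjacent symbols with a fresh nonterminal until no pair repeats. The sketch below is a naive reference version applied to a gap sequence, assuming illustrative symbol ids and data; it is not the engineered variants evaluated in the paper.
```python
# Naive Re-Pair sketch over a gap sequence (illustrative only).
from collections import Counter

def repair(seq, next_symbol=1_000_000):
    """Return (compressed sequence, grammar rules {nonterminal: (a, b)})."""
    seq = list(seq)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent symbol pairs
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:
            break                            # no pair worth replacing
        rules[next_symbol] = (a, b)
        out, i = [], 0
        while i < len(seq):                  # replace occurrences left to right
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

def expand(symbol, rules):
    """Decompress a single symbol back into its original gap values."""
    if symbol not in rules:
        return [symbol]
    a, b = rules[symbol]
    return expand(a, rules) + expand(b, rules)

gap_list = [5, 1, 1, 1, 1, 1, 32, 1, 1, 1]
compressed, rules = repair(gap_list)
restored = [g for s in compressed for g in expand(s, rules)]
assert restored == gap_list
```
    Real Re-Pair indexes obtain random access by expanding only the nonterminals covering the requested positions; the `expand` helper above decompresses a whole symbol for clarity.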

    Abnormal microglia and enhanced inflammation-related gene transcription in mice with conditional deletion of Ctcf in Camk2a-Cre-expressing neurons

    Get PDF
    CCCTC-binding factor (CTCF) is an 11 zinc finger DNA-binding domain protein that regulates gene expression by modifying 3D chromatin structure. Human mutations in CTCF cause intellectual disability and autistic features. Knocking out Ctcf in mouse embryonic neurons is lethal by neonatal age, but the effects of CTCF deficiency in postnatal neurons are less well studied. We knocked out Ctcf postnatally in glutamatergic forebrain neurons under the control of Camk2a-Cre. Ctcf loxP/loxP;Camk2a-Cre+ (Ctcf CKO) mice of both sexes were viable and exhibited profound deficits in spatial learning/memory, impaired motor coordination, and decreased sociability by 4 months of age. Ctcf CKO mice also had reduced dendritic spine density in the hippocampus and cerebral cortex. Microarray analysis of mRNA from Ctcf CKO mouse hippocampus identified increased transcription of inflammation-related genes linked to microglia. Separate microarray analysis of mRNA isolated specifically from Ctcf CKO mouse hippocampal neurons by ribosomal affinity purification identified upregulation of chemokine signaling genes, suggesting crosstalk between neurons and microglia in Ctcf CKO hippocampus. Finally, we found that microglia in Ctcf CKO mouse hippocampus had abnormal morphology by Sholl analysis and increased immunostaining for CD68, a marker of microglial activation. Our findings confirm that Ctcf KO in postnatal neurons causes a neurobehavioral phenotype in mice and provide novel evidence that CTCF depletion leads to overexpression of inflammation-related genes and microglial dysfunction. SIGNIFICANCE STATEMENT: CCCTC-binding factor (CTCF) is a DNA-binding protein that organizes nuclear chromatin topology. Mutations in CTCF cause intellectual disability and autistic features in humans. CTCF deficiency in embryonic neurons is lethal in mice, but mice with postnatal CTCF depletion are less well studied. We find that mice lacking Ctcf in Camk2a-expressing neurons (Ctcf CKO mice) have spatial learning/memory deficits, impaired fine motor skills, subtly altered social interactions, and decreased dendritic spine density. We demonstrate that Ctcf CKO mice overexpress inflammation-related genes in the brain and have microglia with abnormal morphology that label positive for CD68, a marker of microglial activation. Our findings suggest that inflammation and dysfunctional neuron–microglia interactions are factors in the pathology of CTCF deficiency.

    Algorithms and Data Structures for In-Memory Text Search Engines

    Get PDF

    Intersection caching techniques in search engines

    Get PDF
    Search engines process huge amounts of data (i.e. web pages) to build sophisticated structures that support search. The ever-growing number of users and the amount of information available on the web impose performance challenges, and query processing consumes a significant amount of computational resources. In this context, many optimization techniques are required to cope with heavy query traffic. One of the most important mechanisms to address this kind of problem is caching, which basically consists of keeping previously used items in memory based on their frequency, recency, or cost patterns.

    The typical architecture of a search engine is composed of a front-end node (broker) that provides the user interface and a cluster of a large number of search nodes that store the data and support search. In this context, caching can be implemented at several levels. At the broker level, a results cache is generally maintained; it stores the final list of results corresponding to the most frequent or recent queries. A posting list cache is implemented at each search node; it stores in a memory buffer the posting lists of popular or valuable terms, to avoid accessing disk. Complementarily, an intersection cache may be implemented to obtain additional performance gains. This cache attempts to exploit frequently co-occurring pairs of terms by keeping the results of intersecting their corresponding inverted lists in the memory of the search node (saving not only disk access time but CPU time as well).

    All cache types are managed using a so-called "replacement policy", which decides whether to evict some entry from the cache when a new item has to be inserted. This policy ideally evicts entries that are unlikely to be hit again or that are expected to provide less benefit.

    In this thesis we focus on the intersection cache level using cost-aware caching policies, that is, policies that take into account the cost of the queries when deciding which item to evict from the cache. We carefully design, analyze, and evaluate static, dynamic, and hybrid cost-aware eviction policies, adapting some cost-aware policies from other domains to ours. These policies are combined with different query-evaluation strategies, some of them originally designed to take advantage of the existence of an intersection cache by reordering the query terms in a particular way.

    We also explore the possibility of devoting the whole memory space allocated to caching at the search nodes to a new integrated cache, and propose a static version (namely, the Integrated Cache) that replaces both the posting list and intersection caches. We design a specific cache management strategy that avoids the duplication of cached terms, and we also consider different strategies to populate the integrated cache.

    Finally, we design and evaluate an admission policy for the intersection cache based on Machine Learning principles that minimizes the caching of pairs of terms that do not provide enough benefit. We model this as a classification problem with the aim of identifying those pairs of terms that appear infrequently in the query stream.

    The experimental results from applying all the proposed approaches, using real datasets on a simulation framework, show that we obtain useful improvements over the baseline even in this already hyper-optimized field.

    Fil: Tolosa, Gabriel Hernán. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales; Argentina
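    The thesis above revolves around an intersection cache managed by cost-aware replacement policies. The following sketch illustrates the general idea of caching term-pair intersections and evicting the entry that is cheapest to recompute; it assumes a toy in-memory index and a sum-of-list-lengths cost proxy, and it is not the thesis implementation (all names are hypothetical).
```python
# Illustrative cost-aware intersection cache for a search node (sketch only).

def intersect(list_a, list_b):
    """Merge-based intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i]); i += 1; j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out

class CostAwareIntersectionCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # (term1, term2) -> (cached result, cost estimate)

    def intersection(self, t1, t2, index):
        key = tuple(sorted((t1, t2)))
        if key in self.entries:                 # cache hit: no CPU/disk work
            return self.entries[key][0]
        a, b = index[t1], index[t2]
        result = intersect(a, b)
        cost = len(a) + len(b)                  # crude proxy for recomputation cost
        if len(self.entries) >= self.capacity:  # evict the cheapest-to-recompute entry
            victim = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[victim]
        self.entries[key] = (result, cost)
        return result

# Hypothetical inverted index and a two-term query:
index = {"cache": [1, 3, 5, 9], "search": [3, 5, 7, 9], "engine": [2, 3, 9]}
cache = CostAwareIntersectionCache(capacity=2)
print(cache.intersection("cache", "search", index))  # [3, 5, 9]
```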

    CICLAD: A Fast and Memory-efficient Closed Itemset Miner for Streams

    Full text link
    Mining association rules from data streams is a challenging task due to the (typically) limited resources available vs. the large size of the result. Frequent closed itemsets (FCI) enable an efficient first step, yet current FCI stream miners are not optimal on resource consumption, e.g. they store a large number of extra itemsets at an additional cost. In a search for a better storage-efficiency trade-off, we designed Ciclad, an intersection-based sliding-window FCI miner. Leveraging in-depth insights into FCI evolution, it combines minimal storage with quick access. Experimental results indicate that Ciclad's memory imprint is much lower and its performance globally better than competitor methods. Comment: KDD2
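    Intersection-based closed-itemset mining rests on the fact that the closure of an itemset is the intersection of all window transactions containing it, and that an itemset is closed exactly when it equals its own closure. The brute-force sketch below, with made-up window contents and support threshold, illustrates only that closure-by-intersection idea; it is not the incremental Ciclad algorithm.
```python
# Naive closed-frequent-itemset computation over a sliding window (sketch only).
from collections import deque
from itertools import combinations

def closure(itemset, window):
    """Intersect every transaction in the window that contains `itemset`."""
    covering = [t for t in window if itemset <= t]
    if not covering:
        return frozenset()
    result = set(covering[0])
    for t in covering[1:]:
        result &= t
    return frozenset(result)

def closed_frequent_itemsets(window, min_support):
    """Brute-force enumeration; real stream miners update results incrementally."""
    items = sorted(set().union(*window))
    closed = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in window if candidate <= t)
            if support >= min_support and closure(candidate, window) == candidate:
                closed.add((candidate, support))
    return closed

# Sliding window holding the 4 most recent (made-up) transactions:
window = deque([frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")],
               maxlen=4)
for itemset, support in closed_frequent_itemsets(window, min_support=2):
    print(sorted(itemset), support)   # e.g. ['a', 'b'] 3, ['c'] 2, ['d'] 2
```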