574 research outputs found

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach. Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
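
The context-based encoding described in the abstract can be illustrated with a minimal sketch. It is not the paper's data structure: the function names are hypothetical, words are kept as plain strings rather than vocabulary identifiers, and a Python dictionary stands in for the compressed trie. It only shows why the codes stay small: each word is replaced by its rank among the words observed after the same k-word context.

```python
from collections import defaultdict

def build_successor_tables(ngrams, k):
    """Map each length-k context to the sorted list of words seen after it."""
    followers = defaultdict(set)
    for gram in ngrams:
        for i in range(len(gram) - k):
            context = tuple(gram[i:i + k])
            followers[context].add(gram[i + k])
    return {ctx: sorted(words) for ctx, words in followers.items()}

def remap(tables, context, word):
    """Encode `word` as its rank among the successors of `context`."""
    return tables[tuple(context)].index(word)

def unmap(tables, context, code):
    """Recover the word from its rank within the same context."""
    return tables[tuple(context)][code]

ngrams = [("the", "cat", "sat"), ("the", "cat", "ran"), ("the", "dog", "sat")]
tables = build_successor_tables(ngrams, k=1)
code = remap(tables, ["cat"], "sat")   # rank of "sat" among {"ran", "sat"}
```

Because a code is never larger than the number of distinct successors of its context, it needs far fewer bits than a global vocabulary identifier whenever contexts have few successors.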

    On Optimally Partitioning Variable-Byte Codes

    The ubiquitous Variable-Byte encoding is one of the fastest compressed representations for integer sequences. However, its compression ratio is usually not competitive with other, more sophisticated encoders, especially when the integers to be compressed are small, which is the typical case for inverted indexes. This paper shows that the compression ratio of Variable-Byte can be improved by 2x by adopting a partitioned representation of the inverted lists. This makes Variable-Byte surprisingly competitive in space with the best bit-aligned encoders, hence disproving the folklore belief that Variable-Byte is space-inefficient for inverted index compression. Despite the significant space savings, we show that our optimization comes almost for free, given that: we introduce an optimal partitioning algorithm that does not affect indexing time because of its linear-time complexity; and we show that the query processing speed of Variable-Byte is preserved, with an extensive experimental analysis and comparison with several other state-of-the-art encoders. Comment: Published in IEEE Transactions on Knowledge and Data Engineering (TKDE), 15 April 201
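
For readers unfamiliar with the baseline format, here is a minimal sketch of plain Variable-Byte coding applied to the gaps of a sorted list. It does not implement the partitioning algorithm from the paper. The byte convention used here (high bit set on the last byte of each integer) is one common choice and is an assumption; other implementations flip it.

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while v >= 128:
            out.append(v & 0x7F)   # continuation byte: high bit clear
            v >>= 7
        out.append(v | 0x80)       # terminating byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            values.append(current | ((byte & 0x7F) << shift))
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return values

def to_gaps(sorted_ids):
    """Replace each identifier by its distance from the previous one."""
    return [b - a for a, b in zip([0] + sorted_ids[:-1], sorted_ids)]

def from_gaps(gaps):
    """Rebuild the sorted identifiers by prefix-summing the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

postings = [3, 130, 131, 20000]
encoded = vbyte_encode(to_gaps(postings))
```

Every integer costs at least one full byte, which is why small gaps are represented wastefully and why a partitioned layout can recover space.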

    Efficient and Effective Query Auto-Completion

    Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search systems, suggesting possible ways of completing the query being typed by the user. Efficiency is crucial to give the system real-time responsiveness when operating in a million-scale search space. Prior work has extensively advocated the use of a trie data structure for fast prefix-search operations in compact space. However, searching by prefix has little discovery power, in that only completions that are prefixed by the query are returned. This may negatively impact the effectiveness of the QAC system, with a consequent monetary loss for real applications like Web search engines and eCommerce. In this work we describe the implementation that powers a new QAC system at eBay, and discuss its efficiency and effectiveness in relation to other state-of-the-art approaches. The solution is based on the combination of an inverted index with succinct data structures, a much less explored direction in the literature. This system is replacing the previous implementation based on Apache SOLR, which was not always able to meet the required service-level agreement. Comment: Published in SIGIR 202
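
The difference between prefix search and an inverted-index search can be sketched as follows. This is an illustration only, not the eBay system: the function names are hypothetical, Python sets stand in for compressed posting lists, and completions are assumed to be stored in decreasing order of score so that a smaller identifier means a better suggestion.

```python
from collections import defaultdict

def build_index(completions):
    """Inverted index: term -> set of completion identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(completions):
        for term in text.split():
            index[term].add(doc_id)
    return index

def prefix_search(completions, query, k=5):
    """Classic trie-style behaviour: the completion must start with the query."""
    return [c for c in completions if c.startswith(query)][:k]

def conjunctive_search(completions, index, query, k=5):
    """All complete terms must occur; the last token is matched as a term prefix."""
    tokens = query.split()
    if not tokens:
        return []
    *terms, last = tokens
    candidates = set()
    for term, ids in index.items():
        if term.startswith(last):
            candidates |= ids
    for term in terms:
        candidates &= index.get(term, set())
    return [completions[i] for i in sorted(candidates)[:k]]

completions = ["iphone case", "case for iphone", "samsung case"]
index = build_index(completions)
```

For the query "iphone ca", prefix search can only return completions that begin with that string, while the conjunctive search also finds "case for iphone", which is the extra discovery power the abstract refers to.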

    Immunohistochemical markers of stem/progenitor cells in the developing human cerebellum

    The aim of this study was to identify stem/progenitor cell markers by immunohistochemistry, in order to highlight the cortical neurogenesis niches at different gestational ages. To this end, the following stem/progenitor cell markers were used: Sox2, Pax6, Pax2, WT1, Nestin, Vimentin and Calretinin. The expression of these markers was compared with that of markers of mature neurons and glia, including Neurofilament (NF), neuron-specific enolase (NSE), Synaptophysin (Syn), glial fibrillary acidic protein (GFAP) and S100β. Evaluating both stem/progenitor and mature cell markers led to the identification of multiple stages of differentiation of neuronal and glial progenitors during gestation, helping to better understand the development of the human cerebellar cortex.

    Molecular approaches fighting nonsense

    Nonsense mutations are the result of single nucleotide substitutions in the DNA that change a sense codon (coding for an amino acid) to a nonsense or premature termination codon (PTC) within the coding region of the mRNA [...]

    Public Space and Water in European Cities (Espacio público y agua en las ciudades europeas)

    Analysis of urban regeneration projects on the waterfronts of European cities according to three categories (inclusion, integration and superposition), in order to: i) analyse the use of water in the design of contemporary urban form and the relationships that public space establishes with the liquid element; ii) identify the compositional and conceptual instruments of the urban regeneration examples analysed, with the aim of guiding future interventions on the Istanbul waterfront.

    Single-Neuron Correlates of Social Identity in Freely Interacting Female Rats

    The neuronal bases of social interaction are poorly understood, especially in terms of social motivation and social cognition. The present study used single-unit recording to investigate neuronal firing patterns in two interconnected regions, the amygdala and the piriform cortex, focusing on the latter. Olfactory signals strongly shape rodents’ social interaction: previous anatomical and physiological studies implicate the piriform cortex in olfactory pattern separation and pattern completion functions, which could support memory for the odour profile of familiar conspecifics. As a first step in characterising the rodent social amygdala and piriform cortex, I paired Lister Hooded rats in an apparatus in which they were fully free to engage in a variety of positive social interactions, including anogenital sniffing, face-to-face contact, and body contact. The apparatus was a 40 × 40 cm wooden square box with a wall height of 50 cm. I performed extracellular electrophysiological recordings from ensembles of single neurons tested in various social and non-social conditions (e.g. familiar rat in box vs empty box, or familiar rat vs novel rats). In addition, I simultaneously recorded behaviour, with images time-stamped in synchronization with the electrophysiological recordings. The present thesis shows that there are pyramidal neurons in the rodent amygdala and piriform cortex which respond strongly to social interaction. The main finding was that about 20% of cells in the piriform cortex showed firing patterns specific to either familiar or novel conspecifics, and a similar proportion showed firing patterns specific to one of two familiar sisters.

    Fast and Compact Set Intersection through Recursive Universe Partitioning

    We present a data structure that encodes a sorted integer sequence in small space while allowing fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursively partitioning the universe of representation, a markedly different paradigm from the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques show that the proposed solution embodies an improved space/time trade-off for the set intersection problem.
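
The idea of partitioning by universe rather than by sequence length can be sketched in a simplified form. This is an assumption-laden illustration, not the paper's layout: it uses a single level of partitioning instead of a recursive one, a hypothetical chunk width of 256 values, Python integers in place of fixed-width bitmaps, and no SIMD.

```python
CHUNK_BITS = 8                      # hypothetical: each chunk spans 256 values
CHUNK_MASK = (1 << CHUNK_BITS) - 1

def build(values):
    """Group values by the universe chunk they fall in; one bitmap per chunk."""
    chunks = {}
    for v in values:
        key, offset = v >> CHUNK_BITS, v & CHUNK_MASK
        chunks[key] = chunks.get(key, 0) | (1 << offset)
    return chunks

def intersect(a, b):
    """Intersect two sets chunk by chunk with a bitwise AND."""
    result = []
    for key in sorted(a.keys() & b.keys()):
        common = a[key] & b[key]
        base = key << CHUNK_BITS
        while common:
            lowest = common & -common            # isolate the lowest set bit
            result.append(base + lowest.bit_length() - 1)
            common ^= lowest                     # clear that bit
    return result

left = build([1, 5, 300, 70000])
right = build([5, 6, 300, 70001])
```

Because both sets are cut at the same universe boundaries, chunks that appear in only one set are skipped outright, and matching chunks are intersected with word-level operations rather than element-by-element comparisons.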