574 research outputs found
Handling Massive N-Gram Datasets Efficiently
This paper deals with the two fundamental problems concerning the handling of
large n-gram language models: indexing, that is compressing the n-gram strings
and associated satellite data without compromising their retrieval speed; and
estimation, that is computing the probability distribution of the strings from
a large textual source. Regarding the problem of indexing, we describe
compressed, exact and lossless data structures that achieve, at the same time,
high space reductions and no time degradation with respect to state-of-the-art
solutions and related software packages. In particular, we present a compressed
trie data structure in which each word following a context of fixed length k,
i.e., its preceding k words, is encoded as an integer whose value is
proportional to the number of words that follow such context. Since the number
of words following a given context is typically very small in natural
languages, we lower the space of representation to compression levels that were
never achieved before. Despite the significant savings in space, our technique
introduces a negligible penalty at query time. Regarding the problem of
estimation, we present a novel algorithm for estimating modified Kneser-Ney
language models, which have emerged as the de-facto choice for language modeling
in both academia and industry thanks to their relatively low perplexity.
Estimating such models from large textual sources poses the
challenge of devising algorithms that make parsimonious use of the disk. The
state-of-the-art algorithm uses three sorting steps in external memory: we show
an improved construction that requires only one sorting step by
exploiting the properties of the extracted n-gram strings. With an extensive
experimental analysis performed on billions of n-grams, we show an average
improvement of 4.5X on the total running time of the state-of-the-art approach.
Comment: Published in ACM Transactions on Information Systems (TOIS), February
2019, Article No: 2
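The context-based encoding described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_context_maps` and `remap`, the toy trigrams, and the choice of ranking followers alphabetically are all assumptions made for illustration. The point is only that a word is replaced by its rank among the words that follow the same context, so the integers stay small whenever a context has few followers.

```python
from collections import defaultdict

def build_context_maps(ngrams, k):
    """For each context of k words, map every distinct following word
    to its rank among the (sorted) followers of that context.
    Hypothetical helper, not taken from the paper."""
    followers = defaultdict(set)
    for gram in ngrams:
        context, word = tuple(gram[:k]), gram[k]
        followers[context].add(word)
    return {
        ctx: {w: rank for rank, w in enumerate(sorted(words))}
        for ctx, words in followers.items()
    }

def remap(gram, maps, k):
    """Return the small integer that stands for the last word of `gram`
    given its preceding k words."""
    return maps[tuple(gram[:k])][gram[k]]

# Toy data (assumed): four trigrams, so the context length is k = 2.
trigrams = [
    ("the", "cat", "sat"),
    ("the", "cat", "ran"),
    ("the", "dog", "sat"),
    ("a", "cat", "slept"),
]
context_maps = build_context_maps(trigrams, k=2)
codes = [remap(g, context_maps, k=2) for g in trigrams]
print(codes)
```

Each code is bounded by the number of followers of its context rather than by the vocabulary size, which is what makes the integers cheap to compress.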
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representations for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small, which is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.
Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
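For readers unfamiliar with the baseline, the following is a minimal sketch of plain Variable-Byte coding applied to the gaps of a posting list. It does not show the partitioning algorithm of the paper. The byte convention used here (7 data bits per byte, high bit set on every byte except the last one of a value) is one common choice and is an assumption, as are the function names and the sample postings.

```python
from itertools import accumulate

def vbyte_encode(values):
    """Encode non-negative integers, 7 bits per byte, least significant
    group first. The high bit marks that more bytes follow."""
    out = bytearray()
    for v in values:
        if v < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while v >= 128:
            out.append((v & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            v >>= 7
        out.append(v)  # final byte of this value, continuation bit clear
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values

# Assumed sample posting list: sorted document identifiers.
postings = [3, 7, 130, 131, 20000]
# Inverted lists are usually stored as gaps between consecutive identifiers.
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = vbyte_encode(gaps)
decoded = list(accumulate(vbyte_decode(encoded)))
print(len(encoded), decoded)
```

Every value costs at least one whole byte, which is why the scheme is wasteful on runs of very small gaps and why a partitioned representation can help.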
Efficient and Effective Query Auto-Completion
Query Auto-Completion (QAC) is a ubiquitous feature of modern textual search
systems, suggesting possible ways of completing the query being typed by the
user. Efficiency is crucial to give the system real-time responsiveness
when operating in the million-scale search space. Prior work has extensively
advocated the use of a trie data structure for fast prefix-search operations in
compact space. However, searching by prefix has little discovery power in that
only completions that are prefixed by the query are returned. This may
negatively impact the effectiveness of the QAC system, with a consequent monetary loss
for real applications like Web Search Engines and eCommerce. In this work we
describe the implementation that powers a new QAC system at eBay, and discuss
its efficiency/effectiveness in relation to other state-of-the-art
approaches. The solution is based on the combination of an inverted index
with succinct data structures, a much less explored direction in the
literature. This system is replacing the previous implementation based on
Apache SOLR, which was not always able to meet the required
service-level agreement.
Comment: Published in SIGIR 202
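The difference between searching by prefix and searching with an inverted index can be sketched as follows. This is only an illustration of the general idea, not the eBay system: the completions, the function names, and the rule that the last query token is treated as a term prefix while the earlier tokens must match whole terms are assumptions, and ranking and succinct encodings are left out entirely.

```python
from collections import defaultdict

# Assumed toy collection of completions.
completions = [
    "nike running shoes",
    "running shoes for men",
    "shoes nike air",
    "nikon camera",
]

def prefix_search(query):
    """Return only the completions that start with the query string."""
    return [c for c in completions if c.startswith(query)]

# Inverted index: term -> set of identifiers of completions containing it.
index = defaultdict(set)
for doc_id, completion in enumerate(completions):
    for term in completion.split():
        index[term].add(doc_id)

def conjunctive_search(query):
    """Return completions that contain every full query term and at least
    one term starting with the last (possibly incomplete) token."""
    tokens = query.split()
    if not tokens:
        return []
    *terms, last = tokens
    candidates = set(range(len(completions)))
    for term in terms:
        candidates &= index.get(term, set())
    with_prefix = set()
    for term, ids in index.items():
        if term.startswith(last):
            with_prefix |= ids
    return [completions[i] for i in sorted(candidates & with_prefix)]

print(prefix_search("shoes ni"))
print(conjunctive_search("shoes ni"))
```

On this toy input the prefix search finds a single completion, while the conjunctive search also returns "nike running shoes", in which the typed terms appear in a different order.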
Immunohistochemical markers of stem/progenitor cells in the developing human cerebellum
The aim of this study was to identify stem/progenitor cell markers by immunohistochemistry, in order to highlight the cortical neurogenesis niches at different gestational ages. To this end, the following stem/progenitor cell markers were used: Sox2, Pax6, Pax2, WT1, Nestin, Vimentin and Calretinin. The expression of these markers was compared with that of markers of mature neurons and glia, including Neurofilament (NF), neuron-specific enolase (NSE), Synaptophysin (Syn), glial fibrillary acidic protein (GFAP) and S100β.
Evaluating both stem/progenitor and mature cell markers led to the identification of multiple stages of differentiation of neuronal and glial progenitors during gestation, helping to better understand the development of the human cerebellar cortex.
Molecular approaches fighting nonsense
Nonsense mutations are the result of single nucleotide substitutions in the DNA that change a sense codon (coding for an amino acid) to a nonsense or premature termination codon (PTC) within the coding region of the mRNA [...]
Public Space and Water in European Cities
Analysis of urban regeneration projects on the waterfronts of European cities according to three categories (inclusion, integration and superposition), in order to: i) analyze the use of water in the design of contemporary urban form and the relationships that public space creates with the liquid element;
ii) identify the compositional and conceptual instruments of the urban regeneration
examples analyzed, with the aim of guiding future interventions on the Istanbul waterfront
Single-Neuron Correlates of Social Identity in Freely Interacting Female Rats
The neuronal bases of social interaction are poorly understood, especially in terms of social motivation and social cognition. The present study used single-unit recording to investigate neuronal firing patterns in two inter-connected regions: the amygdala and the piriform cortex, focusing on the latter region. Olfactory signals strongly shape rodents’ social interaction: previous anatomical and physiological studies implicate the piriform cortex in olfactory pattern separation and pattern completion functions, which could support memory for the odour profile of familiar conspecifics. As a first step in characterising the rodent social amygdala and piriform cortex, I paired Lister Hooded rats in an apparatus where they were fully free to engage in a variety of positive social interactions including anogenital sniffing, face to face contacts, and body contact. The apparatus was a 40x40 cm wooden square box with a wall height of 50 cm. I performed extracellular electrophysiological recordings from ensembles of single neurons tested in various social and non-social conditions (e.g. familiar rat in box vs empty box, or familiar rat vs novel rats). In addition, I simultaneously recorded behaviour with images time-stamped in synchronization with the electrophysiological recordings.
The present thesis shows that there are pyramidal neurons in the rodent amygdala and piriform cortex which respond strongly to social interaction. The main finding was that about 20% of cells in the piriform cortex showed firing patterns specific to either familiar or novel conspecifics, and a similar proportion showed firing patterns specific to one of two familiar sisters.
Fast and Compact Set Intersection through Recursive Universe Partitioning
We present a data structure that encodes a sorted integer sequence in small space allowing, at the same time, fast intersection operations. The data layout is carefully designed to exploit word-level parallelism and SIMD instructions, hence providing good practical performance. The core algorithmic idea is that of recursively partitioning the universe of representation: a markedly different paradigm from the widespread strategy of partitioning the sequence based on its length. Extensive experimentation and comparison against several competitive techniques show that the proposed solution embodies an improved space/time trade-off for the set intersection problem.
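The idea of partitioning the universe, rather than the sequence, can be sketched with a single level of partitioning. This is a simplified illustration and not the data structure of the paper: the chunk width `CHUNK_BITS = 8`, the use of one Python integer as a bitmap per chunk, and the function names are assumptions, and the recursion and SIMD layout are omitted. Values are grouped by their high bits, and two sets are intersected by AND-ing the bitmaps of the chunks they have in common.

```python
CHUNK_BITS = 8  # assumed chunk width: each chunk covers 2**8 consecutive values

def build(sorted_values):
    """Group values by their high bits and store the low bits of each
    group as a bitmap (a Python int with one bit per possible low value)."""
    chunks = {}
    for v in sorted_values:
        high = v >> CHUNK_BITS
        low = v & ((1 << CHUNK_BITS) - 1)
        chunks[high] = chunks.get(high, 0) | (1 << low)
    return chunks

def intersect(a, b):
    """Intersect two sets built with `build`. Only chunks present in both
    sets are examined, and each pair is combined with one bitwise AND."""
    result = []
    for high in sorted(a.keys() & b.keys()):
        common = a[high] & b[high]
        while common:
            low = (common & -common).bit_length() - 1  # index of lowest set bit
            result.append((high << CHUNK_BITS) | low)
            common &= common - 1  # clear the lowest set bit
    return result

# Assumed sample inputs.
first = build([1, 5, 300, 1000, 70000])
second = build([5, 299, 300, 70000, 70001])
print(intersect(first, second))
```

Because the chunk boundaries depend only on the universe, both sets are always aligned on the same boundaries, so matching chunks can be compared directly without searching.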