4,188 research outputs found
Implications of Computational Cognitive Models for Information Retrieval
This dissertation explores the implications of computational cognitive modeling for information retrieval. The parallel between information retrieval and human memory is that the goal of an information retrieval system is to find the set of documents most relevant to the query whereas the goal for the human memory system is to access the relevance of items stored in memory given a memory probe (Steyvers & Griffiths, 2010).
The two major topics of this dissertation are desirability and information scent. Desirability is the context independent probability of an item receiving attention (Recker & Pitkow, 1996). Desirability has been widely utilized in numerous experiments to model the probability that a given memory item would be retrieved (Anderson, 2007). Information scent is a context dependent measure defined as the utility of an information item (Pirolli & Card, 1996b). Information scent has been widely utilized to predict the memory item that would be retrieved given a probe (Anderson, 2007) and to predict the browsing behavior of humans (Pirolli & Card, 1996b).
In this dissertation, I proposed the theory that desirability observed in human memory is caused by preferential attachment in networks. Additionally, I showed that documents accessed in large repositories mirror the observed statistical properties in human memory and that these properties can be used to improve document ranking. Finally, I showed that the combination of information scent and desirability improves document ranking over existing well-established approaches
Measuring academic influence: Not all citations are equal
The importance of a research article is routinely measured by counting how
many times it has been cited. However, treating all citations with equal weight
ignores the wide variety of functions that citations perform. We want to
automatically identify the subset of references in a bibliography that have a
central academic influence on the citing paper. For this purpose, we examine
the effectiveness of a variety of features for determining the academic
influence of a citation. By asking authors to identify the key references in
their own work, we created a data set in which citations were labeled according
to their academic influence. Using automatic feature selection with supervised
machine learning, we found a model for predicting academic influence that
achieves good performance on this data set using only four features. The best
features, among those we evaluated, were those based on the number of times a
reference is mentioned in the body of a citing paper. The performance of these
features inspired us to design an influence-primed h-index (the hip-index).
Unlike the conventional h-index, it weights citations by how many times a
reference is mentioned. According to our experiments, the hip-index is a better
indicator of researcher performance than the conventional h-index
Identifying Citation Contexts: a Review of Strategies and Goals.
The Citation Contexts of a cited entity can be seen as little tesserae that, fit together, can be exploited to follow the opinion of the scientific community towards that entity as well as to summarize its most important contents. This mosaic is an excellent resource of information also for identifying topic specific synonyms, indexing terms and citers’ motivations, i.e. the reasons why authors cite other works. Is a paper cited for comparison, as a source of data or just for additional info? What is the polarity of a citation? Different reasons for citing reveal also different weights of the citations and different impacts of the cited authors that go beyond the mere citation count metrics. Identifying the appropriate Citation Context is the first step toward a multitude of possible analysis and researches. So far, Citation Context have been defined in several ways in literature, related to different purposes, domains and applications. In this paper we present different dimensions of Citation Context investigated by researchers through the years in order to provide an introductory review of the topic to anyone approaching this subject.Possiamo pensare ai Contesti Citazionali come tante tessere che, unite, possono essere sfruttate per seguire l’opinione della comunità scientifica riguardo ad un determinato lavoro o per riassumerne i contenuti più importanti. Questo mosaico di informazioni può essere utilizzato per identificare sinonimi specifici e Index Terms nonchè per individuare i motivi degli autori dietro le citazioni. Identificare il Contesto Citazionale ottimale è il primo passo per numerose analisi e ricerche. Il Contesto Citazionale è stato definito in diversi modi in letteratura, in relazione a differenti scopi, domini e applicazioni. In questo paper presentiamo le principali dimensioni testuali di Contesto Citazionale investigate dai ricercatori nel corso degli anni
Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain
The amount of archaeological literature is growing rapidly. Until recently,
these data were only accessible through metadata search. We implemented a text
retrieval engine for a large archaeological text collection ( Million
words). In archaeological IR, domain-specific entities such as locations, time
periods, and artefacts, play a central role. This motivated the development of
a named entity recognition (NER) model to annotate the full collection with
archaeological named entities. In this paper, we present ArcheoBERTje, a BERT
model pre-trained on Dutch archaeological texts. We compare the model's quality
and output on a Named Entity Recognition task to a generic multilingual model
and a generic Dutch model. We also investigate ensemble methods for combining
multiple BERT models, and combining the best BERT model with a domain thesaurus
using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms
both the multilingual and Dutch model significantly with a smaller standard
deviation between runs, reaching an average F1 score of 0.735. The model also
outperforms ensemble methods combining the three models. Combining ArcheoBERTje
predictions and explicit domain knowledge from the thesaurus did not increase
the F1 score. We quantitatively and qualitatively analyse the differences
between the vocabulary and output of the BERT models on the full collection and
provide some valuable insights in the effect of fine-tuning for specific
domains. Our results indicate that for a highly specific text domain such as
archaeology, further pre-training on domain-specific data increases the model's
quality on NER by a much larger margin than shown for other domains in the
literature, and that domain-specific pre-training makes the addition of domain
knowledge from a thesaurus unnecessary
Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression – the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work
Recommended from our members
Automated Vocabulary Discovery for Geo-Parsing Online Epidemic Intelligence
Background Automated surveillance of the Internet provides a timely and sensitive method for alerting on global emerging infectious disease threats. HealthMap is part of a new generation of online systems designed to monitor and visualize, on a real-time basis, disease outbreak alerts as reported by online news media and public health sources. HealthMap is of specific interest for national and international public health organizations and international travelers. A particular task that makes such a surveillance useful is the automated discovery of the geographic references contained in the retrieved outbreak alerts. This task is sometimes referred to as "geo-parsing". A typical approach to geo-parsing would demand an expensive training corpus of alerts manually tagged by a human.Results Given that human readers perform this kind of task by using both their lexical and contextual knowledge, we developed an approach which relies on a relatively small expert-built gazetteer, thus limiting the need of human input, but focuses on learning the context in which geographic references appear. We show in a set of experiments, that this approach exhibits a substantial capacity to discover geographic locations outside of its initial lexicon.Conclusion The results of this analysis provide a framework for future automated global surveillance efforts that reduce manual input and improve timeliness of reporting
- …