Query-Based Sampling using Only Snippets
Query-based sampling is a popular approach to modelling the content of an uncooperative server. It works by sending queries to the server and downloading, in full, the documents returned in the search results. This sample of documents then represents the server's content. We present an approach that uses the document snippets as samples instead of downloading entire documents. This yields more stable results at the same bandwidth usage as the full-document approach. Additionally, we show that using snippets does not necessarily incur more latency, but can actually save time.
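The idea in this abstract can be sketched as follows, with a toy in-memory index standing in for the uncooperative server's search interface; the function names and corpus are illustrative assumptions, not the authors' system.

```python
from collections import Counter

# Toy stand-in for an uncooperative server's search interface. In the real
# setting, snippets would come back over the network from actual queries.
TOY_INDEX = {
    "cat": ["the cat sat on the mat", "a cat chased a mouse"],
    "dog": ["the dog barked at the cat", "a dog fetched the ball"],
}

def search_snippets(query):
    """Return short result snippets for a query (hypothetical API)."""
    return TOY_INDEX.get(query, [])

def sample_content_model(queries):
    """Build a term-frequency model of the server from snippets only,
    never downloading the full documents."""
    model = Counter()
    for q in queries:
        for snippet in search_snippets(q):
            model.update(snippet.split())
    return model

model = sample_content_model(["cat", "dog"])
print(model.most_common(2))
```

The resulting term-frequency model plays the role of the server's content representation; using snippets rather than full documents trades completeness per result for more results at the same bandwidth.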
A review of domain adaptation without target labels
Domain adaptation has become a prominent problem setting in machine learning
and related fields. This review asks the question: how can a classifier learn
from a source domain and generalize to a target domain? We present a
categorization of approaches, divided into, what we refer to as, sample-based,
feature-based and inference-based methods. Sample-based methods focus on
weighting individual observations during training based on their importance to
the target domain. Feature-based methods revolve around mapping, projecting
and representing features such that a source classifier performs well on the
target domain. Inference-based methods incorporate adaptation into the
parameter estimation procedure, for instance through constraints on the
optimization procedure. Additionally, we review a number of conditions that
allow for formulating bounds on the cross-domain generalization error. Our
categorization highlights recurring ideas and raises questions important to
further research.

Comment: 20 pages, 5 figures
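The sample-based category described above can be sketched as importance weighting. A toy example, assuming (purely for illustration) that source and target densities are known one-dimensional Gaussians; in practice the density ratio must itself be estimated:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(20000)]  # samples from p_S

# Weight each source observation by p_T(x) / p_S(x) -- its importance to the
# target domain -- then estimate a target-domain quantity (here the mean)
# from source samples alone.
weights = [normal_pdf(x, 1.0, 1.0) / normal_pdf(x, 0.0, 1.0) for x in source]
target_mean = sum(w * x for x, w in zip(source, weights)) / sum(weights)
print(round(target_mean, 2))  # close to the target mean of 1.0
```

In classifier training, the same weights would multiply each example's loss, so that observations resembling the target domain dominate the fit.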
Exploration and Exploitation of Victorian Science in Darwin's Reading Notebooks
Search in an environment with an uncertain distribution of resources involves
a trade-off between exploitation of past discoveries and further exploration.
This extends to information foraging, where a knowledge-seeker shifts between
reading in depth and studying new domains. To study this decision-making
process, we examine the reading choices made by one of the most celebrated
scientists of the modern era: Charles Darwin. From the full-text of books
listed in his chronologically-organized reading journals, we generate topic
models to quantify his local (text-to-text) and global (text-to-past) reading
decisions using Kullback-Leibler divergence, a cognitively-validated,
information-theoretic measure of relative surprise. Rather than a pattern of
surprise-minimization, corresponding to a pure exploitation strategy, Darwin's
behavior shifts from early exploitation to later exploration, seeking unusually
high levels of cognitive surprise relative to previous eras. These shifts,
detected by an unsupervised Bayesian model, correlate with major intellectual
epochs of his career as identified both by qualitative scholarship and Darwin's
own self-commentary. Our methods allow us to compare his consumption of texts
with their publication order. We find Darwin's consumption more exploratory
than the culture's production, suggesting that underneath gradual societal
changes are the explorations of individual synthesis and discovery. Our
quantitative methods advance the study of cognitive search through a framework
for testing interactions between individual and collective behavior and between
short- and long-term consumption choices. This novel application of topic
modeling to characterize individual reading complements widespread studies of
collective scientific behavior.

Comment: Cognition pre-print, published February 2017; 22 pages, plus 17 pages of supporting information, 7 pages of references
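The surprise measure described in this abstract can be illustrated with a small computation: Kullback-Leibler divergence between topic distributions. The topic proportions below are made up for the sketch, not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; assumes matching supports with q[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

past_reading = [0.7, 0.2, 0.1]   # topic mixture of previously read texts
new_text_a   = [0.65, 0.25, 0.1] # close to past reading: low surprise
new_text_b   = [0.1, 0.2, 0.7]   # far from past reading: high surprise

print(kl_divergence(new_text_a, past_reading))
print(kl_divergence(new_text_b, past_reading))
```

A reader pursuing pure exploitation would keep surprise (the divergence from past reading) low; an exploratory phase shows up as repeatedly choosing texts like `new_text_b`.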
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
decoupled from the strength of their semantic dependence. For example, "red
tape" might be less frequent overall than "tape measure" in some corpus, but
this does not mean that "red"+"tape" are less dependent than
"tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
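The frequency-versus-dependence distinction in this abstract can be illustrated numerically with pointwise mutual information (PMI), one standard dependence score, though not necessarily the paper's own measure. The corpus counts below are invented for the example.

```python
import math

N = 1_000_000  # total bigram observations in a hypothetical corpus

# (count of word 1, count of word 2, co-occurrence count) -- invented numbers
counts = {
    ("red", "tape"):     (2_000, 5_000, 50),
    ("tape", "measure"): (5_000, 40_000, 120),
}

def pmi(c1, c2, c12):
    """Pointwise mutual information of a bigram: how much more often the
    pair co-occurs than independence would predict."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

for (w1, w2), (c1, c2, c12) in counts.items():
    print(w1, w2, "freq:", c12, "PMI:", round(pmi(c1, c2, c12), 2))
```

Here "red tape" co-occurs less often than "tape measure" in raw counts, yet its PMI is higher, because "red" and "tape" are individually rarer: frequency alone would rank the dependence the wrong way round.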