Automatic abstracting: a review and an empirical evaluation
Abstracts are a fundamental tool in information retrieval. As condensed representations,
they conserve the increasingly precious search time and space of scholars, allowing them to manage an ever-growing deluge of documentation more effectively.
Abstracts have traditionally been the product of human intellectual effort, but attempts to
automate the abstracting process began in 1958. Two identifiable automatic abstracting techniques emerged which
reflect differing levels of ambition regarding simulation of the human abstracting process,
namely sentence extraction and text summarisation. This research paradigm has recently
diversified further, with a cross-fertilisation of methods. Commercial systems are beginning
to appear, but automatic abstracting is still mainly confined to an experimental arena.
The purpose of this study is firstly to chart the historical development and current state of
both manual and automatic abstracting; and secondly, to devise and implement an empirical
user-based evaluation to assess the adequacy of automatic abstracts derived from sentence
extraction techniques according to a set of utility criteria. [Continues.]
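The sentence-extraction technique evaluated above can be illustrated with a minimal, Luhn-style sketch: score each sentence by the corpus frequency of its content words and keep the top scorers in their original order. This is an illustrative baseline only, not the specific extraction system evaluated in the study; the stopword list and scoring rule are simplifying assumptions.

```python
import re
from collections import Counter

def extract_summary(text, num_sentences=2):
    """Luhn-style sentence extraction: score each sentence by the summed
    document frequency of its content words, then return the top-scoring
    sentences in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    # Tiny illustrative stopword list; a real system would use a fuller one.
    stopwords = {'the', 'a', 'an', 'of', 'to', 'in', 'and', 'is', 'are'}
    freq = Counter(w for w in words if w not in stopwords)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r'[a-z]+', sentence.lower()))

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Preserve the sentences' original order in the output extract.
    return [s for s in sentences if s in ranked]
```

Extracts produced this way are exactly the kind evaluated in the study: grammatical at the sentence level, but with no guarantee of coherence between extracted sentences.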
Privacy-preserving data outsourcing in the cloud via semantic data splitting
Even though cloud computing provides many intrinsic benefits, privacy
concerns related to the lack of control over the storage and management of the
outsourced data still prevent many customers from migrating to the cloud.
Several privacy-protection mechanisms based on a prior encryption of the data
to be outsourced have been proposed. Data encryption offers robust security,
but at the cost of hampering the efficiency of the service and limiting the
functionalities that can be applied over the (encrypted) data stored on cloud
premises. Because both efficiency and functionality are crucial advantages of
cloud computing, in this paper we aim to retain them by proposing a
privacy-protection mechanism that relies on splitting (clear) data, and on the
distributed storage offered by the increasingly popular notion of multi-clouds.
We propose a semantically-grounded data splitting mechanism that is able to
automatically detect pieces of data that may cause privacy risks and split them
on local premises, so that each chunk does not incur those risks; then,
chunks of clear data are independently stored in the separate locations of a
multi-cloud, so that external entities cannot access the whole body of
confidential data. Because partial data are stored in the clear on cloud premises,
outsourced functionalities are seamlessly and efficiently supported by just
broadcasting queries to the different cloud locations. To enforce a robust
privacy notion, our proposal relies on a privacy model that offers a priori
privacy guarantees; to ensure its feasibility, we have designed heuristic
algorithms that minimize the number of cloud storage locations we need; to show
its potential and generality, we have applied it to the least structured and
most challenging data type: plain textual documents.
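The splitting idea can be sketched in a few lines: assign chunks of clear text greedily to storage locations so that no single location accumulates every term of any privacy rule (a set of terms deemed risky only in combination). This is a deliberately simplified illustration, not the paper's mechanism: here the risky term combinations are supplied explicitly, whereas the paper detects them semantically, and we assume no single chunk violates a rule on its own.

```python
def split_for_privacy(chunks, rules):
    """Greedily distribute text chunks across storage locations so that no
    location ends up holding all terms of any privacy rule.
    chunks: list of strings; rules: list of sets of risky term combinations.
    Assumes each chunk is individually safe (violates no rule by itself).
    Returns a list of locations, each a list of chunks."""
    risky = {t for rule in rules for t in rule}
    locations = []  # each entry: [chunk_list, risky_terms_present]
    for chunk in chunks:
        terms = set(chunk.lower().split()) & risky
        for loc in locations:
            # Place the chunk here only if the combination stays safe.
            if not any(rule <= (loc[1] | terms) for rule in rules):
                loc[0].append(chunk)
                loc[1] |= terms
                break
        else:
            # No existing location is safe: open a new storage location.
            locations.append([[chunk], set(terms)])
    return [loc[0] for loc in locations]
```

Because each location holds clear text, queries can simply be broadcast to all locations and the partial results merged, which is the efficiency argument made above; the greedy placement also mirrors the paper's goal of minimizing the number of cloud locations, though the heuristics in the paper are more elaborate.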
Supervised Learning with Similarity Functions
We address the problem of general supervised learning when data can only be
accessed through an (indefinite) similarity function between data points.
Existing work on learning with indefinite kernels has concentrated solely on
binary/multi-class classification problems. We propose a model that is generic
enough to handle any supervised learning task and also subsumes the model
previously proposed for classification. We give a "goodness" criterion for
similarity functions w.r.t. a given supervised learning task and then adapt a
well-known landmarking technique to provide efficient algorithms for supervised
learning using "good" similarity functions. We demonstrate the effectiveness of
our model on three important supervised learning problems: a) real-valued
regression, b) ordinal regression and c) ranking where we show that our method
guarantees bounded generalization error. Furthermore, for the case of
real-valued regression, we give a natural goodness definition that, when used
in conjunction with a recent result in sparse vector recovery, guarantees a
sparse predictor with bounded generalization error. Finally, we report results
of our learning algorithms on regression and ordinal regression tasks using
non-PSD similarity functions and demonstrate the effectiveness of our
algorithms, especially that of the sparse landmark selection algorithm that
achieves significantly higher accuracies than the baseline methods while
offering reduced computational costs.

Comment: To appear in the proceedings of NIPS 2012; 30 pages.
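The landmarking idea underlying the algorithms above can be sketched for the regression case: embed each point as its vector of similarities to a set of randomly chosen landmarks, then fit an ordinary linear predictor in that embedding. The similarity function never needs to be a PSD kernel. This is a minimal sketch of the generic technique, not the paper's exact algorithm; the ridge regularisation and random landmark choice are simplifying assumptions (the paper's sparse landmark selection is more refined).

```python
import numpy as np

def landmark_features(X, landmarks, sim):
    """Embed each point as its vector of similarities to the landmarks."""
    return np.array([[sim(x, l) for l in landmarks] for x in X])

def fit_landmark_regressor(X, y, sim, num_landmarks=20, ridge=1e-3, seed=0):
    """Landmarking for real-valued regression with an arbitrary (possibly
    indefinite, non-PSD) similarity function: pick random landmarks, map
    points to similarity space, fit a ridge-regularised linear predictor."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(num_landmarks, len(X)), replace=False)
    landmarks = [X[i] for i in idx]
    Phi = landmark_features(X, landmarks, sim)
    d = Phi.shape[1]
    # Closed-form ridge solution: (Phi^T Phi + ridge*I) w = Phi^T y.
    w = np.linalg.solve(Phi.T @ Phi + ridge * np.eye(d), Phi.T @ y)
    return lambda X_new: landmark_features(X_new, landmarks, sim) @ w
```

For example, `sim = lambda a, b: -abs(a - b)` is indefinite yet works directly, which is exactly the freedom over kernel methods that the model provides.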
Memory Structure and Cognitive Maps
A common way to understand memory structures in the cognitive sciences is as a "cognitive map".
Cognitive maps are representational systems organized by dimensions shared with physical space. The
appeal to these maps begins literally: as an account of how spatial information is represented and used
to inform spatial navigation. Invocations of cognitive maps, however, are often more ambitious;
cognitive maps are meant to scale up and provide the basis for our more sophisticated memory
capacities. The extension is not meant to be metaphorical, but the way in which these richer mental
structures are supposed to remain map-like is rarely made explicit. Here we investigate this missing
link, asking: "How do cognitive maps represent non-spatial information?" We begin with a survey of
foundational work on spatial cognitive maps and then provide a comparative review of alternative,
non-spatial representational structures. We then turn to several cutting-edge projects that are engaged
in the task of scaling up cognitive maps so as to accommodate non-spatial information: first, on the
"spatial-isometric approach", encoding content that is non-spatial but in some sense isomorphic to
spatial content; second, on the "abstraction approach", encoding content that is an abstraction over
first-order spatial information; and third, on the "embedding approach", embedding non-spatial
information within a spatial context, a prominent example being the Method-of-Loci. Putting these
cases alongside one another reveals the variety of options available for building cognitive maps, and the
distinctive limitations of each. We conclude by reflecting on where these results take us in terms of
understanding the place of cognitive maps in memory.
Users' perception of relevance of spoken documents
We present the results of a study of users' perception of the relevance of documents. The aim is to study experimentally how users' perception varies depending on the form in which retrieved documents are presented. Documents retrieved in response to a query are presented to users in a variety of ways, from full text to a machine-spoken, query-biased, automatically generated summary, and the difference in users' perception of relevance is studied. The experimental results suggest that the effectiveness of advanced multimedia information retrieval applications may be affected by users' low perception of the relevance of retrieved documents.
Analysis of Crowdsourced Sampling Strategies for HodgeRank with Sparse Random Graphs
Crowdsourcing platforms are now extensively used for conducting subjective
pairwise comparison studies. In this setting, a pairwise comparison dataset is
typically gathered via random sampling, either \emph{with} or \emph{without}
replacement. In this paper, we use tools from random graph theory to analyze
these two random sampling methods for the HodgeRank estimator. Using the
Fiedler value of the graph as a measurement for estimator stability
(informativeness), we provide a new estimate of the Fiedler value for these two
random graph models. In the asymptotic limit as the number of vertices tends to
infinity, we prove the validity of the estimate. Based on our findings, for a
small number of items to be compared, we recommend a two-stage sampling
strategy where a greedy sampling method is used initially and random sampling
\emph{without} replacement is used in the second stage. When a large number of
items is to be compared, we recommend random sampling with replacement as this
is computationally inexpensive and trivially parallelizable. Experiments on
synthetic and real-world datasets support our analysis.
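The stability measurement used above is easy to reproduce: build the comparison graph from the sampled pairs and compute its Fiedler value, the second-smallest eigenvalue of the graph Laplacian. The sketch below is illustrative, not the authors' experimental code; the dense-Laplacian eigendecomposition is a simplifying assumption that only scales to small item sets.

```python
import itertools
import random
import numpy as np

def fiedler_value(n, edges):
    """Second-smallest eigenvalue of the (weighted) graph Laplacian on n
    vertices; repeated edges, as arise when sampling with replacement,
    simply increase the corresponding edge weights."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return np.sort(np.linalg.eigvalsh(L))[1]

def sample_pairs(n, m, replacement, seed=0):
    """Sample m comparison pairs from the n*(n-1)/2 possible pairs,
    either with or without replacement."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(range(n), 2))
    if replacement:
        return [rng.choice(pairs) for _ in range(m)]
    return rng.sample(pairs, m)
```

A disconnected comparison graph has Fiedler value zero, in which case the HodgeRank global ranking is not identifiable; the larger the Fiedler value, the more stable (informative) the estimator, which is what makes it a useful yardstick for comparing the two sampling schemes.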