The NASA Astrophysics Data System: Architecture
The powerful discovery capabilities available in the ADS bibliographic
services are possible thanks to the design of a flexible search and retrieval
system based on a relational database model. Bibliographic records are stored
as a corpus of structured documents containing fielded data and metadata, while
discipline-specific knowledge is segregated in a set of files independent of
the bibliographic data itself.
The creation and management of links to both internal and external resources
associated with each bibliography in the database is made possible by
representing them as a set of document properties and their attributes.
To improve global access to the ADS data holdings, a number of mirror sites
have been created by cloning the database contents and software on a variety of
hardware and software platforms.
The procedures used to create and manage the database and its mirrors have
been written as a set of scripts that can be run in either an interactive or
unsupervised fashion.
The ADS can be accessed at http://adswww.harvard.edu
Comment: 25 pages, 8 figures, 3 tables
Competitive function approximation for reinforcement learning
The application of reinforcement learning to problems with continuous domains requires representing the value function by means of function approximation. We identify two aspects of reinforcement learning that make the function approximation process hard: non-stationarity of the target function and biased sampling. Non-stationarity is the result of the bootstrapping nature of dynamic programming, where the value function is estimated using its current approximation. Biased sampling occurs when some regions of the state space are visited too often, causing repeated updates with similar values that drown out the occasional updates of infrequently sampled regions.
We propose a competitive approach for function approximation where many different local approximators are available at a given input and the one expected to give the best approximation is selected by means of a relevance function. The local nature of the approximators allows their fast adaptation to non-stationary changes and mitigates the biased sampling problem. The coexistence of multiple approximators updated and tried in parallel permits obtaining a good estimation much faster than would be possible with a single approximator. Experiments on different benchmark problems show that the competitive strategy provides faster and more stable learning than non-competitive approaches.
Preprint
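The selection idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: the class names, the piecewise-constant local models, and the use of a running absolute error as the relevance signal are all assumptions made for the example.

```python
class LocalApproximator:
    """Constant model valid on [lo, hi), with a running error estimate (assumed relevance signal)."""

    def __init__(self, lo, hi, lr=0.5):
        self.lo, self.hi, self.lr = lo, hi, lr
        self.value = 0.0
        self.error = 1.0  # arbitrary initial error estimate (assumption)

    def covers(self, x):
        return self.lo <= x < self.hi

    def update(self, target):
        # Track the prediction error first, then move the model towards the target.
        err = abs(target - self.value)
        self.error += self.lr * (err - self.error)
        self.value += self.lr * (target - self.value)


class CompetitiveApproximator:
    """Keeps several overlapping local models and answers with the most reliable one."""

    def __init__(self, approximators):
        self.approximators = approximators

    def _candidates(self, x):
        return [a for a in self.approximators if a.covers(x)]

    def predict(self, x):
        # Competition: the covering model with the lowest running error wins.
        best = min(self._candidates(x), key=lambda a: a.error)
        return best.value

    def update(self, x, target):
        # All covering models are updated in parallel, whether or not they won.
        for a in self._candidates(x):
            a.update(target)


# Toy target: a step function that a single global constant cannot fit.
model = CompetitiveApproximator([
    LocalApproximator(0.0, 1.0),   # one coarse model over the whole domain
    LocalApproximator(0.0, 0.5),   # two finer models
    LocalApproximator(0.5, 1.0),
])
for _ in range(50):
    model.update(0.25, 0.0)
    model.update(0.75, 1.0)

left = model.predict(0.25)
right = model.predict(0.75)
```

In this toy run the coarse model keeps a large running error because it is pulled between both targets, so the two finer models win the competition on their own halves of the domain.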
Incremental Entity Resolution from Linked Documents
In many government applications we often find that information about
entities, such as persons, is available in disparate data sources such as
passports, driving licences, bank accounts, and income tax records. Similar
scenarios are commonplace in large enterprises having multiple customer,
supplier, or partner databases. Each data source maintains different aspects of
an entity, and resolving entities based on these attributes is a well-studied
problem. However, in many cases documents in one source reference those in
others; e.g., a person may provide his driving-licence number while applying
for a passport, or vice-versa. These links define relationships between
documents of the same entity (as opposed to inter-entity relationships, which
are also often used for resolution). In this paper we describe an algorithm to
cluster documents that are highly likely to belong to the same entity by
exploiting inter-document references in addition to attribute similarity. Our
technique uses a combination of iterative graph-traversal, locality-sensitive
hashing, iterative match-merge, and graph-clustering to discover unique
entities based on a document corpus. A unique feature of our technique is that
new sets of documents can be added incrementally while having to re-resolve
only a small subset of a previously resolved entity-document collection. We
present performance and quality results on two data-sets: a real-world database
of companies and a large synthetically generated 'population' database. We also
demonstrate the benefit of using inter-document references for clustering in the
form of enhanced recall of documents for resolution.
Comment: 15 pages, 8 figures, patented work
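To make the combination of inter-document references and attribute similarity concrete, here is a minimal sketch using a union-find structure. It is not the algorithm from the paper: the class names, the Jaccard threshold, and the brute-force pairwise comparison (which the paper's locality-sensitive hashing is meant to avoid) are assumptions chosen to keep the example short.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def add(self, x):
        self.parent.setdefault(x, x)

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


class IncrementalResolver:
    """Clusters documents by explicit references or attribute similarity (hypothetical helper)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.docs = {}  # doc_id -> (attribute tokens, referenced doc ids)
        self.uf = UnionFind()

    def add(self, doc_id, tokens, refs=()):
        tokens, refs = set(tokens), set(refs)
        self.docs[doc_id] = (tokens, refs)
        self.uf.add(doc_id)
        # Only the new document is compared; existing clusters are left as they are.
        for other, (other_tokens, other_refs) in self.docs.items():
            if other == doc_id:
                continue
            linked = other in refs or doc_id in other_refs
            similar = jaccard(tokens, other_tokens) >= self.threshold
            if linked or similar:
                self.uf.union(doc_id, other)

    def clusters(self):
        groups = {}
        for d in self.docs:
            groups.setdefault(self.uf.find(d), set()).add(d)
        return sorted(sorted(g) for g in groups.values())


resolver = IncrementalResolver(threshold=0.5)
resolver.add("passport-1", ["john", "smith", "1980"], refs=["licence-7"])
resolver.add("licence-7", ["j", "smith"])            # weak attribute match, but referenced
resolver.add("tax-3", ["mary", "jones", "1975"])
resolver.add("bank-9", ["mary", "jones"])            # no reference, strong attribute match
result = resolver.clusters()
```

Here `licence-7` joins `passport-1` only because of the reference, which is the recall benefit the abstract points to; `bank-9` joins `tax-3` through attribute similarity alone.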
Seismic Ray Impedance Inversion
This thesis investigates a prestack seismic inversion scheme implemented in the ray
parameter domain. Conventionally, most prestack seismic inversion methods are
performed in the incidence angle domain. However, because ray impedance honours
the variation in ray path that follows from elastic parameter variation under
Snell’s law, inversion based on it shows a greater capacity to discriminate
different lithologies than conventional elastic impedance inversion.
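As a small illustration of why the two domains differ: Snell's law keeps the ray parameter p = sin(θ)/v constant along a ray, so a constant-p profile corresponds to a different incidence angle in every layer. The sketch below assumes hypothetical P-wave velocities for three layers; the function name and the values are illustrative and do not come from the thesis.

```python
import math


def incidence_angles(p, velocities):
    """Incidence angle in degrees in each layer for ray parameter p (s/m).

    Returns None for layers where p * v > 1, i.e. beyond the critical angle.
    """
    angles = []
    for v in velocities:
        s = p * v  # sin(theta) in this layer, from Snell's law
        angles.append(math.degrees(math.asin(s)) if s <= 1.0 else None)
    return angles


# Hypothetical layer velocities in m/s (assumed values for illustration).
velocities = [2000.0, 3000.0, 4500.0]
p = math.sin(math.radians(30.0)) / velocities[0]  # ray leaving the first layer at 30 degrees
angles = incidence_angles(p, velocities)
```

With these assumed velocities the same ray travels at 30 degrees in the first layer and at about 48.6 degrees in the second, and it cannot propagate into the third layer, so a single incidence angle does not describe the whole ray path.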
The procedure starts with data transformation into the ray-parameter domain and then
implements the ray impedance inversion along constant ray-parameter profiles. With
different constant-ray-parameter profiles, mixed-phase wavelets are initially estimated
based on the high-order statistics of the data and further refined after a proper well-to-seismic
tie. With the estimated wavelets ready, a Cauchy inversion method is used to
recover seismic reflectivity sequences suitable for blocky impedance inversion.
The impedance inversion from reflectivity
sequences adopts a standard generalised linear inversion scheme, whose results are
utilised to identify rock properties and facilitate quantitative interpretation. It has also
been demonstrated that we can further invert elastic parameters from ray impedance
values, without eliminating an extra density term or introducing a Gardner’s relation
to absorb this term.
Ray impedance inversion is extended to P-S converted waves by introducing the
definition of converted-wave ray impedance. This quantity shows some advantages in
connecting prestack converted wave data with well logs, if compared with the shear-wave
elastic impedance derived from the Aki and Richards approximation to the
Zoeppritz equations. An analysis of P-P and P-S wave data under the framework of
ray impedance is conducted through a real multicomponent dataset, which can reduce
the uncertainty in lithology identification.
Inversion is the key method in generating those examples throughout the entire thesis
as we believe it can render robust solutions to geophysical problems. Apart from the
reflectivity sequence, ray impedance and elastic parameter inversion mentioned above,
inversion methods are also adopted in transforming the prestack data from the offset
domain to the ray-parameter domain, mixed-phase wavelet estimation, as well as the
registration of P-P and P-S waves for the joint analysis.
The ray impedance inversion methods are successfully applied to different types of
datasets. For each individual step towards the ray impedance inversion, the advantages,
disadvantages and limitations of the algorithms adopted are detailed. In
conclusion, the ray impedance analyses demonstrated in this thesis are highly
competitive with the classical elastic impedance methods, and the author
recommends them for wider application.
Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising
Sponsored search represents a major source of revenue for web search engines.
This popular advertising model brings a unique possibility for advertisers to
target users' immediate intent communicated through a search query, usually by
displaying their ads alongside organic search results for queries deemed
relevant to their products or services. However, due to a large number of
unique queries it is challenging for advertisers to identify all such relevant
queries. For this reason search engines often provide a service of advanced
matching, which automatically finds additional relevant queries for advertisers
to bid on. We present a novel advanced matching approach based on the idea of
semantic embeddings of queries and ads. The embeddings were learned using a
large data set of user search sessions, consisting of search queries, clicked
ads and search links, while utilizing contextual information such as dwell time
and skipped ads. To address the large-scale nature of our problem, both in
terms of data and vocabulary size, we propose a novel distributed algorithm for
training of the embeddings. Finally, we present an approach for overcoming a
cold-start problem associated with new ads and queries. We report results of
editorial evaluation and online tests on actual search traffic. The results
show that our approach significantly outperforms baselines in terms of
relevance, coverage, and incremental revenue. Lastly, we open-source learned
query embeddings to be used by researchers in computational advertising and
related fields.
Comment: 10 pages, 4 figures, 39th International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy
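Once queries and ads share an embedding space, finding additional relevant queries for an ad reduces to a nearest-neighbour search. The sketch below shows that retrieval step only, with cosine similarity over tiny hand-written vectors. The function names, the vectors, and the `min_sim` cut-off are assumptions for illustration; the paper's training procedure and its learned embeddings are not reproduced here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_queries(ad_vec, query_vecs, k=2, min_sim=0.0):
    """Return up to k queries most similar to the ad, best first."""
    scored = [(cosine(ad_vec, vec), q) for q, vec in query_vecs.items()]
    scored = [(s, q) for s, q in scored if s >= min_sim]
    scored.sort(key=lambda sq: (-sq[0], sq[1]))  # by similarity, ties by query text
    return [q for _, q in scored[:k]]


# Toy 3-dimensional embeddings (made-up values, not learned from data).
ad_vec = [0.9, 0.1, 0.0]  # an ad for running shoes
query_vecs = {
    "buy sneakers": [0.8, 0.2, 0.1],
    "marathon training": [0.6, 0.5, 0.0],
    "cheap flights": [0.0, 0.1, 0.9],
}
matches = match_queries(ad_vec, query_vecs, k=2, min_sim=0.5)
```

At production scale an exhaustive scan like this would normally be replaced by an approximate nearest-neighbour index; the scan is used here only to keep the example self-contained.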
A Practical Searchable Symmetric Encryption Scheme for Smart Grid Data
Outsourcing data storage to the remote cloud can be an economical solution to
enhance data management in the smart grid ecosystem. To protect the privacy of
data, the utility company may choose to encrypt the data before uploading them
to the cloud. However, while encryption provides confidentiality to data, it
also sacrifices the data owners' ability to query a specific segment in their
data. Searchable symmetric encryption (SSE) is a technology that enables users to
store documents in ciphertext form while keeping the functionality to search
keywords in the documents. However, most state-of-the-art SSE algorithms
focus only on general document storage, which may be unsuitable for
smart grid applications. In this paper, we propose a simple, practical SSE
scheme that aims to protect the privacy of data generated in the smart grid.
Our scheme achieves high space complexity with small information disclosure
that is acceptable for practical smart grid applications. We also implement a
prototype over the statistical data of advanced metering infrastructure to show
the effectiveness of our approach.
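For readers unfamiliar with SSE, the basic idea of searching without revealing keywords can be sketched as a keyed index: the client derives a deterministic token per keyword with HMAC, and the server stores only tokens mapped to document identifiers. This is a generic textbook-style illustration, not the scheme proposed in the paper. The function names, the sample meter records, and the key are assumptions, document encryption itself is omitted, and an index like this still leaks search and access patterns.

```python
import hashlib
import hmac


def token(key: bytes, keyword: str) -> str:
    """Deterministic search token for a keyword; only holders of the key can compute it."""
    return hmac.new(key, keyword.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, docs: dict) -> dict:
    """Client side: map keyword tokens to document ids before uploading."""
    index = {}
    for doc_id, keywords in docs.items():
        for kw in set(keywords):
            index.setdefault(token(key, kw), []).append(doc_id)
    return index


def search(index: dict, tok: str) -> list:
    """Server side: look up a token without learning the keyword behind it."""
    return sorted(index.get(tok, []))


# Hypothetical meter records and key (illustration only, not a real key).
key = b"demo-key-not-for-production"
docs = {
    "meter-001": ["peak", "district-a"],
    "meter-002": ["off-peak", "district-a"],
}
index = build_index(key, docs)
hits = search(index, token(key, "district-a"))
misses = search(index, token(b"some-other-key", "district-a"))
```

The server-side index contains no plaintext keywords, and a token computed under a different key matches nothing.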