The NASA Astrophysics Data System: Data Holdings
Since its inception in 1993, the ADS Abstract Service has become an
indispensable research tool for astronomers and astrophysicists worldwide. In
those seven years, much effort has been directed toward improving both the
quantity and the quality of references in the database. From the original
database of approximately 160,000 astronomy abstracts, our dataset has grown
almost tenfold to approximately 1.5 million references covering astronomy,
astrophysics, planetary sciences, physics, optics, and engineering. We collect
and standardize data from approximately 200 journals and present the resulting
information in a uniform, coherent manner. With the cooperation of journal
publishers worldwide, we have been able to place scans of full journal articles
on-line back to the first volumes of many astronomical journals, and we are
able to link to current versions of articles, abstracts, and datasets for
essentially all of the current astronomy literature. The trend toward
electronic publishing in the field, the use of electronic submission of
abstracts for journal articles and conference proceedings, and the increasingly
prominent use of the World Wide Web to disseminate information have enabled the
ADS to build a database unparalleled in other disciplines.
The ADS can be accessed at http://adswww.harvard.edu
Comment: 24 pages, 1 figure, 6 tables, 3 appendices
Automatically assembling a full census of an academic field
The composition of the scientific workforce shapes the direction of
scientific research, directly through the selection of questions to
investigate, and indirectly through its influence on the training of future
scientists. In most fields, however, complete census information is difficult
to obtain, complicating efforts to study workforce dynamics and the effects of
policy. This is particularly true in computer science, which lacks a single,
all-encompassing directory or professional organization. A full census of
computer science would serve many purposes, not the least of which is a better
understanding of the trends and causes of unequal representation in computing.
Previous academic census efforts have relied on narrow or biased samples, or on
professional society membership rolls. A full census can be constructed
directly from online departmental faculty directories, but doing so by hand is
prohibitively expensive and time-consuming. Here, we introduce a topical web
crawler for automating the collection of faculty information from web-based
department rosters, and demonstrate the resulting system on the 205
PhD-granting computer science departments in the U.S. and Canada. This method
constructs a complete census of the field within a few minutes, and achieves
over 99% precision and recall. We conclude by comparing the resulting 2017
census to a hand-curated 2011 census to quantify turnover and retention in
computer science, in general and for female faculty in particular,
demonstrating the types of analysis made possible by automated census
construction.
Comment: 11 pages, 6 figures, 2 tables
The NASA Astrophysics Data System: Architecture
The powerful discovery capabilities available in the ADS bibliographic
services are possible thanks to the design of a flexible search and retrieval
system based on a relational database model. Bibliographic records are stored
as a corpus of structured documents containing fielded data and metadata, while
discipline-specific knowledge is segregated in a set of files independent of
the bibliographic data itself.
The creation and management of links to both internal and external resources
associated with each bibliography in the database is made possible by
representing them as a set of document properties and their attributes.
To improve global access to the ADS data holdings, a number of mirror sites
have been created by cloning the database contents and software on a variety of
hardware and software platforms.
The procedures used to create and manage the database and its mirrors have
been written as a set of scripts that can be run in either an interactive or
unsupervised fashion.
The ADS can be accessed at http://adswww.harvard.edu
Comment: 25 pages, 8 figures, 3 tables
Genie: A Generator of Natural Language Semantic Parsers for Virtual Assistant Commands
To understand diverse natural language commands, virtual assistants today are
trained with numerous labor-intensive, manually annotated sentences. This paper
presents a methodology and the Genie toolkit that can handle new compound
commands with significantly less manual effort. We advocate formalizing the
capability of virtual assistants with a Virtual Assistant Programming Language
(VAPL) and using a neural semantic parser to translate natural language into
VAPL code. Genie needs only a small realistic set of input sentences for
validating the neural model. Developers write templates to synthesize data;
Genie uses crowdsourced paraphrases and data augmentation, along with the
synthesized data, to train a semantic parser. We also propose design principles
that make VAPL languages amenable to natural language translation. We apply
these principles to revise ThingTalk, the language used by the Almond virtual
assistant. We use Genie to build the first semantic parser that can support
compound virtual assistant commands with unquoted free-form parameters. Genie
achieves 62% accuracy on realistic user inputs. We demonstrate Genie's
generality by showing a 19% and 31% improvement over the previous state of the
art on a music skill, aggregate functions, and access control.
Comment: To appear in PLDI 201
"Needless to Say My Proposal Was Turned Down": The Early Days of Commercial Citation Indexing, an "Error-making" Activity and Its Repercussions Till Today
In today's neoliberal audit cultures, university rankings and the quantitative evaluation of publications by JIF or of researchers by h-index are believed to be indispensable instruments for "quality assurance" in the sciences. Yet there is increasing resistance against "impactitis" and "evaluitis". Usually overlooked: trivial errors in Thomson Reuters' citation indexes produce severe non-trivial effects. Their victims are authors, institutions, and journals with names beyond the ASCII code, as well as scholars of the humanities and social sciences. Analysing the "Joshua Lederberg Papers", I want to show that the eventually successful "invention" of science citation indexing was a product of contingent factors. To overcome severe resistance, Eugene Garfield, the "father" of citation indexing, had to foster overoptimistic attitudes and to downplay the severe problems connected to global and multidisciplinary citation indexing. The difficulties of handling different formats of references and footnotes, non-Anglo-American names, and publications in non-English languages were known to the pioneers of citation indexing. Nowadays the huge for-profit North American media corporation Thomson Reuters is the owner of the citation databases founded by Garfield. Thomson Reuters' influence on funding decisions, individual careers, departments, universities, disciplines and countries is immense and ambivalent. Huge technological systems exhibit considerable inertia. This insight of technology studies is applicable to the large citation indexes by Thomson Reuters, too.
SWI-Prolog and the Web
Where Prolog is commonly seen as a component in a Web application that is
either embedded or communicates using a proprietary protocol, we propose an
architecture where Prolog communicates to other components in a Web application
using the standard HTTP protocol. By avoiding embedding in external Web servers,
development and deployment become much easier. To support this architecture, in
addition to the transfer protocol, we must also support parsing, representing
and generating the key Web document types such as HTML, XML and RDF.
This paper motivates the design decisions in the libraries and extensions to
Prolog for handling Web documents and protocols. The design has been guided by
the requirement to handle large documents efficiently. The described libraries
support a wide range of Web applications ranging from HTML and XML documents to
Semantic Web RDF processing.
To appear in Theory and Practice of Logic Programming (TPLP)
Comment: 31 pages, 24 figures and 2 tables.