20,630 research outputs found
Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development
Annotation graphs and annotation servers offer infrastructure to support the
analysis of human language resources in the form of time-series data such as
text, audio and video. This paper outlines areas of common need among empirical
linguists and computational linguists. After reviewing examples of data and
tools used or under development for each of several areas, it proposes a common
framework for future tool development, data annotation and resource sharing
based upon annotation graphs and servers.Comment: 8 pages, 6 figure
Identifying Data Sharing in Biomedical Literature
Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.

Automation and hypermedia technology applications
This paper represents a progress report on HyLite (Hypermedia Library technology): a research and development activity to produce a versatile system as part of NASA's technology thrusts in automation, information sciences, and communications. HyLite can be used as a system or tool to facilitate the creation and maintenance of large distributed electronic libraries. The contents of such a library may be software components, hardware parts or designs, scientific data sets or databases, configuration management information, etc. Proliferation of computer use has made the diversity and quantity of information too large for any single user to sort, process, and utilize effectively. In response to this information deluge, we have created HyLite to enable the user to process relevant information into a more efficient organization for presentation, retrieval, and readability. To accomplish this end, we have incorporated various AI techniques into the HyLite hypermedia engine to facilitate parameters and properties of the system. The proposed techniques include intelligent searching tools for the libraries, intelligent retrievals, and navigational assistance based on user histories. HyLite itself is based on an earlier project, the Encyclopedia of Software Components (ESC) which used hypermedia to facilitate and encourage software reuse
Next generation software environments : principles, problems, and research directions
The past decade has seen a burgeoning of research and development in software environments. Conferences have been devoted to the topic of practical environments, journal papers produced, and commercial systems sold. Given all the activity, one might expect a great deal of consensus on issues, approaches, and techniques. This is not the case, however. Indeed, the term "environment" is still used in a variety of conflicting ways. Nevertheless substantial progress has been made and we are at least nearing consensus on many critical issues.The purpose of this paper is to characterize environments, describe several important principles that have emerged in the last decade or so, note current open problems, and describe some approaches to these problems, with particular emphasis on the activities of one large-scale research program, the Arcadia project. Consideration is also given to two related topics: empirical evaluation and technology transition. That is, how can environments and their constituents be evaluated, and how can new developments be moved effectively into the production sector
Working Notes from the 1992 AAAI Workshop on Automating Software Design. Theme: Domain Specific Software Design
The goal of this workshop is to identify different architectural approaches to building domain-specific software design systems and to explore issues unique to domain-specific (vs. general-purpose) software design. Some general issues that cut across the particular software design domain include: (1) knowledge representation, acquisition, and maintenance; (2) specialized software design techniques; and (3) user interaction and user interface
Interoperability and FAIRness through a novel combination of Web technologies
Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein data) to those that are general-purpose (such as FigShare, Zenodo, Dataverse or EUDAT). These data have widely different levels of sensitivity and security considerations. For example, clinical observations about genetic mutations in patients are highly sensitive, while observations of species diversity are generally not. The lack of uniformity in data models from one repository to another, and in the richness and availability of metadata descriptions, makes integration and analysis of these data a manual, time-consuming task with no scalability. Here we explore a set of resource-oriented Web design patterns for data discovery, accessibility, transformation, and integration that can be implemented by any general- or special-purpose repository as a means to assist users in finding and reusing their data holdings. We show that by using off-the-shelf technologies, interoperability can be achieved atthe level of an individual spreadsheet cell. We note that the behaviours of this architecture compare favourably to the desiderata defined by the FAIR Data Principles, and can therefore represent an exemplar implementation of those principles. The proposed interoperability design patterns may be used to improve discovery and integration of both new and legacy data, maximizing the utility of all scholarly outputs
Recommended from our members
BioC: a minimalist approach to interoperability for biomedical text processing
A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net
Towards Exascale Scientific Metadata Management
Advances in technology and computing hardware are enabling scientists from
all areas of science to produce massive amounts of data using large-scale
simulations or observational facilities. In this era of data deluge, effective
coordination between the data production and the analysis phases hinges on the
availability of metadata that describe the scientific datasets. Existing
workflow engines have been capturing a limited form of metadata to provide
provenance information about the identity and lineage of the data. However,
much of the data produced by simulations, experiments, and analyses still need
to be annotated manually in an ad hoc manner by domain scientists. Systematic
and transparent acquisition of rich metadata becomes a crucial prerequisite to
sustain and accelerate the pace of scientific innovation. Yet, ubiquitous and
domain-agnostic metadata management infrastructure that can meet the demands of
extreme-scale science is notable by its absence.
To address this gap in scientific data management research and practice, we
present our vision for an integrated approach that (1) automatically captures
and manipulates information-rich metadata while the data is being produced or
analyzed and (2) stores metadata within each dataset to permeate
metadata-oblivious processes and to query metadata through established and
standardized data access interfaces. We motivate the need for the proposed
integrated approach using applications from plasma physics, climate modeling
and neuroscience, and then discuss research challenges and possible solutions
- …