Recovering Grammar Relationships for the Java Language Specification
Grammar convergence is a method that helps discover relationships between
different grammars of the same language or different language versions. The key
element of the method is the operational, transformation-based representation
of those relationships. Given input grammars for convergence, they are
transformed until they are structurally equal. The transformations are composed
from primitive operators; properties of these operators and the composed chains
provide quantitative and qualitative insight into the relationships between the
grammars at hand. We describe a refined method for grammar convergence, and we
use it in a major study, where we recover the relationships between all the
grammars that occur in the different versions of the Java Language
Specification (JLS). The relationships are represented as grammar
transformation chains that capture all accidental or intended differences
between the JLS grammars. This method is mechanized and driven by nominal and
structural differences between pairs of grammars that are subject to
asymmetric, binary convergence steps. We present the underlying operator suite
for grammar transformation in detail, and we illustrate the suite with many
examples of transformations on the JLS grammars. We also describe the
extraction effort, which was needed to make the JLS grammars amenable to
automated processing. We include substantial metadata about the convergence
process for the JLS so that the effort becomes reproducible and transparent.
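The convergence idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy grammar representation (a dict from nonterminal to a set of right-hand sides) and two hypothetical operators, `rename` and `add_production`; these names and the example grammars are illustrative and are not taken from the paper's operator suite.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Toy representation (an assumption): nonterminal -> set of right-hand sides.
Grammar = Dict[str, FrozenSet[Tuple[str, ...]]]
Operator = Callable[[Grammar], Grammar]


def rename(old: str, new: str) -> Operator:
    """Hypothetical operator resolving a nominal difference."""
    def apply(g: Grammar) -> Grammar:
        def swap(symbol: str) -> str:
            return new if symbol == old else symbol
        return {
            swap(nt): frozenset(tuple(swap(s) for s in rhs) for rhs in rhss)
            for nt, rhss in g.items()
        }
    return apply


def add_production(nt: str, rhs: Tuple[str, ...]) -> Operator:
    """Hypothetical operator resolving a structural difference."""
    def apply(g: Grammar) -> Grammar:
        out = dict(g)
        out[nt] = out.get(nt, frozenset()) | {rhs}
        return out
    return apply


def run_chain(g: Grammar, chain: List[Operator]) -> Grammar:
    """Apply a transformation chain, one primitive operator at a time."""
    for op in chain:
        g = op(g)
    return g


# Two made-up grammars that differ in one name and one production.
g1: Grammar = {
    "Expr": frozenset({("Term",), ("Expr", "+", "Term")}),
    "Term": frozenset({("id",)}),
}
g2: Grammar = {
    "Expression": frozenset({("Term",), ("Expression", "+", "Term")}),
    "Term": frozenset({("id",), ("num",)}),
}

# Asymmetric step: g1 is transformed towards g2, which is left untouched.
chain = [rename("Expr", "Expression"), add_production("Term", ("num",))]
converged = run_chain(g1, chain)
```

In this sketch the chain itself is the recorded relationship: its length and the kinds of operators it contains give a rough measure of how far apart the two grammars are.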
Automatic Accuracy Prediction for AMR Parsing
Abstract Meaning Representation (AMR) represents sentences as directed,
acyclic and rooted graphs, aiming at capturing their meaning in a machine
readable format. AMR parsing converts natural language sentences into such
graphs. However, evaluating a parser on new data by means of comparison to
manually created AMR graphs is very costly. Also, we would like to be able to
detect parses of questionable quality, or to prefer the results of alternative
systems by selecting the ones for which we can assess good quality. We propose
AMR accuracy prediction as the task of predicting several metrics of
correctness for an automatically generated AMR parse in the absence of the
corresponding gold parse. We develop a neural end-to-end multi-output
regression model and perform three case studies: firstly, we evaluate the
model's capacity to predict AMR parse accuracies and test whether it can
reliably assign high scores to gold parses. Secondly, we perform parse
selection based on predicted parse accuracies of candidate parses from
alternative systems, with the aim of improving overall results. Finally, we
predict system ranks for submissions from two AMR shared tasks on the basis of
their predicted parse accuracy averages. All experiments are carried out across
two different domains and show that our method is effective. Comment: accepted at *SEM 201
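The second and third case studies in the abstract above (parse selection and system ranking) can be sketched as follows. The sketch assumes that predicted accuracies are already available; the system names and scores are made up, and the neural regression model that would produce them is not reproduced here.

```python
from statistics import mean
from typing import Dict, List

# Made-up predicted accuracies: predicted[system][i] is the predicted
# accuracy of that system's parse for sentence i.
predicted: Dict[str, List[float]] = {
    "system_a": [0.5, 0.75, 0.25],
    "system_b": [0.75, 0.5, 0.5],
}


def select_parses(scores: Dict[str, List[float]]) -> List[str]:
    """For each sentence, pick the system with the highest predicted accuracy."""
    n_sentences = len(next(iter(scores.values())))
    return [
        max(scores, key=lambda system: scores[system][i])
        for i in range(n_sentences)
    ]


def rank_systems(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by their average predicted accuracy, best first."""
    return sorted(scores, key=lambda system: mean(scores[system]), reverse=True)


selection = select_parses(predicted)
ranking = rank_systems(predicted)
```

Neither function needs a gold parse, which is the point of the task: both selection and ranking rely on predicted scores only.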
How FAIR can you get? Image Retrieval as a Use Case to calculate FAIR Metrics
A large number of services for research data management strive to adhere to
the FAIR guiding principles for scientific data management and stewardship. To
evaluate these services and to indicate possible improvements, use-case-centric
metrics are needed as an addendum to existing metric frameworks. The retrieval
of spatially and temporally annotated images can exemplify such a use case. The
prototypical implementation indicates that currently no research data
repository achieves the full score. Suggestions on how to increase the score
include automatic annotation based on the metadata inside the image file and
support for content negotiation to retrieve the images. These and other
insights can lead to an improvement of data integration workflows, resulting in
a better and more FAIR approach to manage research data. Comment: This is a preprint for a paper accepted for the 2018 IEEE conferenc
A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity
Ortholog detection (OD) is a critical step for comparative genomic analysis
of protein-coding sequences. In this paper, we begin with a comprehensive
comparison of four popular, methodologically diverse OD methods: MultiParanoid,
Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to
significantly outperform one another 12-30% of the time. This high
complementarity motivates the presentation of the first tool for integrating
methodologically diverse OD methods. We term this program MOSAIC, or Multiple
Orthologous Sequence Analysis and Integration by Cluster optimization. Relative
to component and competing methods, we demonstrate that MOSAIC more than
quintuples the number of alignments for which all species are present, while
simultaneously maintaining or improving functional-, phylogenetic-, and
sequence identity-based measures of ortholog quality. Further, we demonstrate
that this improvement in alignment quality yields 40-280% more confidently
aligned sites. Combined, these factors translate to higher estimated levels of
overall conservation, while at the same time allowing for the detection of up
to 180% more positively selected sites. MOSAIC is available as a Python package.
MOSAIC alignments, source code, and full documentation are available at
http://pythonhosted.org/bio-MOSAIC
Prediction of Metabolic Pathways Involvement in Prokaryotic UniProtKB Data by Association Rule Mining
The widening gap between known proteins and their functions has encouraged
the development of methods to automatically infer annotations. Automatic
functional annotation of proteins is expected to meet the conflicting
requirements of maximizing annotation coverage, while minimizing erroneous
functional assignments. This trade-off imposes a great challenge in designing
intelligent systems to tackle the problem of automatic protein annotation. In
this work, we present a system that utilizes rule mining techniques to predict
metabolic pathways in prokaryotes. The resulting knowledge represents
predictive models that assign pathway involvement to UniProtKB entries. We
carried out an evaluation study of our system's performance using
cross-validation. We found that it achieved very promising results in
pathway identification with an F1-measure of 0.982 and an AUC of 0.987. Our
prediction models were then successfully applied to 6.2 million
UniProtKB/TrEMBL reference proteome entries of prokaryotes. As a result,
663,724 entries were covered, where 436,510 of them lacked any previous pathway
annotations.
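As a reading aid for the figures in the abstract above, the following sketch shows the standard F1 definition (the harmonic mean of precision and recall) and the share of covered entries that had no previous pathway annotation. The function and variable names are illustrative, and the abstract does not report the precision and recall behind its F1 value.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Counts quoted in the abstract.
covered_entries = 663_724
previously_unannotated = 436_510

# Fraction of covered entries that gained their first pathway annotation.
share_newly_annotated = previously_unannotated / covered_entries
```

By this arithmetic, roughly two thirds of the covered entries had no pathway annotation before the models were applied.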
An ontology to standardize research output of nutritional epidemiology : from paper-based standards to linked content
Background: The use of linked data in the Semantic Web is a promising approach to add value to nutrition research. An ontology, which defines the logical relationships between well-defined taxonomic terms, enables linking and harmonizing research output. To enable the description of domain-specific output in nutritional epidemiology, we propose the Ontology for Nutritional Epidemiology (ONE) according to authoritative guidance for nutritional epidemiology.
Methods: Firstly, a scoping review was conducted to identify existing ontology terms for reuse in ONE. Secondly, existing data standards and reporting guidelines for nutritional epidemiology were converted into an ontology. The terms used in the standards were summarized and listed separately in a taxonomic hierarchy. Thirdly, the ontologies of the nutritional epidemiologic standards, reporting guidelines, and the core concepts were gathered in ONE. Three case studies were included to illustrate potential applications: (i) annotation of existing manuscripts and data, (ii) ontology-based inference, and (iii) estimation of reporting completeness in a sample of nine manuscripts.
Results: Ontologies for food and nutrition (n = 37), disease and specific population (n = 100), data description (n = 21), research description (n = 35), and supplementary (meta) data description (n = 44) were reviewed and listed. ONE consists of 339 classes: 79 new classes to describe data and 24 new classes to describe the content of manuscripts.
Conclusion: ONE is a resource to automate data integration, searching, and browsing, and can be used to assess reporting completeness in nutritional epidemiology.