Building a Disciplinary, World-Wide Data Infrastructure
Sharing scientific data, with the objective of making it fully discoverable,
accessible, assessable, intelligible, usable, and interoperable, requires work
at the disciplinary level to define in particular how the data should be
formatted and described. Each discipline has its own organization and history
as a starting point, and this paper explores the way a range of disciplines,
namely materials science, crystallography, astronomy, earth sciences,
humanities and linguistics get organized at the international level to tackle
this question. In each case, the disciplinary culture with respect to data
sharing, science drivers, organization and lessons learnt are briefly
described, as well as the elements of the specific data infrastructure which
are or could be shared with others. Commonalities and differences are assessed.
Common key elements for success are identified: data sharing should be science
driven; defining the disciplinary part of the interdisciplinary standards is
mandatory but challenging; sharing of applications should accompany data
sharing. Incentives such as journal and funding agency requirements are also
similar. For all, it also appears that social aspects are more challenging than
technological ones. Governance is more diverse, and linked to the discipline
organization. CODATA, the RDA and the WDS can facilitate the establishment of
disciplinary interoperability frameworks. Being problem-driven is also a key
factor of success for building bridges to enable interdisciplinary research.
Comment: Proceedings of the session "Building a disciplinary, world-wide data
infrastructure" of SciDataCon 2016, held in Denver, CO, USA, 12-14 September
2016, to be published in ICSU CODATA Data Science Journal in 201
A short survey of discourse representation models
With the advancement of technology and the wide adoption of ontologies as knowledge representation formats, in the last decade, a handful of models were proposed for the externalization of the rhetoric and argumentation captured within scientific publications. Conceptually, most of these models share a similar representation form of the scientific publication, i.e. as a series of interconnected elementary knowledge items. The main differences are given by the terminology used, the types of rhetorical and/or argumentation relations connecting the knowledge items and the foundational theories supporting these relations. This paper analyzes the state of the art and provides a concise comparative overview of the five most prominent discourse representation models, with the goal of sketching a unified model for discourse representation
A cross-linguistic database of phonetic transcription systems
Contrary to what non-practitioners might expect, the systems of phonetic notation used by linguists are highly idiosyncratic. Not only do various linguistic subfields disagree on the specific symbols they use to denote the speech sounds of languages, but also in large databases of sound inventories considerable variation can be found. Inspired by recent efforts to link cross-linguistic data with help of reference catalogues (Glottolog, Concepticon) across different resources, we present initial efforts to link different phonetic notation systems to a catalogue of speech sounds. This is achieved with the help of a database accompanied by a software framework that uses a limited but easily extendable set of non-binary feature values to allow for quick and convenient registration of different transcription systems, while at the same time linking to additional datasets with restricted inventories. Linking different transcription systems enables us to conveniently translate between different phonetic transcription systems, while linking sounds to databases allows users quick access to various kinds of metadata, including feature values, statistics on phoneme inventories, and information on prosody and sound classes. In order to prove the feasibility of this enterprise, we supplement an initial version of our cross-linguistic database of phonetic transcription systems (CLTS), which currently registers five transcription systems and links to fifteen datasets, as well as a web application, which permits users to conveniently test the power of the automatic translation across transcription systems
Language resources and linked data: a practical perspective
Recently, experts and practitioners in language resources
have started recognizing the benefits of the linked data (LD) paradigm
for the representation and exploitation of linguistic data on the Web.
The adoption of the LD principles is leading to an emerging ecosystem of
multilingual open resources that conform to the Linguistic Linked Open
Data Cloud, in which datasets of linguistic data are interconnected and
represented following common vocabularies, which facilitates linguistic
information discovery, integration and access. In order to contribute to
this initiative, this paper summarizes several key aspects of the representation
of linguistic information as linked data from a practical perspective.
The main goal of this document is to provide the basic ideas and
tools for migrating language resources (lexicons, corpora, etc.) as LD on
the Web and to develop some useful NLP tasks with them (e.g., word
sense disambiguation). Such material was the basis of a tutorial imparted
at the EKAW'14 conference, which is also reported in the paper
Ontology: A Linked Data Hub for Mathematics
In this paper, we present an ontology of mathematical knowledge concepts that
covers a wide range of the fields of mathematics and introduces a balanced
representation between comprehensive and sensible models. We demonstrate the
applications of this representation in information extraction, semantic search,
and education. We argue that the ontology can be a core of future integration
of math-aware data sets in the Web of Data and, therefore, provide mappings
onto relevant datasets, such as DBpedia and ScienceWISE.
Comment: 15 pages, 6 images, 1 table, Knowledge Engineering and the Semantic
Web - 5th International Conferenc
Interoperability of language-related information: mapping the BLL Thesaurus to Lexvo and Glottolog
Since 2013, the thesaurus of the Bibliography of Linguistic Literature (BLL Thesaurus) has been applied in the context of the Linguistik portal, a hub for linguistically relevant information. Several consecutive projects focus on the modeling of the BLL Thesaurus as an ontology and its linking to terminological repositories in the Linguistic Linked Open Data (LLOD) cloud. Those mappings facilitate the connection between the Linguistik portal and the cloud. In the paper, we describe the current efforts to establish interoperability between the language-related index terms and repositories providing language identifiers for the web of Linked Data. After an introduction of Lexvo and Glottolog, we outline the scope, the structure, and the peculiarities of the BLL Thesaurus. We discuss the challenges for the design of scientifically plausible language classification and the linking between divergent classifications. We describe the prototype of the linking model and propose pragmatic solutions for structural or conceptual conflicts. Additionally, we depict the benefits from the envisaged interoperability, both for the Linguistik portal and for the Linked Open Data Community in general
How FAIR are CMC Corpora?
In recent years, research data management has also become an important topic in the less data-intensive areas of the Social Sciences and Humanities (SSH). Funding agencies as well as research communities demand that empirical data collected and used for scientific research be managed and preserved in such a way that research results are reproducible. To account for this, the FAIR guiding principles for data stewardship have been established as a framework for good data management, aiming at the findability, accessibility, interoperability, and reusability of research data. This article investigates 24 European CMC corpora with regard to their compliance with the FAIR principles and discusses to what extent the deposit of research data in repositories of data preservation initiatives such as CLARIN, Zenodo or Metashare can assist in the provision of FAIR corpora
Models to represent linguistic linked data
As the interest of the Semantic Web and computational linguistics communities in linguistic linked data (LLD) keeps increasing and the number of contributions that dwell on LLD rapidly grows, scholars (and linguists in particular) interested in the development of LLD resources sometimes find it difficult to determine which mechanism is suitable for their needs and which challenges have already been addressed. This review seeks to present the state of the art on the models, ontologies and their extensions to represent language resources as LLD by focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished in this work: models to represent the main elements of lexical resources (group 1), vocabularies developed as extensions to models in group 1 and ontologies that provide more granularity on specific levels of linguistic analysis (group 2), catalogues of linguistic data categories (group 3) and other models such as corpora models or service-oriented ones (group 4). Contributions encompassed in these four groups are described, highlighting their reuse by the community and the modelling challenges that are still to be faced