
    Guidelines for multilingual linked data

    In this article, we argue that there is a growing number of linked datasets in different natural languages, and that there is a need for guidelines and mechanisms to ensure the quality and organic growth of this emerging multilingual data network. However, we have little knowledge regarding the actual state of this data network, its current practices, and the open challenges that it poses. Questions regarding the distribution of natural languages, the links established across data in different languages, or how linguistic features are represented remain mostly unanswered. Addressing these and other language-related issues can help to identify existing problems, to propose new mechanisms and guidelines or adapt those in use for publishing linked data with language-related features, and, ultimately, to provide metrics for evaluating quality. We therefore review, discuss, and extend current guidelines for publishing linked data, focusing on the methods, techniques, and tools that can help RDF publishers cope with language barriers. Whenever possible, we illustrate and discuss each of these guidelines, methods, and tools with practical examples that we have encountered in the publication of the datos.bne.es dataset.
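    One of the basic multilingual publishing practices discussed in work of this kind, language-tagged labels, can be sketched with the rdflib library. The resource URI and labels below are illustrative only, not taken from datos.bne.es.

    ```python
    # A minimal sketch (hypothetical URI and labels) of publishing
    # multilingual labels in RDF with language-tagged literals.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import RDFS

    g = Graph()
    resource = URIRef("http://example.org/resource/Quijote")  # illustrative URI

    # Language-tagged literals let one resource carry labels in several languages.
    g.add((resource, RDFS.label, Literal("Don Quixote", lang="en")))
    g.add((resource, RDFS.label, Literal("Don Quijote", lang="es")))

    # Consumers can then filter labels by language tag.
    spanish = [o for o in g.objects(resource, RDFS.label) if o.language == "es"]
    print(spanish[0])  # Don Quijote
    ```

    Keeping all lexicalizations on one URI, rather than minting per-language resources, is what makes such links usable across the languages of the data network.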

    Best practices for multilingual linked open data: a community effort

    The W3C Best Practices for Multilingual Linked Open Data community group was born one year ago, during the last MLW workshop in Rome. It continues to lead the efforts of a large community towards a shared view of the issues that multilingualism raises on the Web of Data and of their possible solutions. Despite our initial optimism, we have found the task of identifying best practices for ML-LOD a difficult one, requiring a deep understanding of the Web of Data in its multilingual dimension and in its practical problems. In this talk we review the progress of the group so far, mainly in the identification and analysis of topics, use cases, and design patterns, as well as the challenges ahead.

    An Approach to Publish Scientific Data of Open-Access Journals Using Linked Data Technologies

    The Semantic Web encourages digital libraries, including open access journals, to collect, link, and share their data across the Web so that machines and humans can process it more easily and obtain better queries and results. Linked Data technologies enable related data to be connected across the Web following the principles and recommendations set out by Tim Berners-Lee in 2006. Many universities produce knowledge through scholarship and research under open access policies and disseminate it in several ways. Open access journals collect, preserve, and publish scientific information in digital form for a particular academic discipline through a peer-review process, and they have great potential for exchanging and spreading their data when it is linked to external resources with Linked Data technologies, which enable better queries about the resources and their relationships. This paper reports a process for publishing scientific data on the Web using Linked Data technologies, together with methodological guidelines and their related activities. The proposed process was applied by extracting data from a university Open Journal Systems installation and publishing it through a SPARQL endpoint using the open source edition of OpenLink Virtuoso. Throughout this process, the use of open standards facilitates the creation, development, and exploitation of knowledge. This research has been partially supported by the Prometeo project of SENESCYT, Ecuadorian Government, and by CEDIA (Consorcio Ecuatoriano para el Desarrollo de Internet Avanzado) through the project “Platform for publishing library bibliographic resources using Linked Data technologies”.
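    As a rough sketch of the kind of pipeline the paper describes, the snippet below builds a tiny RDF graph of article metadata with rdflib and queries it with SPARQL. The URIs, fields, and values are hypothetical; a real deployment would expose the graph through an endpoint such as Virtuoso's rather than query it in memory.

    ```python
    # A minimal sketch (hypothetical data, not the paper's pipeline) of
    # representing article metadata as RDF and querying it with SPARQL.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    g = Graph()
    article = URIRef("http://example.org/article/1")  # illustrative identifier
    g.add((article, DC.title, Literal("Open Access and Linked Data")))
    g.add((article, DC.creator, Literal("J. Doe")))

    # The same query could be sent to a public SPARQL endpoint.
    results = g.query("""
        SELECT ?title ?creator WHERE {
            ?a <http://purl.org/dc/elements/1.1/title> ?title ;
               <http://purl.org/dc/elements/1.1/creator> ?creator .
        }
    """)
    for title, creator in results:
        print(title, creator)
    ```

    Using a standard vocabulary such as Dublin Core is what lets external consumers query the published records without knowing the journal's internal schema.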

    Multilingual Variation in the context of Linked Data

    In this paper we present a revisited classification of term variation in the light of the Linked Data initiative. Linked Data refers to a set of best practices for publishing and connecting structured data on the Web with the aim of transforming it into a global graph. One of the crucial steps of this initiative is the linking step, in which datasets in one or more languages need to be linked or connected with one another. We claim that the linking process would be facilitated if datasets were enriched with lexical and terminological information. With that final aim in mind, we propose a classification of lexical, terminological, and semantic variants that will become part of a model of linguistic descriptions currently being proposed within the framework of the W3C Ontology-Lexica Community Group to enrich ontologies and Linked Data vocabularies. Examples of modelling solutions for the different types of variants are also provided.
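    A highly simplified sketch of the underlying idea, two terminological variants in different languages both linked to one ontology concept, can be written with rdflib and the OntoLex namespace. All URIs are invented, and real OntoLex-Lemon routes written forms through `ontolex:Form` nodes, which is elided here for brevity.

    ```python
    # A simplified sketch (hypothetical URIs; real OntoLex-Lemon uses Form
    # nodes) of cross-lingual term variants denoting the same concept.
    from rdflib import Graph, Literal, Namespace, URIRef

    ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
    g = Graph()
    concept = URIRef("http://example.org/ontology/ImmigrationLaw")  # illustrative

    variants_in = [("immigration law", "en"), ("ley de extranjería", "es")]
    for i, (form, lang) in enumerate(variants_in):
        entry = URIRef(f"http://example.org/lexicon/entry{i}")
        g.add((entry, ONTOLEX.canonicalForm, Literal(form, lang=lang)))
        g.add((entry, ONTOLEX.denotes, concept))  # both entries denote one concept

    # All lexicalizations of the concept, across languages:
    variants = sorted(str(o) for s, p, o in g if p == ONTOLEX.canonicalForm)
    print(variants)
    ```

    Because both lexical entries denote the same concept, a linking tool can treat the English and Spanish terms as translation variants rather than unrelated labels.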

    Multilingual variation in the context of linked data

    Montiel-Ponsoda E, McCrae J, Aguado-de-Cea G, Gracia J. Multilingual variation in the context of linked data. In: Proceedings of the 10th International Conference on Terminology and Artificial Intelligence. 2013: 19-26.

    Multilingualität und Linked Data

    Cimiano P, Unger C. Multilingualität und Linked Data. In: Pellegrini T, Sack H, Auer S, eds. Linked Enterprise Data. Management und Bewirtschaftung vernetzter Unternehmensdaten mit Semantic Web Technologien. Berlin, Heidelberg: Springer; 2014: 153-175.

    Models to represent linguistic linked data

    As interest in linguistic linked data (LLD) keeps increasing in the Semantic Web and computational linguistics communities, and the number of contributions that dwell on LLD rapidly grows, scholars (and linguists in particular) interested in developing LLD resources sometimes find it difficult to determine which mechanism suits their needs and which challenges have already been addressed. This review presents the state of the art on the models, ontologies, and their extensions used to represent language resources as LLD, focusing on the nature of the linguistic content they aim to encode. Four basic groups of models are distinguished: models that represent the main elements of lexical resources (group 1); vocabularies developed as extensions to models in group 1 and ontologies that provide more granularity on specific levels of linguistic analysis (group 2); catalogues of linguistic data categories (group 3); and other models, such as corpus models or service-oriented ones (group 4). The contributions in these four groups are described, highlighting their reuse by the community and the modelling challenges still to be faced.

    Knowledge Portability with Semantic Expansion of Ontology Labels

    Our research focuses on the multilingual enhancement of ontologies that, often represented only in English, need to be translated into other languages to enable knowledge access across languages. Ontology translation is a rather different task from classic document translation, because ontologies contain highly specific vocabulary and lack contextual information. For these reasons, to improve automatic ontology translation, we first identify relevant unambiguous and domain-specific sentences in a large set of generic parallel corpora. We then leverage Linked Open Data resources, such as DBpedia, to isolate ontology-specific bilingual lexical knowledge. In both cases, we exploit the semantic information of the labels to select relevant bilingual data, with the aim of building an ontology-specific statistical machine translation system. We evaluate our approach on the translation of a medical ontology from English into German. Our experiments show a significant improvement of around 3 BLEU points over both a generic and a domain-specific translation approach.
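    The first step, selecting domain-relevant sentence pairs from a generic parallel corpus by their overlap with ontology-label vocabulary, can be illustrated with a toy sketch. This is not the authors' system; the labels, sentences, and the one-term threshold are all invented.

    ```python
    # A hypothetical sketch (invented data and threshold) of filtering a
    # parallel corpus down to sentence pairs that mention ontology vocabulary.
    ontology_labels = {"myocardial infarction", "heart", "aorta"}  # toy labels
    label_terms = {t for label in ontology_labels for t in label.split()}

    corpus = [
        ("the heart pumps blood through the aorta",
         "das Herz pumpt Blut durch die Aorta"),
        ("the weather is nice today", "das Wetter ist heute schön"),
    ]

    def relevant(pair, terms=label_terms, min_hits=1):
        """Keep a pair if its English side shares enough terms with the labels."""
        return len(set(pair[0].split()) & terms) >= min_hits

    domain_corpus = [p for p in corpus if relevant(p)]
    print(len(domain_corpus))  # 1
    ```

    Training a statistical MT system on the filtered subset biases it towards the ontology's vocabulary while discarding generic sentences that add noise.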

    Induction de lexiques bilingues à partir de corpus comparables et parallèles

    Statistical models try to generalize knowledge from the frequency of probabilistic events in the data. When more data is available, events are observed more often and the models perform better. Natural Language Processing approaches based on these models therefore depend on the availability and quantity of resources, so there is a permanent need to generate and update the training data. This dependency particularly affects Statistical Machine Translation, which additionally requires multilingual resources. This thesis gathers four articles on two tasks that contribute directly to this dependency: Bilingual Document Alignment (BDA) and Bilingual Lexicon Induction (BLI). The first publication describes the system submitted to the BDA shared task of the WMT16 conference. Built on a search engine, our system indexes bilingual websites and tries to identify the English-French pages that are translations of one another. The alignment relies on a bag-of-words representation and a bilingual lexicon. The tool we developed allowed us to evaluate more than 1,000 configurations and to identify one that yields respectable performance on the task. The three other articles concern the BLI task. The first revisits the so-called standard approach and proposes a broad exploration of its parameters in the Semantic Web context. The second article compares the standard approach with more recent techniques based on cross-lingual word representations (embeddings) produced by neural networks. The last contribution reports improved overall performance on the task by combining, through supervised reranking, the outputs of the two types of approaches studied.
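    The "standard approach" to bilingual lexicon induction can be illustrated with a toy sketch (all words and counts below are invented): each word is represented by a vector of context-word counts, the source vector is projected into the target language through a seed lexicon, and candidate translations are ranked by cosine similarity.

    ```python
    # A toy sketch (invented counts) of the standard BLI approach:
    # context vectors + seed-lexicon projection + cosine ranking.
    import math

    seed = {"drinks": "boit", "red": "rouge"}  # tiny seed bilingual lexicon

    # Toy co-occurrence counts (word -> context word -> count) per language.
    en_ctx = {"wine": {"drinks": 4, "red": 3}}
    fr_ctx = {"vin": {"boit": 5, "rouge": 2}, "train": {"rail": 6}}

    def cosine(u, v):
        shared = set(u) & set(v)
        num = sum(u[k] * v[k] for k in shared)
        den = (math.sqrt(sum(x * x for x in u.values()))
               * math.sqrt(sum(x * x for x in v.values())))
        return num / den if den else 0.0

    def induce(word):
        """Rank candidate translations of a source word by context similarity."""
        # Project the English context vector into French via the seed lexicon.
        projected = {seed[c]: n for c, n in en_ctx[word].items() if c in seed}
        scores = {cand: cosine(projected, ctx) for cand, ctx in fr_ctx.items()}
        return max(scores, key=scores.get)

    print(induce("wine"))  # vin
    ```

    Embedding-based methods replace the sparse count vectors with dense learned representations, but the projection-and-rank structure stays the same, which is why the two families of outputs can be combined by a supervised reranker.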