3,253 research outputs found

    A document management methodology based on similarity contents

    Get PDF
    The advent of the WWW and distributed information systems have made it possible to share documents between different users and organisations. However, this has created many problems related to the security, accessibility, right and most importantly the consistency of documents. It is important that the people involved in the documents management process have access to the most up-to-date version of documents, retrieve the correct documents and should be able to update the documents repository in such a way that his or her document are known to others. In this paper we propose a method for organising, storing and retrieving documents based on similarity contents. The method uses techniques based on information retrieval, document indexation and term extraction and indexing. This methodology is developed for the E-Cognos project which aims at developing tools for the management and sharing of documents in the construction domain

    Cross-concordances: terminology mapping and its effectiveness for information retrieval

    Get PDF
    The German Federal Ministry for Education and Research funded a major terminology mapping initiative, which found its conclusion in 2007. The task of this terminology mapping initiative was to organize, create and manage 'cross-concordances' between controlled vocabularies (thesauri, classification systems, subject heading lists) centred around the social sciences but quickly extending to other subject areas. 64 crosswalks with more than 500,000 relations were established. In the final phase of the project, a major evaluation effort to test and measure the effectiveness of the vocabulary mappings in an information system environment was conducted. The paper reports on the cross-concordance work and evaluation results.Comment: 19 pages, 4 figures, 11 tables, IFLA conference 200

    An effective, low-cost measure of semantic relatedness obtained from Wikipedia links

    Get PDF
    This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide structured world knowledge about the terms of interest. Out approach is unique in that it does so using the hyperlink structure of Wikipedia rather than its category hierarchy or textual content. Evaluation with manually defined measures of semantic relatedness reveals this to be an effective compromise between the ease of computation of the former approach and the accuracy of the latter

    From Word to Sense Embeddings: A Survey on Vector Representations of Meaning

    Get PDF
    Over the past years, distributed semantic representations have proved to be effective and flexible keepers of prior knowledge to be integrated into downstream applications. This survey focuses on the representation of meaning. We start from the theoretical background behind word vector space models and highlight one of their major limitations: the meaning conflation deficiency, which arises from representing a word with all its possible meanings as a single vector. Then, we explain how this deficiency can be addressed through a transition from the word level to the more fine-grained level of word senses (in its broader acceptation) as a method for modelling unambiguous lexical meaning. We present a comprehensive overview of the wide range of techniques in the two main branches of sense representation, i.e., unsupervised and knowledge-based. Finally, this survey covers the main evaluation procedures and applications for this type of representation, and provides an analysis of four of its important aspects: interpretability, sense granularity, adaptability to different domains and compositionality.Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Researc

    TatWordNet: A Linguistic Linked Open Data-Integrated WordNet Resource for Tatar

    Get PDF
    We present the first release of TatWordNet (http://wordnet.tatar), a wordnet resource for Tatar. TatWordNet has been constructed by the combination of the expand and the merge approaches. The synsets of TatWordNet have been compiled by: (i) the automatic conversion of concepts of TatThes, a socio-political Tatar; (ii) semi-automatic translation of synsets of RuWordNet, a wordnet resource for Russian with the followed manual verification and correction; (iii) manual translation of base RuWordNet synsets; (iv) and manual translation of the all hypernyms of the previously translated RuWordNet synsets. The currents version of TatWordNet contains 18,583 synsets, 36,540 lexical entries and 49,525 senses. The resource has been published to the Linguistic Linked Open Data cloud and interlinked with the Global WordNet Grid

    Content Enrichment of Digital Libraries: Methods, Technologies and Implementations

    Get PDF
    Parallel to the establishment of the concept of a "digital library", there have been rapid developments in the fields of semantic technologies, information retrieval and artificial intelligence. The idea is to use make use of these three fields to crosslink bibliographic data, i.e., library content, and to enrich it "intelligently" with additional, especially non-library, information. By linking the contents of a library, it is possible to offer users access to semantically similar contents of different digital libraries. For instance, a list of semantically similar publications from completely different subject areas and from different digital libraries can be made accessible. In addition, the user is able to see a wider profile about authors, enriched with information such as biographical details, name alternatives, images, job titles, institute affiliations, etc. This information comes from a wide variety of sources, most of which are not library sources. In order to make such scenarios a reality, this dissertation follows two approaches. The first approach is about crosslinking digital library content in order to offer semantically similar publications based on additional information for a publication. Hence, this approach uses publication-related metadata as a basis. The aligned terms between linked open data repositories/thesauri are considered as an important starting point by considering narrower, broader and related concepts through semantic data models such as SKOS. Information retrieval methods are applied to identify publications with high semantic similarity. For this purpose, approaches of vector space models and "word embedding" are applied and analyzed comparatively. The analyses are performed in digital libraries with different thematic focuses (e.g. economy and agriculture). Using machine learning techniques, metadata is enriched, e.g. with synonyms for content keywords, in order to further improve similarity calculations. To ensure quality, the proposed approaches will be analyzed comparatively with different metadata sets, which will be assessed by experts. Through the combination of different information retrieval methods, the quality of the results can be further improved. This is especially true when user interactions offer possibilities for adjusting the search properties. In the second approach, which this dissertation pursues, author-related data are harvested in order to generate a comprehensive author profile for a digital library. For this purpose, non-library sources, such as linked data repositories (e.g. WIKIDATA) and library sources, such as authority data, are used. If such different sources are used, the disambiguation of author names via the use of already existing persistent identifiers becomes necessary. To this end, we offer an algorithmic approach to disambiguate authors, which makes use of authority data such as the Virtual International Authority File (VIAF). Referring to computer sciences, the methodological value of this dissertation lies in the combination of semantic technologies with methods of information retrieval and artificial intelligence to increase the interoperability between digital libraries and between libraries with non-library sources. By positioning this dissertation as an application-oriented contribution to improve the interoperability, two major contributions are made in the context of digital libraries: (1) The retrieval of information from different Digital Libraries can be made possible via a single access. (2) Existing information about authors is collected from different sources and aggregated into one author profile.Parallel zur Etablierung des Konzepts einer „Digitalen Bibliothek“ gab es rasante Weiterentwicklungen in den Bereichen semantischer Technologien, Information Retrieval und künstliche Intelligenz. Die Idee ist es, mit ihrer Hilfe bibliographische Daten, also Inhalte von Bibliotheken, miteinander zu vernetzen und „intelligent“ mit zusätzlichen, insbesondere nicht-bibliothekarischen Informationen anzureichern. Durch die Verknüpfung von Inhalten einer Bibliothek wird es möglich, einen Zugang für Benutzer*innen anzubieten, über den semantisch ähnliche Inhalte unterschiedlicher Digitaler Bibliotheken zugänglich werden. Beispielsweise können hierüber ausgehend von einer bestimmten Publikation eine Liste semantisch ähnlicher Publikationen ggf. aus völlig unterschiedlichen Themenfeldern und aus verschiedenen digitalen Bibliotheken zugänglich gemacht werden. Darüber hinaus können sich Nutzer*innen ein breiteres Autoren-Profil anzeigen lassen, das mit Informationen wie biographischen Angaben, Namensalternativen, Bildern, Berufsbezeichnung, Instituts-Zugehörigkeiten usw. angereichert ist. Diese Informationen kommen aus unterschiedlichsten und in der Regel nicht-bibliothekarischen Quellen. Um derartige Szenarien Realität werden zu lassen, verfolgt diese Dissertation zwei Ansätze. Der erste Ansatz befasst sich mit der Vernetzung von Inhalten Digitaler Bibliotheken, um auf Basis zusätzlicher Informationen für eine Publikation semantisch ähnliche Publikationen anzubieten. Dieser Ansatz verwendet publikationsbezogene Metadaten als Grundlage. Die verknüpften Begriffe zwischen verlinkten offenen Datenrepositorien/Thesauri werden als wichtiger Angelpunkt betrachtet, indem Unterbegriffe, Oberbegriffe und verwandten Konzepte über semantische Datenmodelle, wie SKOS, berücksichtigt werden. Methoden des Information Retrieval werden angewandt, um v.a. Publikationen mit hoher semantischer Verwandtschaft zu identifizieren. Zu diesem Zweck werden Ansätze des Vektorraummodells und des „Word Embedding“ eingesetzt und vergleichend analysiert. Die Analysen werden in Digitalen Bibliotheken mit unterschiedlichen thematischen Schwerpunkten (z.B. Wirtschaft und Landwirtschaft) durchgeführt. Durch Techniken des maschinellen Lernens werden hierfür Metadaten angereichert, z.B. mit Synonymen für inhaltliche Schlagwörter, um so Ähnlichkeitsberechnungen weiter zu verbessern. Zur Sicherstellung der Qualität werden die beiden Ansätze mit verschiedenen Metadatensätzen vergleichend analysiert wobei die Beurteilung durch Expert*innen erfolgt. Durch die Verknüpfung verschiedener Methoden des Information Retrieval kann die Qualität der Ergebnisse weiter verbessert werden. Dies trifft insbesondere auch dann zu wenn Benutzerinteraktion Möglichkeiten zur Anpassung der Sucheigenschaften bieten. Im zweiten Ansatz, den diese Dissertation verfolgt, werden autorenbezogene Daten gesammelt, verbunden mit dem Ziel, ein umfassendes Autorenprofil für eine Digitale Bibliothek zu generieren. Für diesen Zweck kommen sowohl nicht-bibliothekarische Quellen, wie Linked Data-Repositorien (z.B. WIKIDATA) und als auch bibliothekarische Quellen, wie Normdatensysteme, zum Einsatz. Wenn solch unterschiedliche Quellen genutzt werden, wird die Disambiguierung von Autorennamen über die Nutzung bereits vorhandener persistenter Identifikatoren erforderlich. Hierfür bietet sich ein algorithmischer Ansatz für die Disambiguierung von Autoren an, der Normdaten, wie die des Virtual International Authority File (VIAF) nachnutzt. Mit Bezug zur Informatik liegt der methodische Wert dieser Dissertation in der Kombination von semantischen Technologien mit Verfahren des Information Retrievals und der künstlichen Intelligenz zur Erhöhung von Interoperabilität zwischen Digitalen Bibliotheken und zwischen Bibliotheken und nicht-bibliothekarischen Quellen. Mit der Positionierung dieser Dissertation als anwendungsorientierter Beitrag zur Verbesserung von Interoperabilität werden zwei wesentliche Beiträge im Kontext Digitaler Bibliotheken geleistet: (1) Die Recherche nach Informationen aus unterschiedlichen Digitalen Bibliotheken kann über einen Zugang ermöglicht werden. (2) Vorhandene Informationen über Autor*innen werden aus unterschiedlichsten Quellen eingesammelt und zu einem Autorenprofil aggregiert

    A Survey of Volunteered Open Geo-Knowledge Bases in the Semantic Web

    Full text link
    Over the past decade, rapid advances in web technologies, coupled with innovative models of spatial data collection and consumption, have generated a robust growth in geo-referenced information, resulting in spatial information overload. Increasing 'geographic intelligence' in traditional text-based information retrieval has become a prominent approach to respond to this issue and to fulfill users' spatial information needs. Numerous efforts in the Semantic Geospatial Web, Volunteered Geographic Information (VGI), and the Linking Open Data initiative have converged in a constellation of open knowledge bases, freely available online. In this article, we survey these open knowledge bases, focusing on their geospatial dimension. Particular attention is devoted to the crucial issue of the quality of geo-knowledge bases, as well as of crowdsourced data. A new knowledge base, the OpenStreetMap Semantic Network, is outlined as our contribution to this area. Research directions in information integration and Geographic Information Retrieval (GIR) are then reviewed, with a critical discussion of their current limitations and future prospects

    Lexical Knowledge Extraction: an Effective Approach to Schema and Ontology Matching

    Get PDF
    This paper’s aim is to examine what role Lexical Knowledge Extraction plays in data integration as well as ontology engineering.Data integration is the problem of combining data residing at distributed heterogeneous sources, and providing the user with a unified view of these data; a common and important scenario in data integration are structured or semi-structure data sources described by a schema.Ontology engineering is a subfield of knowledge engineering that studies the methodologies for building and maintaining ontologies. Ontology engineering offers a direction towards solving the interoperability problems brought about by semantic obstacles, such as the obstacles related to the definitions of business terms and software classes. In these contexts where users are confronted with heterogeneous information it is crucial the support of matching techniques. Matching techniques aim at finding correspondences between semantically related entities of different schemata/ontologies.Several matching techniques have been proposed in the literature based on different approaches, often derived from other fields, such as text similarity, graph comparison and machine learning.This paper proposes a matching technique based on Lexical Knowledge Extraction: first, an Automatic Lexical Annotation of schemata/ontologies is performed, then lexical relationships are extracted based on such annotations.Lexical Annotation is a piece of information added in a document (book, online record, video, or other data), that refers to a semantic resource such as WordNet. Each annotation has the property to own one or more lexical descriptions. Lexical annotation is performed by the Probabilistic Word Sense Disambiguation (PWSD) method that combines several disambiguation algorithms.Our hypothesis is that performing lexical annotation of elements (e.g. classes and properties/attributes) of schemata/ontologies makes the system able to automatically extract the lexical knowledge that is implicit in a schema/ontology and then to derive lexical relationships between the elements of a schema/ontology or among elements of different schemata/ontologies.The effectiveness of the method presented in this paper has been proven within the data integration system MOMIS
    corecore