
    Cross-concordances: terminology mapping and its effectiveness for information retrieval

    The German Federal Ministry for Education and Research funded a major terminology mapping initiative, which concluded in 2007. The task of this initiative was to organize, create and manage 'cross-concordances' between controlled vocabularies (thesauri, classification systems, subject heading lists), centred on the social sciences but quickly extending to other subject areas. 64 crosswalks with more than 500,000 relations were established. In the final phase of the project, a major evaluation effort was conducted to test and measure the effectiveness of the vocabulary mappings in an information system environment. The paper reports on the cross-concordance work and the evaluation results. Comment: 19 pages, 4 figures, 11 tables, IFLA conference 200
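
    The abstract describes cross-concordances as term-level mappings (equivalence, broader, narrower, related) between controlled vocabularies that can be applied at search time. As a rough illustration only, the sketch below shows how a query term from a source thesaurus might be translated into a target vocabulary via such a mapping; the terms and relation labels are invented, not taken from the project.

```python
# A toy cross-concordance and query translation step (illustrative only).
CROSS_CONCORDANCE = {
    # source term -> list of (relation, target term)
    "unemployment": [("exact", "Arbeitslosigkeit")],
    "labour market": [("exact", "Arbeitsmarkt"), ("narrower", "Arbeitsmarktpolitik")],
}

def translate_query_terms(terms, mapping, allowed_relations=("exact",)):
    """Return the target-vocabulary terms for a list of query terms."""
    translated = []
    for term in terms:
        for relation, target in mapping.get(term, []):
            if relation in allowed_relations:
                translated.append(target)
    return translated

print(translate_query_terms(["unemployment", "labour market"], CROSS_CONCORDANCE))
# -> ['Arbeitslosigkeit', 'Arbeitsmarkt']
```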

    A model for information retrieval driven by conceptual spaces

    A retrieval model describes the transformation of a query into a set of documents. The question is: what drives this transformation? In semantic information retrieval models, this transformation is driven by the content and structure of the semantic models. Here, Knowledge Organization Systems (KOSs) are the semantic models that encode the meaning employed for monolingual and cross-language retrieval. The focus of this research is the relationship between these meaning representations and their role and potential in augmenting the effectiveness of existing retrieval models. The proposed approach is unique in explicitly interpreting a semantic reference as a pointer to a concept in the semantic model that activates all of its linked neighboring concepts; what distinguishes it from other approaches is the formalization of the information retrieval model and the integration of knowledge resources from the Linguistic Linked Open Data cloud. Preprocessing the semantic model with Formal Concept Analysis enables the extraction of conceptual spaces (formal contexts) based on sub-graphs of the original structure of the semantic model. The types of conceptual spaces built in this case are limited to the KOS structural relations relevant to retrieval: exact match, broader, narrower, and related. They capture the definitional and relational aspects of the concepts in the semantic model. In addition, each formal context is assigned an operational role in the flow of processes of the retrieval system, enabling a clear path towards implementations of monolingual and cross-lingual systems. A retrieval system constructed by following the model's theoretical description achieved statistically significant improvements in both monolingual and bilingual settings when no query expansion methods were used. The test suite was run on the Cross-Language Evaluation Forum Domain Specific 2004-2006 collection, with additional extensions to match the specifics of this model.
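
    The central mechanism described above, a matched concept activating its linked neighbors restricted to retrieval-relevant relations, can be illustrated with a small sketch. The toy KOS, concept identifiers and relation names below are assumptions for illustration, not the thesis's actual data.

```python
# Minimal sketch: activate a concept and its directly linked neighbors,
# restricted to the relation types named in the abstract.
KOS = {
    "c:agriculture": {"narrower": ["c:crop-production"], "related": ["c:rural-development"]},
    "c:crop-production": {"broader": ["c:agriculture"], "related": ["c:irrigation"]},
}
RETRIEVAL_RELATIONS = ("exactMatch", "broader", "narrower", "related")

def activate(concept, kos, relations=RETRIEVAL_RELATIONS):
    """Return the concept together with its directly linked neighbors."""
    activated = {concept}
    for relation in relations:
        activated.update(kos.get(concept, {}).get(relation, []))
    return activated

print(sorted(activate("c:crop-production", KOS)))
# -> ['c:agriculture', 'c:crop-production', 'c:irrigation']
```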

    Methods for Matching of Linked Open Social Science Data

    In recent years, the concept of Linked Open Data (LOD) has gained popularity and acceptance across various communities and domains. Science policy and research organizations claim that the potential of semantic technologies and of data exposed in this manner may support and enhance research processes and infrastructures providing research information and services. In this thesis, we investigate whether these expectations can be met in the domain of the social sciences. In particular, we analyse and develop methods for matching social scientific data that is published as Linked Data, which we introduce as Linked Open Social Science Data. Based on expert interviews and a prototype application, we investigate the current consumption of LOD in the social sciences and its requirements. Following these insights, we first focus on the complete publication of Linked Open Social Science Data by extending and developing domain-specific ontologies for representing research communities, research data and thesauri. In the second part, methods for matching Linked Open Social Science Data are developed that address particular patterns and characteristics of the data typically used in social research. The results of this work contribute towards enabling a meaningful application of Linked Data in a scientific domain.
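
    One elementary matching strategy for such Linked Data is label-based comparison of concepts from two vocabularies, producing candidate skos:exactMatch links. The sketch below illustrates this generic idea with invented example vocabularies; it is not the specific method developed in the thesis.

```python
# Propose skos:exactMatch candidates when normalized preferred labels coincide.
def normalize(label):
    return " ".join(label.lower().replace("-", " ").split())

def label_match(vocab_a, vocab_b):
    """vocab_* maps concept URIs to lists of labels; returns candidate pairs."""
    index = {}
    for uri, labels in vocab_b.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(uri)
    candidates = []
    for uri, labels in vocab_a.items():
        for label in labels:
            for match in index.get(normalize(label), ()):
                candidates.append((uri, "skos:exactMatch", match))
    return candidates

thesaurus_a = {"a:1": ["Panel survey"], "a:2": ["Labour-market policy"]}
thesaurus_b = {"b:7": ["panel survey"], "b:9": ["labour market policy"]}
print(label_match(thesaurus_a, thesaurus_b))
# -> [('a:1', 'skos:exactMatch', 'b:7'), ('a:2', 'skos:exactMatch', 'b:9')]
```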

    Usage-driven Maintenance of Knowledge Organization Systems

    Knowledge Organization Systems (KOS) are typically used as background knowledge for document indexing in information retrieval. They have to be maintained and adapted constantly to reflect changes in the domain and the terminology. In this thesis, approaches are provided that support the maintenance of hierarchical knowledge organization systems, like thesauri, classifications, or taxonomies, by making information about the usage of KOS concepts available to the maintainer. The central contribution is the ICE-Map Visualization, a treemap-based visualization on top of a generalized statistical framework that is able to visualize almost arbitrary usage information. The proper selection of an existing KOS for available documents and the evaluation of a KOS for different indexing techniques by means of the ICE-Map Visualization are demonstrated. For the creation of a new KOS, an approach based on crowdsourcing is presented that uses feedback from Amazon Mechanical Turk to relate terms hierarchically. The extension of an existing KOS with new terms derived from the documents to be indexed is performed with a machine-learning approach that relates the terms to existing concepts in the hierarchy; the features are derived from text snippets in the result list of a web search engine. For the splitting of overpopulated concepts into new subconcepts, an interactive clustering approach is presented that is able to propose names for the new subconcepts. The implementation of a framework is described that integrates all approaches of this thesis and contains the reference implementation of the ICE-Map Visualization. It is extensible and supports the implementation of evaluation methods that build on other evaluations; additionally, it supports the visualization of the results and the implementation of new visualizations. An important building block for practical applications is the simple linguistic indexer that is presented as a minor contribution; it is knowledge-poor and works without any training. This thesis applies computer science approaches in the domain of information science. The introduction describes the foundations in information science; in the conclusion, the focus is placed on the relevance for practical applications, especially regarding the handling of KOSs of differing quality resulting from automatic and semi-automatic maintenance.
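
    The kind of usage information such maintenance support builds on can be sketched as follows: indexing frequencies are counted per concept and aggregated bottom-up along the hierarchy, so that heavily and rarely used branches become visible to the maintainer. This is a minimal illustration of the general idea, not the ICE-Map statistical framework itself; the toy hierarchy is invented.

```python
# Count how often each concept is used for indexing and propagate the counts
# to all ancestors in the hierarchy.
from collections import Counter

HIERARCHY = {            # child -> parent; toy concepts for illustration
    "economics": None,
    "labour economics": "economics",
    "monetary policy": "economics",
}

def aggregated_usage(assigned_concepts, hierarchy):
    """assigned_concepts: one list of concepts per indexed document."""
    direct = Counter(c for doc in assigned_concepts for c in doc)
    total = Counter()
    for concept, count in direct.items():
        node = concept
        while node is not None:          # propagate the count to all ancestors
            total[node] += count
            node = hierarchy.get(node)
    return direct, total

docs = [["labour economics"], ["labour economics", "monetary policy"], ["economics"]]
direct, total = aggregated_usage(docs, HIERARCHY)
print(dict(direct))  # {'labour economics': 2, 'monetary policy': 1, 'economics': 1}
print(dict(total))   # {'labour economics': 2, 'economics': 4, 'monetary policy': 1}
```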

    LIME: Towards a Metadata Module for OntoLex

    The OntoLex W3C Community Group has been working for more than a year on realizing a proposal for a standard ontology lexicon model. As the core specification of the model is almost complete, the group started development of additional modules for specific tasks and use cases. We think that in many usage scenarios (e.g. linguistic enrichment, localization and alignment of ontologies) the discovery and exploitation of linguistically grounded datasets may benefit from summarizing information about their linguistic expressivity. While the VoID vocabulary covers the need for general metadata about linked datasets, this more specific information demands a dedicated extension. In this paper, we fill this gap by introducing LIME (Linguistic Metadata), a new vocabulary aiming at completing the OntoLex standard with specifications for linguistic metadata.
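
    To give a flavour of what such dataset-level linguistic metadata could look like, the sketch below builds a small RDF description that combines VoID statistics with LIME-style lexicalization metadata using rdflib. The lime: namespace URI and the class and property names are assumptions for illustration and may differ from the exact terms proposed in this paper.

```python
# Describe a dataset with VoID statistics plus illustrative LIME-style
# lexicalization metadata (term names are assumptions, not the paper's spec).
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

VOID = Namespace("http://rdfs.org/ns/void#")
LIME = Namespace("http://www.w3.org/ns/lemon/lime#")  # assumed namespace

g = Graph()
g.bind("void", VOID)
g.bind("lime", LIME)

dataset = URIRef("http://example.org/dataset/agrovoc")
lexicalization = URIRef("http://example.org/dataset/agrovoc/lex-en")

g.add((dataset, RDF.type, VOID.Dataset))
g.add((dataset, VOID.triples, Literal(123456)))

# Hypothetical lexicalization metadata: which language, how many lexicalizations.
g.add((lexicalization, RDF.type, LIME.Lexicalization))
g.add((lexicalization, LIME.referenceDataset, dataset))
g.add((lexicalization, LIME.language, Literal("en")))
g.add((lexicalization, LIME.lexicalizations, Literal(98765)))

print(g.serialize(format="turtle"))
```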

    Information Retrieval-Mehrwertdienste für Digitale Bibliotheken: Crosskonkordanzen und Bradfordizing

    In spite of huge document sets for cross-database literature searches, academic users expect a high ratio of relevant, high-quality documents in result sets. Besides direct full-text access to documents, the order and structure of the listed results (ranking) now play a decisive role in the design of search systems. Users also expect flexible information systems that allow them to influence the ranking of documents and to apply alternative ranking techniques. This thesis proposes two value-added approaches for search systems that address typical problems in searching scientific literature and can measurably improve the retrieval situation. The two value-added services, semantic treatment of heterogeneity (using the example of cross-concordances) and re-ranking based on Bradfordizing, are applied in different phases of the search; they are described in detail, and their effectiveness for typical subject-specific searches is evaluated in the empirical part of the thesis. The primary goal of the thesis is to study whether the proposed alternative re-ranking approach, Bradfordizing, is operable in the domain of bibliographic databases and whether it can profitably be deployed in information systems and offered to users. Topics and data from two evaluation projects (CLEF and KoMoHe) were used for the tests. The intellectually assessed documents come from seven academic abstracting and indexing databases covering the social sciences, political science, economics, psychology and medicine. The evaluation of the cross-concordances (82 topics altogether) shows that retrieval results improve significantly for all cross-concordances; it also shows that interdisciplinary cross-concordances have the strongest (positive) effect on the search results. The evaluation of Bradfordizing re-ranking (164 topics altogether) shows that documents from the core zone (core journals) achieve significantly higher precision than documents from zone 2 and zone 3 (periphery journals) for most test series. This relevance advantage after Bradfordizing can be demonstrated empirically across a very broad range of topics and on two independent document corpora, for journals as well as for monographs. (author's abstract)
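
    Bradfordizing, as evaluated above, re-ranks a result set by journal productivity: journals are sorted by the number of hits they contribute, split into three zones of roughly equal document counts, and documents from the core zone are listed first. The following is a minimal sketch of this idea with toy data and simplified zone boundaries, not the thesis's actual implementation.

```python
from collections import Counter

def bradfordize(results):
    """Re-rank a result list so that documents from high-frequency ('core')
    journals come first. `results` is a list of dicts with a 'journal' key
    (e.g. an ISSN); the original order is kept within each zone.
    Zone boundaries here simply split the documents into three roughly equal
    thirds by cumulative journal productivity (a simplification)."""
    freq = Counter(doc["journal"] for doc in results)
    ranked_journals = [j for j, _ in freq.most_common()]
    zone_of = {}
    cumulative, third = 0, len(results) / 3
    for journal in ranked_journals:
        zone_of[journal] = min(2, int(cumulative // third))  # 0=core ... 2=periphery
        cumulative += freq[journal]
    # stable sort: zone first, original rank preserved within zones
    return sorted(results, key=lambda d: zone_of[d["journal"]])

hits = [
    {"title": "A", "journal": "0001-0001"},
    {"title": "B", "journal": "0002-0002"},
    {"title": "C", "journal": "0001-0001"},
    {"title": "D", "journal": "0003-0003"},
    {"title": "E", "journal": "0001-0001"},
]
for doc in bradfordize(hits):
    print(doc["title"], doc["journal"])  # core-journal documents A, C, E come first
```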

    Liage de données RDF : évaluation d'approches interlingues

    The Semantic Web extends the Web by publishing structured and interlinked data using RDF. An RDF data set is a graph where resources are nodes labelled in natural languages. One of the key challenges of linked data is the ability to discover links across RDF data sets. Given two data sets, equivalent resources should be identified and linked by owl:sameAs links. This problem is particularly difficult when resources are described in different natural languages. This thesis investigates the effectiveness of linguistic resources for interlinking RDF data sets. For this purpose, we introduce a general framework in which each RDF resource is represented as a virtual document containing textual information from neighboring nodes; the labels of the neighboring nodes constitute the context of a resource. Once virtual documents have been created, they are projected into the same space in order to be compared, which can be achieved by using machine translation or multilingual lexical resources. Once the documents are in the same space, similarity measures are applied to find identical resources, and the similarity between documents is taken as the similarity between the corresponding RDF resources. We experimentally evaluated different methods for linking RDF data within the proposed framework. In particular, two strategies are explored: applying machine translation, or using references to multilingual lexical resources. Overall, the evaluation shows the effectiveness of cross-lingual string-based approaches for linking RDF resources expressed in different languages. The methods were evaluated on resources in English, Chinese, French and German. The best performance (over 0.90 F-measure) was obtained by the machine translation approach. This shows that the similarity-based method can be applied successfully to RDF resources independently of their type (named entities or thesaurus concepts). The best experimental results involving just a pair of languages demonstrate the usefulness of such techniques for interlinking RDF resources cross-lingually.
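
    The virtual-document approach described above can be sketched compactly: each resource is expanded into a bag of labels from its neighborhood, all documents are projected into one common space (here a TF-IDF space, with translation into a shared language assumed to have happened already), and cosine similarity ranks candidate owl:sameAs links. The toy resources and labels below are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def virtual_document(resource, graph):
    """Concatenate the resource's own labels with the labels of its neighbors."""
    parts = list(graph[resource]["labels"])
    for neighbor in graph[resource]["links"]:
        parts.extend(graph[neighbor]["labels"])
    return " ".join(parts)

# Toy graphs; in practice these would come from two RDF data sets.
source = {"s1": {"labels": ["maize", "cereal crop"], "links": []}}
target = {"t1": {"labels": ["corn", "cereal crop", "grain"], "links": []},
          "t2": {"labels": ["solar energy", "renewable"], "links": []}}

docs = [virtual_document("s1", source)] + [virtual_document(t, target) for t in ("t1", "t2")]
tfidf = TfidfVectorizer().fit_transform(docs)
scores = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
for candidate, score in zip(("t1", "t2"), scores):
    print(candidate, round(float(score), 3))  # higher score = better owl:sameAs candidate
```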

    LODNav – An Interactive Visualization of the Linking Open Data Cloud

    The emergence of the Linking Open Data Cloud (LODC) is an example of the adoption of Linked Data principles and the creation of a Web of Data. An increasing amount of information is linked across member datasets of the LODC by means of RDF links, yet there is little support for a human to understand which datasets are connected to one another. This research presents a novel approach for understanding these interconnections with the publicly accessible tool LODNav – Linking Open Data Navigator. LODNav provides a visualization metaphor for the LODC by positioning its member datasets on a world map according to each dataset's geographical location. This interactive tool aims to provide a dynamic, up-to-date visualization of the LODC and allows information about the datasets and their interconnections to be extracted as RDF data.

    Content Enrichment of Digital Libraries: Methods, Technologies and Implementations

    Parallel to the establishment of the concept of a "digital library", there have been rapid developments in the fields of semantic technologies, information retrieval and artificial intelligence. The idea is to make use of these three fields to crosslink bibliographic data, i.e. library content, and to enrich it "intelligently" with additional, especially non-library, information. By linking the contents of a library, it is possible to offer users access to semantically similar contents of different digital libraries. For instance, a list of semantically similar publications from completely different subject areas and from different digital libraries can be made accessible. In addition, the user is able to see a richer author profile, enriched with information such as biographical details, name alternatives, images, job titles, institute affiliations, etc. This information comes from a wide variety of sources, most of which are not library sources. In order to make such scenarios a reality, this dissertation follows two approaches. The first approach is about crosslinking digital library content in order to offer semantically similar publications based on additional information for a publication. Hence, this approach uses publication-related metadata as a basis. Aligned terms between linked open data repositories and thesauri are considered an important starting point, with narrower, broader and related concepts taken into account through semantic data models such as SKOS. Information retrieval methods are applied to identify publications with high semantic similarity. For this purpose, vector space model and word embedding approaches are applied and analyzed comparatively. The analyses are performed in digital libraries with different thematic focuses (e.g. economics and agriculture). Using machine learning techniques, metadata is enriched, e.g. with synonyms for content keywords, in order to further improve similarity calculations. To ensure quality, the proposed approaches are analyzed comparatively with different metadata sets and assessed by experts. Through the combination of different information retrieval methods, the quality of the results can be improved further; this is especially true when user interactions offer possibilities for adjusting the search properties. In the second approach pursued in this dissertation, author-related data are harvested in order to generate a comprehensive author profile for a digital library. For this purpose, non-library sources, such as linked data repositories (e.g. WIKIDATA), and library sources, such as authority data, are used. When such different sources are used, the disambiguation of author names via already existing persistent identifiers becomes necessary. To this end, we offer an algorithmic approach to disambiguate authors which makes use of authority data such as the Virtual International Authority File (VIAF). From a computer science perspective, the methodological value of this dissertation lies in the combination of semantic technologies with methods of information retrieval and artificial intelligence to increase the interoperability between digital libraries and between libraries and non-library sources. By positioning this dissertation as an application-oriented contribution to improving interoperability, two major contributions are made in the context of digital libraries: (1) the retrieval of information from different digital libraries can be made possible via a single point of access; (2) existing information about authors is collected from different sources and aggregated into one author profile.
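
    For the second approach, author records from library and non-library sources can be merged into a single profile whenever they share a persistent identifier such as a VIAF ID. The sketch below illustrates this identifier-based aggregation with invented field names and records; it is not the dissertation's actual disambiguation algorithm.

```python
# Merge author records from different sources into one profile per VIAF ID
# (field names and example records are illustrative assumptions).
def merge_author_profiles(library_records, wikidata_records):
    profiles = {}
    for rec in library_records + wikidata_records:
        viaf = rec.get("viaf")
        if viaf is None:
            continue  # records without a persistent identifier are left unmatched
        profile = profiles.setdefault(viaf, {"viaf": viaf, "names": set(), "sources": set()})
        profile["names"].update(rec.get("names", []))
        profile["sources"].add(rec["source"])
        for key, value in rec.items():
            if key not in ("viaf", "names", "source") and value:
                profile.setdefault(key, value)  # keep the first non-empty value
    return profiles

library = [{"source": "GND", "viaf": "12345", "names": ["Doe, Jane"], "affiliation": "Example University"}]
wikidata = [{"source": "WIKIDATA", "viaf": "12345", "names": ["Jane Doe"], "image": "http://example.org/jd.jpg"}]
for viaf, profile in merge_author_profiles(library, wikidata).items():
    print(viaf, sorted(profile["names"]), sorted(profile["sources"]))
# -> 12345 ['Doe, Jane', 'Jane Doe'] ['GND', 'WIKIDATA']
```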