    Publindex: A web service to automatically evaluate research publications according to customized criteria

    We introduce Publindex, a system that retrieves, classifies, and returns the research publications of a given researcher according to criteria and in a format predefined by the user.

    Building and Using Digital Libraries for ETDs

    Despite the high value of electronic theses and dissertations (ETDs), the global collection has seen limited use. To extend such use, a new approach to building digital libraries (DLs) is needed. Fortunately, recent decades have seen a vast amount of “gray literature” become available through a diverse set of institutional repositories as well as regional and national libraries and archives. Most of the works in those collections include ETDs and are often freely available in keeping with the open-access movement, but such access is limited by the services of the supporting information systems. As explained through a set of scenarios, ETDs can better meet the needs of diverse stakeholders if customer discovery methods are used to identify personas and user roles as well as their goals and tasks. Hence, DLs with a rich collection of services, including newer, more advanced ones, can be organized so that those services, and expanded workflows building on them, can be adapted to meet personalized goals as well as traditional ones, such as discovery and exploration.

    Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data

    In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions rely substantially on feature engineering, such as biographical feature extraction or the construction of auxiliary features from Wikipedia. However, in many scenarios such features may be costly to obtain, or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only relational data in the form of anonymized graphs. Second, most existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that name disambiguation be performed in an online streaming fashion in order to identify records of new ambiguous entities that have no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation we present a number of novel approaches that address name disambiguation from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.

    Content Enrichment of Digital Libraries: Methods, Technologies and Implementations

    Parallel to the establishment of the concept of a "digital library", there have been rapid developments in the fields of semantic technologies, information retrieval and artificial intelligence. The idea is to make use of these three fields to crosslink bibliographic data, i.e., library content, and to enrich it "intelligently" with additional, especially non-library, information. By linking the contents of a library, it is possible to offer users access to semantically similar contents of different digital libraries. For instance, starting from a given publication, a list of semantically similar publications from completely different subject areas and from different digital libraries can be made accessible. In addition, users can view a broader author profile, enriched with information such as biographical details, name alternatives, images, job titles, institute affiliations, etc. This information comes from a wide variety of sources, most of which are not library sources. In order to make such scenarios a reality, this dissertation follows two approaches. The first approach crosslinks digital library content in order to offer semantically similar publications based on additional information for a publication. Hence, this approach uses publication-related metadata as a basis. The aligned terms between linked open data repositories/thesauri are treated as an important starting point, taking narrower, broader and related concepts into account through semantic data models such as SKOS. Information retrieval methods are applied to identify publications with high semantic similarity. For this purpose, vector space model and "word embedding" approaches are applied and analyzed comparatively. The analyses are performed in digital libraries with different thematic focuses (e.g. economy and agriculture). Using machine learning techniques, metadata is enriched, e.g. with synonyms for content keywords, in order to further improve similarity calculations. To ensure quality, the proposed approaches are analyzed comparatively with different metadata sets, with the results assessed by experts. Through the combination of different information retrieval methods, the quality of the results can be further improved. This is especially true when user interactions offer possibilities for adjusting the search properties. In the second approach pursued in this dissertation, author-related data are harvested in order to generate a comprehensive author profile for a digital library. For this purpose, both non-library sources, such as linked data repositories (e.g. WIKIDATA), and library sources, such as authority data, are used. When such different sources are used, the disambiguation of author names via already existing persistent identifiers becomes necessary. To this end, we offer an algorithmic approach to disambiguate authors, which makes use of authority data such as the Virtual International Authority File (VIAF). With respect to computer science, the methodological value of this dissertation lies in the combination of semantic technologies with methods of information retrieval and artificial intelligence to increase the interoperability between digital libraries and between libraries and non-library sources. By positioning this dissertation as an application-oriented contribution to improving interoperability, two major contributions are made in the context of digital libraries: (1) The retrieval of information from different digital libraries can be made possible via a single point of access. (2) Existing information about authors is collected from different sources and aggregated into one author profile.

    A relevance feedback approach for the author name disambiguation problem

    Advisors: Ariadne Maria Brito Rizzoni Carvalho, Ricardo da Silva Torres. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Degree: Master in Computer Science (Mestre em Ciência da Computação).

    Abstract: This work presents a new semi-automatic name disambiguation method that exploits user feedback on ambiguous references across iterations. An unsupervised step is used to define pure training samples, and a hybrid supervised step is employed to learn a classification model for assigning references to authors. Our disambiguation method combines the Optimum-Path Forest (OPF) classifier with complex reference similarity functions generated by a Genetic Programming (GP) framework. The main contributions of this work are: (i) the proposal of a novel author name disambiguation method; (ii) the evaluation, in a new application, of the combination of the GP and OPF algorithms, also known as GOPF, augmented by a relevance feedback step; (iii) the evaluation of the GOPF algorithm on a multi-class classification problem; and (iv) the extension of the GOPF algorithm to handle open-set classification problems, i.e., classification problems in which not all classes are known in advance. The proposed method was validated on two traditional collections widely used for the evaluation of author name disambiguation methods: one is a collection extracted from DBLP that comprises 4,287 references associated with 220 distinct authors; the other, called KISTI, was built by the Korea Institute of Science and Technology Information and contains the top 1000 most frequent author names from the late-2007 DBLP database. After 5 iterations of relevance feedback, our approach yielded the best results for author name disambiguation when compared with state-of-the-art methods that consider only basic reference information, such as author names, publication title, and venue title.