6 research outputs found

    Automatically evaluating the quality of textual descriptions in cultural heritage records

    Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a repository may not be affordable, especially in large collections, in this paper we specifically address the problem of automatically assessing the quality of metadata, focusing in particular on textual descriptions of cultural heritage items. We describe a novel approach based on machine learning that tackles this problem by framing it as a binary text classification task aimed at evaluating the accuracy of textual descriptions. We report our assessment of different classifiers using a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains from the Italian digital library “Cultura Italia” and was annotated with accuracy information in terms of compliance with the cataloguing guidelines. The results empirically confirm that our proposed approach can effectively support curators (F1 ~ 0.85) in assessing the quality of the textual descriptions of the records in their collections and provide some insights into how training data, specifically their size and domain, can affect classification performance.

    Support for Information-Seeking Strategies

    Longer search episodes comprise multiple search actions, which can be grouped into several classes. The classification used in this work is the ISS classification by Belkin, Marchetti and Cool, which uses four facets (method, goal, mode and resource used), each of which has two values. Assuming that support features for each class are known, the research question was whether each ISS class must be supported by a different, specialized search user interface in order to help the user optimally across many situations, or whether a single interface can offer support mechanisms for any search action the user is engaged in. The research question was examined in three experiments. The ISS classification consists of 16 classes. Since studying the research question for all of these classes would have been too laborious, two facets, goal and resource used, were omitted, leaving the two facets method and mode with a total of four classes for examination. Support mechanisms for each value of the two facets (scanning, searching, recognition and specification) were gathered, assuming that the support mechanisms are as independent of each other as the underlying facets. Support features for the facet value recognition were examined in two experiments. The first experiment compared a table-based result list presentation with a list-based one using highlighting in terms of their support for visual search. Participants were asked to locate search targets in prepared result lists using each of the result list variants solely by means of visual search (within-subjects design). Their success was measured by the number of search targets found per unit of time. Neither list nor table provided a statistically significant advantage. The second experiment added a baseline, a conventional list-based result list without highlighting; apart from this, the experiment was very similar to the first one. Again, none of the studied result list variants differed from any other to a statistically significant degree. For the other facet values, support mechanisms were identified in a literature search and then used in the last experiment. The main research question was examined using three search systems that were similar to each other and based on the ezDL system. The first one (baseline) was a very basic variant of the ezDL system and provided no support features other than a translation feature. The second system was an adaptive interface that provided support features only for the ISS class the user was currently engaged in. The third system provided all support features of the second system for all ISS classes at once. Participants were asked to complete search tasks with one of the systems (between-subjects design). Their success was measured by the number of required documents located per unit of time. None of the systems provided a statistically significant benefit over any of the others.

    Usage-driven Maintenance of Knowledge Organization Systems

    Knowledge Organization Systems (KOS) are typically used as background knowledge for document indexing in information retrieval. They have to be maintained and adapted constantly to reflect changes in the domain and the terminology. In this thesis, approaches are provided that support the maintenance of hierarchical knowledge organization systems, like thesauri, classifications, or taxonomies, by making information about the usage of KOS concepts available to the maintainer. The central contribution is the ICE-Map Visualization, a treemap-based visualization on top of a generalized statistical framework that is able to visualize almost arbitrary usage information. The proper selection of an existing KOS for available documents and the evaluation of a KOS for different indexing techniques by means of the ICE-Map Visualization are demonstrated. For the creation of a new KOS, an approach based on crowdsourcing is presented that uses feedback from Amazon Mechanical Turk to relate terms hierarchically. The extension of an existing KOS with new terms derived from the documents to be indexed is performed with a machine-learning approach that relates the terms to existing concepts in the hierarchy. The features are derived from text snippets in the result list of a web search engine. For the splitting of overpopulated concepts into new subconcepts, an interactive clustering approach is presented that is able to propose names for the new subconcepts. The implementation of a framework is described that integrates all approaches of this thesis and contains the reference implementation of the ICE-Map Visualization. It is extendable and supports the implementation of evaluation methods that build on other evaluations. Additionally, it supports the visualization of the results and the implementation of new visualizations. An important building block for practical applications is the simple linguistic indexer that is presented as a minor contribution. It is knowledge-poor and works without any training. This thesis applies computer science approaches in the domain of information science. The introduction describes the foundations in information science; in the conclusion, the focus is set on the relevance for practical applications, especially regarding the handling of different qualities of KOSs due to automatic and semiautomatic maintenance.

    Monitoramento internacional da produção científica em ciência da informação. volume 1

    256 p. Objective: To identify publication trends of themed issues (special issues) in information science journals. Design/methodology/approach: A survey of themed issues indexed in international information science and librarianship databases for the period 2005 to 2010, tracked in the Asksam data manager to eliminate duplicates and derive statistical data; classification of the introductory articles of the themed issues according to the Information Science Taxonomy (Donald T. Hawkins and colleagues, 2003), followed by analyses of the state of the art. Results: In the selected period, 185 themed issues were published, distributed across 11 categories. The most represented categories in quantitative terms were "Information science research" (20%, 37 themed issues), "Libraries and library services" (17%, 32 issues, 12 of them on education and training in librarianship and information science), and "Information technologies" and "Social issues" (14% each, 26 issues). Originality/value: Mapping of publication trends of themed issues in preparation for the second product of this project: a comparison of the results of this first product with papers presented at information science conferences, for the purpose of proposing themed issues for the journal Ciência da Informação, published by Ibict.