Automatically evaluating the quality of textual descriptions in cultural heritage records
Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. In order to be effectively exploited, metadata information has to meet some quality standards, typically defined in the collection usage guidelines. As manually checking the quality of metadata in a repository may not be affordable, especially in large collections, in this paper we specifically address the problem of automatically assessing the quality of metadata, focusing in particular on textual descriptions of cultural heritage items. We describe a novel approach based on machine learning that tackles this problem by framing it as a binary text classification task aimed at evaluating the accuracy of textual descriptions. We report our assessment of different classifiers using a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains from the Italian digital library “Cultura Italia” and was annotated with accuracy information in terms of compliance with the cataloguing guidelines. The results empirically confirm that our proposed approach can effectively support curators (F1 ~ 0.85) in assessing the quality of the textual descriptions of the records in their collections and provide some insights into how training data, specifically their size and domain, can affect classification performance.
Support for Information-Seeking Strategies
Longer search episodes comprise multiple search actions. These search actions can be grouped into several classes. The classification used in this work is the ISS classification by Belkin, Marchetti and Cool, which uses four facets (method, goal, mode and resource used), each of which has two values. Assuming that support features for each class are known, the research question was whether each ISS class must be supported by a different, specialized search user interface in order to help the user optimally across many situations, or whether a single interface offering support mechanisms for every conceivable search action suffices. The research question was examined in three experiments.
The ISS classification consists of 16 classes. Since studying the research question for all of these classes would have been too laborious, two facets, goal and resource used, were omitted, leaving the two facets method and mode with a total of four classes for examination.
Support mechanisms for each value of the two facets, scanning, searching, recognition, and specification, were gathered, assuming that the support mechanisms are as independent of each other as the underlying facets.
Support features for the facet value recognition were examined in two experiments. The first experiment compared a table-based result list presentation with a list-based one using highlighting in terms of their support for visual search. Participants were asked to locate search targets in prepared result lists using each of the result list variants solely by means of visual search (within-subjects design). Their success was measured by the number of search targets found per unit of time. Neither list nor table provided a statistically significant advantage over the other. The second experiment added a baseline, a conventional list-based presentation without highlighting; apart from this, the experiment was very similar to the first one. Again, none of the studied result list variants differed from any other to a statistically significant degree. For the other facet values, support mechanisms were identified in a literature search and then used in the last experiment.
The main research question was examined using three search systems that were similar to each other and based on the ezDL system. The first one (baseline) was a very basic variant of the ezDL system and provided no support features other than a translation feature. The second system was an adaptive interface that provided support features only for the ISS class the user was currently engaged in. The third system provided all support features of the second system for all ISS classes at once. Participants were asked to complete search tasks with one of the systems (between-subjects design). Their success was measured by the number of required documents they located per unit of time. No statistically significant difference was found between the systems.
Usage-driven Maintenance of Knowledge Organization Systems
Knowledge Organization Systems (KOS) are typically used as background knowledge
for document indexing in information retrieval. They have to be maintained
and adapted constantly to reflect changes in the domain and the terminology. In
this thesis, approaches are provided that support the maintenance of hierarchical
knowledge organization systems, like thesauri, classifications, or taxonomies, by
making information about the usage of KOS concepts available to the maintainer.
The central contribution is the ICE-Map Visualization, a treemap-based visualization
on top of a generalized statistical framework that is able to visualize almost
arbitrary usage information. The proper selection of an existing KOS for available
documents and the evaluation of a KOS for different indexing techniques by means
of the ICE-Map Visualization is demonstrated.
For the creation of a new KOS, an approach based on crowdsourcing is presented
that uses feedback from Amazon Mechanical Turk to relate terms hierarchically.
The extension of an existing KOS with new terms derived from the documents
to be indexed is performed with a machine-learning approach that relates
the terms to existing concepts in the hierarchy. The features are derived from text
snippets in the result list of a web search engine. For the splitting of overpopulated
concepts into new subconcepts, an interactive clustering approach is presented that
is able to propose names for the new subconcepts.
The implementation of a framework is described that integrates all approaches
of this thesis and contains the reference implementation of the ICE-Map Visualization.
It is extendable and supports the implementation of evaluation methods
that build on other evaluations. Additionally, it supports the visualization of the
results and the implementation of new visualizations. An important building block
for practical applications is the simple linguistic indexer that is presented as a minor
contribution. It is knowledge-poor and works without any training.
This thesis applies computer science approaches in the domain of information
science. The introduction describes the foundations in information science; in the
conclusion, the focus is set on the relevance for practical applications, especially
regarding the handling of different qualities of KOSs due to automatic and semiautomatic
maintenance.
Monitoramento internacional da produção científica em ciência da informação. volume 1
256 p. Objective – To identify trends in the publication of
themed issues (special issues) in information science
journals. Design/methodology/approach – Survey of themed
issues indexed in international information science and
library science databases in the period 2005-2010, tracked in
the Asksam data manager to eliminate duplicates and derive
statistical data; classification of the introductory articles
of the themed issues according to the Information Science
Taxonomy (Donald T. Hawkins and collaborators, 2003), and the
resulting analyses of the state of the art. Results – In the
selected period, 185 themed issues were published, distributed
across 11 categories. The most representative categories in
quantitative terms were “Information science research” (20%,
37 themed issues), “Libraries and library services” (17%,
32 issues, 12 of them on education and training in library and
information science), and “Information technologies” and
“Social issues” (14% each, 26 issues). Originality/value –
Mapping of the publication trends of themed issues in
preparation for the second product of this project: a
comparison of the results of this first product with papers
presented at information science conferences, for the purpose
of proposing themed issues for the journal Ciência da
Informação, published by Ibict.