12 research outputs found
Developing Web Archiving Metadata Best Practices to Meet User Needs
The OCLC Research Library Partnership Web Archiving Metadata Working Group was established to meet a widely recognized need for best practices for descriptive metadata for archived websites. The Working Group recognizes that developing successful best practices intended to ensure discoverability requires an understanding of user needs and behavior. We have therefore conducted an extensive literature review to build our knowledge and will issue a white paper summarizing what we have learned. We are also studying existing and emerging approaches to descriptive metadata in this realm and will publish a second report recommending best practices. We will seek broad community input prior to publication.
An Evaluation Study of Query Processing Methods Using the Inflected-Form Index of the Finnish Web Archive (Kyselynkäsittelymenetelmien evaluointitutkimus Suomalaisen verkkoarkiston taivutusmuotoindeksiä käyttäen)
The rich morphology of Finnish poses challenges for information retrieval. For retrieval to be effective, the word form in the query must be made to match the word form that occurs in the document. This study compares the effectiveness of four query processing methods on an inflected-form index built from documents.
Earlier evaluation studies of information retrieval on Finnish-language material have mainly used test collections built from newspaper article collections. Instead of an article collection, this study uses a test collection built from the Finnish Web Archive, which contains web pages whose content and quality vary widely. The methods compared in the thesis are Frequent case generation 3 (FCG3), Simple word ending based rule generator (SWERG+), Snowball stemming combined with a wildcard, and unprocessed queries.
The research method of this study is testing according to the laboratory model of information retrieval. To carry it out, a test collection had to be built from the Finnish Web Archive. In the end, 16 search topics were selected for the test collection, and query runs were performed with short queries formed from them. The results of the runs were measured with precision at the first ten result documents and with a cumulated gain measure.
The study found that the FCG3 method produced better results than the unprocessed queries that served as the baseline. By contrast, the SWERG+ method, which had performed well in earlier research, did not produce better results than the baseline in this study. Snowball stemming combined with a wildcard, in turn, produced results worse than the baseline.
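The two measures named above can be illustrated with a small sketch. It assumes graded relevance judgments for each ranked result (0 meaning not relevant); the function names, the relevance threshold, and the sample judgments are illustrative assumptions and are not taken from the thesis.

```python
from itertools import accumulate


def precision_at_k(gains, k=10, threshold=1):
    """Share of the first k results whose graded relevance is at least threshold."""
    top = gains[:k]
    relevant = sum(1 for g in top if g >= threshold)
    return relevant / k


def cumulated_gain(gains):
    """Running sum of graded relevance down the ranked list (the CG vector)."""
    return list(accumulate(gains))


# Hypothetical graded judgments (0-3) for the first ten results of one query run.
run = [3, 0, 2, 0, 0, 1, 0, 0, 2, 0]
print(precision_at_k(run))      # 0.4
print(cumulated_gain(run)[-1])  # 8
```

Precision at ten only counts how many of the top results are relevant, while cumulated gain also rewards highly relevant documents, which is why the two are often reported together.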
Web Archive Services Framework for Tighter Integration Between the Past and Present Web
Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most web archiving research has focused on crawling and preservation activities, with little focus on delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users' needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services, ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in the web archives. The third level is the URI level, which focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. At this level, we define the web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.
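The four-level categorization can be pictured as a simple lookup from level to the services the abstract names. This is only a sketch of the structure: the enum, the dictionary, and the `services_for` helper are hypothetical names, and the URI level is left empty because the abstract does not name a service for it.

```python
from enum import Enum


class ArchiveLevel(Enum):
    """Levels of the service pyramid, ordered from lowest to highest."""
    CONTENT = 1
    METADATA = 2
    URI = 3
    ARCHIVE = 4


# Services or artifacts named in the abstract for each level.
SERVICES = {
    ArchiveLevel.CONTENT: ["ArcContent"],
    ArchiveLevel.METADATA: ["ArcLink", "ArcThumb"],
    ArchiveLevel.URI: [],  # no service name is given in the abstract
    ArchiveLevel.ARCHIVE: ["Web Archive Profiles"],
}


def services_for(level):
    """Return the names associated with one level of the pyramid."""
    return SERVICES[level]


for level in ArchiveLevel:
    print(level.name, services_for(level))
```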
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Much of our cultural discourse occurs primarily on the Web. Thus, Web preservation is a fundamental precondition for multiple disciplines. Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in a paradox: the larger the collection, the harder it is to understand. Meanwhile, as the sheer volume of data grows on the Web, storytelling is becoming a popular technique in social media for selecting Web resources to support a particular narrative or story.
In this dissertation, we address the problem of understanding the archived collections through proposing the Dark and Stormy Archive (DSA) framework, in which we integrate storytelling social media and Web archives. In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users already are familiar with, such as Storify.
To inform our work of generating stories from archived collections, we start by building a baseline for the structural characteristics of popular (i.e., receiving the most views) human-generated stories by investigating stories from Storify. Furthermore, we examined the entire population of Archive-It collections to better understand the characteristics of the collections we intend to summarize. We then filter off-topic pages from the collections, using different methods to detect when an archived page in a collection has gone off-topic. We created a gold standard dataset from three Archive-It collections to evaluate the proposed methods at different thresholds. From the gold standard dataset, we identified five behaviors for the TimeMaps (a list of archived copies of a page) based on the page’s aboutness. Based on a dynamic slicing algorithm, we divide the collection and cluster the pages in each slice. We then select the best representative page from each cluster based on different quality metrics (e.g., the replay quality and the quality of the snippet generated from the page). Finally, we put the selected pages in chronological order and visualize them using Storify.
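One way to picture the off-topic filtering step is a similarity threshold applied over a TimeMap. The sketch below is an illustration under stated assumptions, not the dissertation's method: it compares each archived copy with the first copy using cosine similarity over raw term counts, and the tokenizer, the 0.15 threshold, and the sample texts are all hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased bag of words for one archived copy."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def off_topic_indices(timemap_texts, threshold=0.15):
    """Indices of copies whose similarity to the first copy falls below threshold."""
    if not timemap_texts:
        return []
    reference = term_counts(timemap_texts[0])
    return [i for i, text in enumerate(timemap_texts)
            if cosine(reference, term_counts(text)) < threshold]


# Hypothetical extracted text of three archived copies of one page.
timemap = [
    "Egypt revolution protests in Tahrir Square",
    "Protests continue in Tahrir Square as Egypt revolution grows",
    "This domain is for sale. Buy cheap hosting now",
]
print(off_topic_indices(timemap))  # [2]
```

Varying `threshold` against a labeled dataset is one way to evaluate such a detector at different operating points, in the spirit of the gold standard evaluation the abstract describes.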
To evaluate the DSA framework, we obtained a ground truth dataset of hand-crafted stories from Archive-It collections generated by expert archivists. We used Amazon’s Mechanical Turk to evaluate the automatically generated stories against the stories that were created by domain experts. The results show that the stories automatically generated by the DSA are indistinguishable from those created by human subject domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated stories.
Information search in web archives
Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2014. Web archives preserve information that was published on the web or digitized from printed publications. Much of that information is unique and historically valuable. However, users do not have dedicated tools to find the desired information, which hampers the usefulness of web archives. This dissertation investigates solutions to advance web archive information retrieval (WAIR) and contributes to knowledge about its technology and users. The thesis underlying this work is that search results can be improved by exploiting temporal information intrinsic to web archives. This temporal information was leveraged from two different angles. First, the long-term persistence of web documents was analyzed and modeled to better estimate their relevance to a query. Second, a temporal-dependent ranking framework that learns and combines ranking models specific for each period was devised. This approach contrasts with a typical single-model approach that ignores the variance of web characteristics over time. The proposed approach was empirically validated through various controlled experiments that demonstrated its superiority over the state-of-the-art in WAIR.
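The idea of combining ranking models specific to each period can be sketched as a weighted blend. The following is a minimal illustration under assumptions, not the framework from the thesis: each period has a linear model whose weights are assumed to be already learned, and the distance-based blending rule, the feature names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PeriodModel:
    start_year: int
    end_year: int
    weights: dict  # feature name -> weight, assumed learned on this period's data

    def score(self, features):
        """Linear score of a document under this period's model."""
        return sum(w * features.get(name, 0.0) for name, w in self.weights.items())

    def distance(self, year):
        """Years between a document's date and this period (0 if inside it)."""
        if self.start_year <= year <= self.end_year:
            return 0
        return min(abs(year - self.start_year), abs(year - self.end_year))


def combined_score(models, features, doc_year):
    """Blend per-period scores, giving more weight to periods near doc_year."""
    blend = [(1.0 / (1 + m.distance(doc_year)), m.score(features)) for m in models]
    total = sum(w for w, _ in blend)
    return sum(w * s for w, s in blend) / total


# Two hypothetical period models and one document archived in 1999.
models = [
    PeriodModel(1996, 2002, {"term_match": 1.0, "inlinks": 0.1}),
    PeriodModel(2003, 2010, {"term_match": 0.6, "inlinks": 0.5}),
]
doc = {"term_match": 0.8, "inlinks": 0.4}
print(combined_score(models, doc, doc_year=1999))
```

A single-model approach corresponds to using one `PeriodModel` for every document regardless of its date; the blend lets the model for a document's own period dominate while neighboring periods still contribute.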
A Study of the Mediation and Use of Information in the District Archives (Estudo da mediação e do uso da informação nos Arquivos Distritais)
Doctoral thesis in Letters, in the field of Archival and Library Information Science, specializing in Information Services Management, submitted to the Faculdade de Letras da Universidade de Coimbra.
This thesis presents a study of informational mediation in archives from the
perspective of Information Science, delimiting its scope to public archives - District
Archives (ADs) and similar - whose role is decisive in the consolidation and
preservation of national institutional memory as factor of identity and sense of
belonging of a community and in promoting citizenship. The concept of informational
mediation was delimited from the critical review of published literature on the subject.
Also based on the literature review, we characterize the historical background
of the archives and the evolution of national archival policies and their legislative
framework, and we clarify the terminology used within the
scope of the ADs' functions.
The increasingly important role of the information society changes the demand for
information from an ever larger audience with wider and more developed technological
skills. The emerging crisis between the custodial and post-custodial paradigms,
regarding the functions of the archives and the information professionals, requires
modification of the information mediation practices. These practices can no longer be
carried out only through the traditional standard search tools (guides, inventories,
catalogs) but increasingly through knowledge of the archive itself as a public
service, with the archives assuming a new attitude embodied in cultural diffusion,
educational extension, and pedagogical practices. The aim is to
change the perception of the community in relation to the archives and the information
they provide, with the archives making themselves known as collective memory, thus creating new
audiences and generating greater interaction between the archive documentation and
the users, who can then build their own path to access information. The user is also
a mediator of information.
Referring to the reality of the Portuguese ADs, we are interested to know
whether the perception that users have of the mediation of information currently
performed matches their own needs and expectations, whether or not it influences the
process of accessing information, and what perception the heads of
the archives have of the mediation of information they practice.
With this aim we conducted an empirical study with users and the heads of
ADs, using the quadripolar methodology within the Information Science scope.
We created a sample to which two questionnaires, one for users and one for
the heads of the ADs, were applied. To complete the data collection, face-to-face
interviews were also carried out with seven heads of ADs. The analysis of the collected data
made it possible to verify the correspondence between the expectations and practices in the
mediation of information performed by the ADs, immersed in a paradigmatic crisis.
The study aims to contribute to the enlargement of theoretical reflection on
Information Mediation within Information Science in Portugal and help develop a
greater understanding of the practice of this function.
This thesis documents a research study about informational mediation in
archives, from the perspective of Information Science, delimiting its scope to public
archives - District Archives (ADs) and similar. The role of these archives is decisive in
the consolidation and preservation of national institutional memory as a factor of identity
and sense of belonging of a community, as well as in promoting citizenship. The concept of
informational mediation was delimited from the critical review of published literature on
the subject.
Also based on the literature review, a characterization of the archives' historical
background and of the evolution of national archival policies and their legislative framework is
presented. A conceptual clarification of the terminology used within the scope of the
ADs' duties is made.
Within this work we identify and characterize what we consider a paradigmatic
change (from “custodial” to “post-custodial”) derived from the new demands of
information and the new challenges faced by archives and their professionals. These
challenges, derived from the increasingly important role of the information society and
from a larger audience with wider and more developed technological skills, result in
the emerging paradigm crisis, described within this work, and require, in our opinion,
the modification of the information mediation practices. We consider that these practices
currently demand the assumption of the archive itself as a public service, with a new
attitude embodied in cultural diffusion, in educational extension and in pedagogical
practices by the archives. These new practices also require changes and
developments in the tools for information access, which can no longer be carried out only
through the traditional standard search tools (guides, inventories, catalogs).
We argue, in the context of this new paradigm, for a change in the community's
perception of, and relation to, the archives and the information they provide,
promoting their diffusion, creating new audiences and generating greater interaction
between the archive documentation and the users. We argue that users should be
able to build their own path to access information. In this context, the user is also a
mediator of information.
Referring to the reality of the Portuguese ADs, we investigate whether the
perception that users have of the current mediation of information matches their
own needs and expectations and whether it influences the process of accessing
information. We also investigate the perception that the heads of the archives have
of the mediation of information practiced in their archives.
With these objectives we conducted an empirical study with users and heads of
ADs, making use of the quadripolar methodology within the Information Science
scope.
A sample was created making use of two questionnaires, one for users and one
for the heads of the ADs. To complete the data collection, face-to-face interviews were
also carried out with seven heads of ADs. The analysis of the collected data made it
possible to verify the correspondence between the expectations and the real practice of
mediation of information performed by the ADs, immersed in the aforementioned
paradigmatic crisis.
The study aims to contribute to the enlargement of theoretical reflection on
Information Mediation within Information Science in Portugal and help to develop a
deeper understanding of the practice of this function.