    Developing Web Archiving Metadata Best Practices to Meet User Needs

    The OCLC Research Library Partnership Web Archiving Metadata Working Group was established to meet a widely recognized need for best practices for descriptive metadata for archived websites. The Working Group recognizes that developing successful best practices to ensure discoverability requires an understanding of user needs and behavior. We have therefore conducted an extensive literature review to build our knowledge and will issue a white paper summarizing what we have learned. We are also studying existing and emerging approaches to descriptive metadata in this realm and will publish a second report recommending best practices. We will seek broad community input prior to publication.

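    An Evaluation Study of Query Processing Methods Using an Inflected-Form Index of the Finnish Web Archive (Kyselynkäsittelymenetelmien evaluointitutkimus Suomalaisen verkkoarkiston taivutusmuotoindeksiä käyttäen)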

    The rich morphology of Finnish poses challenges for information retrieval. For retrieval to be effective, the word form in the query must be made to match the word form that occurs in the document. This study compares the effectiveness of four query processing methods on an inflected-form index built from documents. Earlier information retrieval evaluation research on Finnish-language material has mainly used test collections built from newspaper article collections. Instead of an article collection, this study uses a test collection built from the Finnish Web Archive (Suomalainen verkkoarkisto), which contains web pages whose content and quality vary greatly. The methods compared in the thesis are Frequent case generation 3 (FCG3), Simple word ending based rule generator (SWERG+), Snowball stemming combined with a wildcard, and unprocessed queries. The research method is testing according to the laboratory model of information retrieval. To carry it out, a test collection had to be built from the Finnish Web Archive. In the end, 16 search topics were selected for the test collection, and query runs were performed with short queries formed from them. The results of the runs were measured by precision at the first ten result documents and by a cumulated gain measure. The study found that the FCG3 method produced better results than the unprocessed queries that served as the baseline. By contrast, the SWERG+ method, which had performed well in earlier research, did not produce better results than the baseline in this study. Snowball stemming combined with a wildcard, in turn, produced results weaker than the baseline.
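
    The two evaluation measures named above can be illustrated with a minimal sketch. It assumes graded relevance judgments stored in a dictionary and shows plain (undiscounted, unnormalized) cumulated gain; the abstract does not say which variant of the measure the study used, and the function names and sample data are hypothetical.

```python
from typing import Dict, List


def precision_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """Fraction of the top-k results judged relevant (grade > 0)."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if relevance.get(doc_id, 0) > 0)
    # Divide by k, not by len(top_k): a short result list is penalized.
    return hits / k


def cumulated_gain(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> List[int]:
    """Running sum of relevance grades down the ranking (plain CG vector)."""
    gains: List[int] = []
    total = 0
    for doc_id in ranked_ids[:k]:
        total += relevance.get(doc_id, 0)  # unjudged documents count as 0
        gains.append(total)
    return gains


# Hypothetical run: four retrieved documents, three judged documents.
ranked = ["d3", "d7", "d1", "d9"]
judgments = {"d3": 2, "d1": 1, "d5": 3}

p_at_4 = precision_at_k(ranked, judgments, k=4)
cg_at_4 = cumulated_gain(ranked, judgments, k=4)
```

    In a comparison such as the one described, each query processing method would produce its own ranking per topic, and the scores would be averaged over the 16 topics.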

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most web archiving research has focused on crawling and preservation activities, with little focus on delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users' needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services, ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in the web archives. The third level is the URI level, which focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. At this level, we define the web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.

    Using Web Archives to Enrich the Live Web Experience Through Storytelling

    Much of our cultural discourse occurs primarily on the Web. Thus, Web preservation is a fundamental precondition for multiple disciplines. Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exist to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in a paradox: the larger the collection, the harder it is to understand. Meanwhile, as the sheer volume of data grows on the Web, storytelling is becoming a popular technique in social media for selecting Web resources to support a particular narrative or story. In this dissertation, we address the problem of understanding the archived collections by proposing the Dark and Stormy Archive (DSA) framework, in which we integrate storytelling social media and Web archives. In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users are already familiar with, such as Storify. To inform our work of generating stories from archived collections, we start by building a baseline for the structural characteristics of popular (i.e., receiving the most views) human-generated stories through investigating stories from Storify. Furthermore, we checked the entire population of Archive-It collections to better understand the characteristics of the collections we intend to summarize. We then filter off-topic pages from the collections, using different methods to detect when an archived page in a collection has gone off-topic. We created a gold standard dataset from three Archive-It collections to evaluate the proposed methods at different thresholds. 
From the gold standard dataset, we identified five behaviors for the TimeMaps (a TimeMap is a list of archived copies of a page) based on the page's aboutness. Based on a dynamic slicing algorithm, we divide the collection and cluster the pages in each slice. We then select the best representative page from each cluster based on different quality metrics (e.g., the replay quality and the quality of the snippet generated from the page). At the end, we put the selected pages in chronological order and visualize them using Storify. To evaluate the DSA framework, we obtained a ground truth dataset of hand-crafted stories from Archive-It collections generated by expert archivists. We used Amazon's Mechanical Turk to evaluate the automatically generated stories against the stories that were created by domain experts. The results show that the stories automatically generated by the DSA are indistinguishable from those created by human subject domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated stories.
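
    The off-topic filtering step can be sketched as a threshold test on textual similarity. The abstract does not name the detection methods, so the following rests on assumptions: each capture in a TimeMap is compared with the first capture using cosine similarity over raw term counts, and a capture is flagged when the score falls below a threshold. The function names, the threshold value, and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def term_vector(text: str) -> Counter:
    """Bag-of-words term counts over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_off_topic(captures: List[str], threshold: float = 0.15) -> List[bool]:
    """Flag each capture whose similarity to the first capture is below threshold.

    Assumption: the first capture is on-topic and serves as the reference.
    """
    if not captures:
        return []
    reference = term_vector(captures[0])
    return [cosine_similarity(reference, term_vector(text)) < threshold
            for text in captures]


# Hypothetical TimeMap texts, oldest first; the last one looks like a parked domain.
timemap_texts = [
    "egypt revolution protest news",
    "egypt protest news update",
    "domain for sale buy now",
]
flags = flag_off_topic(timemap_texts)
```

    Evaluating such a detector against a gold standard at different thresholds, as the abstract describes, would amount to repeating the call with several `threshold` values and comparing the flags with the hand-labeled ones.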

    Information search in web archives

    Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2014. Web archives preserve information that was published on the web or digitized from printed publications. Much of that information is unique and historically valuable. However, users do not have dedicated tools to find the desired information, which hampers the usefulness of web archives. This dissertation investigates solutions towards the advance of web archive information retrieval (WAIR) and contributes to the increase of knowledge about its technology and users. The thesis underlying this work is that the search results can be improved by exploiting temporal information intrinsic to web archives. This temporal information was leveraged from two different angles. First, the long-term persistence of web documents was analyzed and modeled to better estimate their relevance to a query. Second, a temporal-dependent ranking framework that learns and combines ranking models specific to each period was devised. This approach contrasts with a typical single-model approach that ignores the variance of web characteristics over time. The proposed approach was empirically validated through various controlled experiments that demonstrated its superiority over the state of the art in WAIR.

    Study of the Mediation and Use of Information in the District Archives (Estudo da mediação e do uso da informação nos Arquivos Distritais)

    Doctoral thesis in Letters, in the area of Archival and Library Information Science, specializing in Information Services Management, presented to the Faculdade de Letras da Universidade de Coimbra. This thesis presents a study on informational mediation in archives, from the perspective of Information Science, limiting its scope to public archives (District Archives, ADs, and equivalent institutions), whose role is decisive in consolidating and preserving national institutional memory as a factor of identity and sense of belonging for a community, and in promoting citizenship. 
The concept of informational mediation was delimited through a critical review of the published literature on the subject. Also based on the literature review, we characterize the historical background of the archives and the evolution of national archival policies and their legislative framework, and we clarify the terminology used within the scope of the ADs' functions. The increasingly important role of the information society changes the demand for information from an ever larger audience with wider and more developed technological skills. The emerging crisis between the custodial and post-custodial paradigms, regarding the functions of the archives and of information professionals, requires a change in information mediation practices. These practices can no longer be carried out only through the traditional standardized finding aids (guides, inventories, catalogs) but increasingly through knowledge of the archive itself as a public service, with the archives adopting a new attitude embodied in cultural outreach, educational extension, and pedagogical practices. The aim is thus to change the community's perception of the archives and of the information they provide, with the archives making themselves known as collective memory, thereby creating new audiences and generating greater interaction between the holdings and the users, who can then build their own path to access information. The user is also a mediator of information. Referring to the reality of the Portuguese ADs, we are interested in knowing whether users' perception of the information mediation currently performed matches their needs and expectations and whether or not it influences their process of accessing information, as well as in knowing how the heads of the archives perceive the information mediation they practice. 
With this aim we conducted an empirical study with users and heads of ADs, using the quadripolar methodology within the scope of Information Science. We created a sample to which two questionnaires were applied, one for users and one for the heads of the ADs. To complete the data collection, face-to-face interviews were also carried out with seven heads of ADs. The analysis of the collected data made it possible to assess the correspondence between expectations and practices in the information mediation performed by the ADs, which are immersed in a paradigmatic crisis. The study aims to contribute to broadening theoretical reflection on Information Mediation within Information Science in Portugal and to help develop a greater understanding of the practice of this function.
    This thesis documents a research study about informational mediation in archives, from the perspective of Information Science, limiting its scope to public archives (District Archives, ADs, and similar). The role of these archives is decisive in the consolidation and preservation of national institutional memory as a factor of identity and sense of belonging for a community, as well as in promoting citizenship. The concept of informational mediation was delimited through a critical review of the published literature on the subject. Also based on the literature review, a characterization of the archives' historical background and of the evolution of national archival policies and their legislative framework is presented. A conceptual clarification of the terminology used within the scope of the ADs' duties is made. Within this work we identify and characterize what we consider a paradigmatic change (from "custodial" to "post-custodial") derived from the new demands for information and the new challenges faced by archives and their professionals. 
These challenges, derived from the increasingly important role of the information society and from a larger audience with wider and more developed technological skills, result in the emerging paradigm crisis described within this work and require, in our opinion, a change in information mediation practices. We consider that these practices now require the archive to see itself as a public service, with a new attitude embodied in cultural outreach, educational extension, and pedagogical practices. These new practices also require changes and developments in the information access tools, since access can no longer be provided only through the traditional standardized finding aids (guides, inventories, catalogs). We argue, in the context of this new paradigm, for a change in the community's perception of, and relationship with, the archives and the information they provide, promoting their dissemination, creating new audiences, and generating greater interaction between the archive documentation and the users. We argue that users should be able to build their own path to access information. In this context, the user is also a mediator of information. Referring to the reality of the Portuguese ADs, we investigate whether users' perception of the current information mediation matches their needs and expectations and whether or not it influences the process of accessing information. We also investigate how the heads of the archives perceive the information mediation practiced in their archives. With these objectives we conducted an empirical study with users and heads of ADs, using the quadripolar methodology within the scope of Information Science. A sample was created using two questionnaires, one for users and one for the heads of the ADs. To complete the data collection, face-to-face interviews were also carried out with seven heads of ADs. 
The analysis of the collected data made it possible to assess the correspondence between the expectations and the actual practice of information mediation performed by the ADs, which are immersed in the aforementioned paradigmatic crisis. The study aims to contribute to broadening theoretical reflection on Information Mediation within Information Science in Portugal and to help develop a deeper understanding of the practice of this function.