Indexing methods for web archives
There have been numerous recent efforts to digitize previously published content and to preserve born-digital content, leading to the widespread growth of large text repositories. Web archives are such continuously growing text collections, containing versions of documents that span long time periods. Web archives present many opportunities for historical, cultural, and political analyses. Consequently, there is a growing need for tools that can efficiently access and search them.
In this work, we are interested in indexing methods for supporting text-search workloads over web archives, such as time-travel queries and phrase queries. To this end, we make the following contributions:
• Time-travel queries are keyword queries with a temporal predicate, e.g., "mpii saarland" @ [06/2009], which return versions of documents in the past. We introduce a novel index organization strategy, called index sharding, for efficiently supporting time-travel queries without incurring additional index-size blowup. We also propose index-maintenance approaches which scale to such continuously growing collections.
• We develop query-optimization techniques for time-travel queries, called partition selection, which maximize recall at any given query-execution stage.
• We propose indexing methods to support phrase queries, e.g., "to be or not to be that is the question". We index multi-word sequences and devise novel query-optimization methods over the indexed sequences to efficiently answer phrase queries.
We demonstrate the superior performance of our approaches over existing methods by extensive experimentation on real-world web archives.
A cautious partnership: The growing acceptance of folksonomy as a complement to indexing digital images and catalogs
As archives and museums place their photographic collections on the Web, the cost and time of indexing and assigning metadata to these images grows. One potential solution is to allow users to assign metadata to images, a practice known as folksonomy. While detractors label folksonomy as imprecise, sloppy, and overly focused on the needs of individual users, proponents applaud it as being directly tied to users' vocabularies, inexpensive, and a means of directly engaging users. Suggestions for improvements to folksonomy include providing more structure to the tags users can supply, allowing feedback on user supplied tags, and even turning the assignment of metadata into a cooperative online game. Despite limited data on its effectiveness in generating terms relevant to user searches, folksonomy was advocated by the Library of Congress in 2008, and is beginning to be implemented by some libraries as a supplement to their OPAC for users accustomed to searching through Web engines. This paper discusses whether folksonomy can be seen as a substitute for traditional indexing and cataloging methods.
Using semantic indexing to improve searching performance in web archives
The sheer volume of electronic documents being published on the Web can be overwhelming for users if the searching aspect is not properly addressed. This problem is particularly acute inside archives and repositories containing large collections of web resources or, more precisely, web pages and other web objects. With the existing search capabilities in web archives, results can be compromised by the size of the data, content heterogeneity, and changes in scientific terminologies and meanings. During the course of this research, we will explore whether semantic web technologies, particularly ontology-based annotation and retrieval, could improve precision in search results in multi-disciplinary web archives.
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state of the art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.
Multimedia search without visual analysis: the value of linguistic and contextual information
This paper addresses the focus of this special issue by analyzing the potential contribution of linguistic content and other non-image aspects to the processing of audiovisual data. It summarizes the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content and, as a consequence, to improving the effectiveness of conceptual media access tools. A number of techniques are presented, including the time-alignment of textual resources, audio and speech processing, content reduction and reasoning tools, and the exploitation of surface features.
Collaborative tagging as a knowledge organisation and resource discovery tool
Purpose - The purpose of the paper is to provide an overview of the collaborative tagging phenomenon and explore some of the reasons for its emergence. Design/methodology/approach - The paper reviews the related literature and discusses some of the problems associated with, and the potential of, collaborative tagging approaches for knowledge organisation and general resource discovery. A definition of controlled vocabularies is proposed and used to assess the efficacy of collaborative tagging. An exposition of the collaborative tagging model is provided and a review of the major contributions to the tagging literature is presented. Findings - There are numerous difficulties with collaborative tagging systems (e.g. low precision, lack of collocation, etc.) that originate from the absence of properties that characterise controlled vocabularies. However, such systems cannot be dismissed. Librarians and information professionals have lessons to learn from the interactive and social aspects exemplified by collaborative tagging systems, as well as their success in engaging users with information management. The future co-existence of controlled vocabularies and collaborative tagging is predicted, with each appropriate for use within distinct information contexts: formal and informal. Research limitations/implications - Librarians and information professional researchers should be playing a leading role in research aimed at assessing the efficacy of collaborative tagging in relation to information storage, organisation, and retrieval, and to influence the future development of collaborative tagging systems. Practical implications - The paper indicates clear areas where digital libraries and repositories could innovate in order to better engage users with information. Originality/value - At the time of writing there were no literature reviews summarising the main contributions to the collaborative tagging research or debate.
The Christian Reformed Church Periodical Index: A Local Solution to Indexing Periodicals
This article describes the creation of a web-based database that indexes less well-known periodical titles of importance to scholars in the Christian Reformed Church, and generally not covered by other indexing services. The author explains how the data from the index, originally stored in a card catalog, were moved online to a text-based system, and eventually into their present form in a web-based system. The article highlights some of the challenges that were overcome in creating this resource and briefly describes how the data are stored and retrieved in the web environment, how they are searched and presented to the researcher, and the methods used to keep the database current.