10 research outputs found

    On the interplay between search behavior and collections in digital libraries and archives

    Log analysis is an unobtrusive technique used to better understand search behavior and evaluate search systems. However, in contrast with open web search, in a vertical search system such as a digital library or media archive the collection is known and central to its purpose. This drives different, more collection-oriented questions when studying the logs, for example whether users need different support in different parts of the collection. In a digital library, the collection is categorized using professionally curated metadata. We conjecture that using this metadata can improve and extend the methods and techniques for log analysis. We investigate how to identify different types of search behavior using the metadata explicitly, how to explain and predict user interactions for the different types of behavior found, and finally how to communicate our research results to domain experts.
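    As a minimal illustration of this collection-oriented reading of a log (all field names, categories, and records below are hypothetical, not the thesis's data), one can join each logged click with the curated metadata category of the clicked item and compare a simple behavioural statistic per part of the collection:

        # Sketch: group logged interactions by the metadata category of the
        # clicked item, then compare behaviour across parts of the collection.
        from collections import defaultdict
        from statistics import mean

        log = [  # hypothetical records: (session_id, query, clicked_item_id)
            ("s1", "civil war letters", "doc7"),
            ("s1", "letters 1864", "doc9"),
            ("s2", "impressionist landscapes", "doc3"),
        ]
        metadata = {"doc7": "manuscripts", "doc9": "manuscripts", "doc3": "paintings"}

        query_lengths = defaultdict(list)
        for session_id, query, item_id in log:
            query_lengths[metadata[item_id]].append(len(query.split()))

        for category, lengths in query_lengths.items():
            # mean query length as a crude proxy for search tactics
            print(category, "mean query length:", mean(lengths))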

    A Content-Addressable Network for Similarity Search in Metric Spaces

    Because of the ongoing digital data explosion, more advanced search paradigms than the traditional exact match are needed for content-based retrieval in huge and ever-growing collections of data produced in application areas such as multimedia, molecular biology, marketing, computer-aided design, and purchasing assistance. As the variety of data types grows and databases are increasingly used by ordinary people, computer systems must be able to model fundamental human reasoning paradigms, which are naturally based on similarity. The ability to perceive similarities is crucial for recognition, classification, and learning, and it plays an important role in scientific discovery and creativity. Recently, the mathematical notion of metric space has become a useful abstraction of similarity, and many similarity search indexes have been developed. In this thesis, we adopt the metric space similarity paradigm and concentrate on scalability issues. By exploiting computer networks and applying Peer-to-Peer communication paradigms, we build a structured network of computers able to process similarity queries in parallel. Since no centralized entities are used, such architectures are fully scalable. Specifically, we propose a Peer-to-Peer system for similarity search in metric spaces called Metric Content-Addressable Network (MCAN), an extension of the well-known Content-Addressable Network (CAN) used for hash lookup. A prototype implementation of MCAN was tested on real-life datasets of image features, protein symbols, and text, and the observed results are reported. We also compare the performance of MCAN with three other recently proposed distributed data structures for similarity search in metric spaces.
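    The core trick that lets a CAN-style overlay host metric data can be sketched briefly; the pivots and the edit-distance metric below are illustrative choices, not the thesis's exact configuration. Each object is mapped to its vector of distances to N fixed pivots, and the triangle inequality then yields a lower bound that lets peers prune zones that cannot contain answers to a range query:

        # Sketch of pivot-based mapping behind MCAN-style partitioning. By the
        # triangle inequality, |d(o, p_i) - d(q, p_i)| <= d(o, q), so a range
        # query R(q, r) can skip any object (or CAN zone) whose image lies
        # outside the hypercube of side 2r around the query's image.
        def edit_distance(a: str, b: str) -> int:
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]

        pivots = ["protein", "search"]  # N = 2 reference objects chosen up front

        def image(obj: str) -> tuple:
            # coordinates of the object in the overlay's N-dimensional space
            return tuple(edit_distance(obj, p) for p in pivots)

        def may_qualify(obj: str, query: str, radius: int) -> bool:
            # pivot-based lower bound on d(obj, query); False means safe to prune
            return all(abs(o - q) <= radius for o, q in zip(image(obj), image(query)))

        print(image("searching"), may_qualify("searching", "search", 3))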

    Seventh Biennial Report: June 2003 - March 2005


    The SphereSearch Engine for Graph-Based Search on Heterogeneous, Semi-Structured Data

    This thesis presents the novel SphereSearch Engine, which provides unified ranked retrieval on heterogeneous XML and Web data. Its search capabilities include vague structure and text content conditions, and relevance ranking based on IR statistics and a graph-based data model. Web pages in HTML or PDF are automatically converted into an intermediate XML format, with the option of generating semantic tags by means of linguistic annotation tools. For semi-structured data, the graph-based query engine is leveraged to provide very rich search options that cannot be expressed in traditional Web or XML search engines: concept-aware and link-aware querying that takes into account the implicit structure and context of Web pages. The benefits of the SphereSearch Engine are demonstrated by experiments on several document corpora: a large, richly tagged but non-schematic open encyclopedia extended with external documents, and a standard XML benchmark.
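    The sphere idea behind the ranking can be pictured with a small sketch; the toy graph, per-node scores, and decay constant below are illustrative stand-ins, not the engine's actual statistics. A candidate node accumulates the IR scores of query matches found in spheres of increasing graph distance around it, damped per hop:

        # Sketch: score a node by summing damped match scores over spheres of
        # increasing graph distance (BFS hops) around it.
        from collections import deque

        graph = {  # toy document graph: pages/elements and their links
            "page1": ["sec1", "sec2"],
            "sec1": ["page1"],
            "sec2": ["page1", "page2"],
            "page2": ["sec2"],
        }
        node_score = {"sec1": 0.8, "page2": 0.5}  # per-node IR scores for the query
        DECAY = 0.5  # damping per hop (illustrative constant)

        def sphere_score(start: str, max_dist: int = 2) -> float:
            seen, frontier, total = {start}, deque([(start, 0)]), 0.0
            while frontier:
                node, dist = frontier.popleft()
                total += node_score.get(node, 0.0) * DECAY ** dist
                if dist < max_dist:
                    for neighbour in graph[node]:
                        if neighbour not in seen:
                            seen.add(neighbour)
                            frontier.append((neighbour, dist + 1))
            return total

        print(sphere_score("page1"))  # 0.8 * 0.5 + 0.5 * 0.25 = 0.525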

    Clustering-based Labelling Scheme: A Hybrid Approach for Efficient Querying and Updating XML Documents

    Extensible Markup Language (XML) has become a dominant technology for transferring data across the World Wide Web. XML labelling schemes play a key role in handling XML data efficiently and robustly, and many such schemes have been proposed. However, existing labelling schemes have limitations and shortcomings, so the aim of this research was to investigate them in order to address the efficiency of XML query performance. This thesis investigated the existing labelling schemes and classified them into three categories based on certain criteria, in order to identify their limitations and challenges. Based on the outcomes of this investigation, this thesis proposed a state-of-the-art labelling scheme, called the clustering-based labelling scheme, to resolve or mitigate key limitations such as the efficiency of XML query processing, the cost of labelling XML nodes, and the cost of XML updates. This thesis argued that labelling nodes with certain existing labelling schemes, combined with clustering-based techniques, can improve query and node-labelling efficiency. The proposed scheme divides the nodes of an XML document into clusters; two existing labelling schemes, Dewey and LLS, were selected for labelling these clusters and their nodes. The proposed scheme was then designed and implemented, and the Dewey and LLS labelling schemes were implemented as baselines for its evaluation. Four experiments were designed to test the proposed scheme against the Dewey and LLS labelling schemes. The results of these experiments suggest that the proposed scheme achieved better results than the Dewey and LLS schemes. Consequently, the research hypothesis was accepted overall, with few exceptions, and the proposed scheme showed an improvement in performance and in all the targeted features and aspects.
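    For readers unfamiliar with Dewey labels, the following sketch (a simplified illustration of the classic Dewey scheme, not the proposed clustering-based scheme itself) shows how each node's label extends its parent's label with a sibling position, so that ancestor-descendant tests reduce to prefix checks:

        # Sketch: assign Dewey labels by depth-first document order over a toy
        # XML tree, then test ancestry with a simple prefix check.
        tree = ("bib", [("book", [("title", []), ("author", [])]),
                        ("article", [("title", [])])])

        def assign(node, prefix, out):
            name, children = node
            out.append((prefix, name))
            for i, child in enumerate(children, 1):
                assign(child, f"{prefix}.{i}", out)

        labels = []
        assign(tree, "1", labels)
        print(labels)
        # [('1', 'bib'), ('1.1', 'book'), ('1.1.1', 'title'), ('1.1.2', 'author'),
        #  ('1.2', 'article'), ('1.2.1', 'title')]

        def is_ancestor(ancestor: str, descendant: str) -> bool:
            # '1.1' is an ancestor of '1.1.2' because it is a dotted prefix
            return descendant.startswith(ancestor + ".")

        print(is_ancestor("1.1", "1.1.2"), is_ancestor("1.1", "1.2.1"))  # True False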

    The anatomy of a search and mining system for digital humanities: Search And Mining Tools for Language Archives (SAMTLA)

    Humanities researchers are faced with an overwhelming volume of digitised primary source material, and of "born digital" information, relevant to their research as a result of large-scale digitisation projects. Current digital tools do not provide consistent support for analysing the content of digital archives that are potentially large in scale, multilingual, and stored in a range of data formats. The prevailing language-dependent, or project-specific, approach to tool development often puts the tools out of reach for many research disciplines in the humanities. In addition, the tools can be incompatible with the way researchers locate and compare relevant sources. For instance, researchers are interested in shared structural text patterns, known as "parallel passages", that describe a specific cultural, social, or historical context relevant to their research topic. Identifying these shared structural text patterns is challenging due to their repeated yet highly variable nature, which results from differences in domain, author, language, time period, and orthography.

    The contribution of the thesis is a novel infrastructure that directly addresses the need for generic, flexible, extendable, and sustainable digital tools applicable to a wide range of digital archives and research in the humanities. The infrastructure adopts as its underlying data model a character-level n-gram Statistical Language Model (SLM) stored in a space-optimised k-truncated suffix tree. A character-level n-gram model is a relatively new approach that is competitive with word-level n-gram models, but has the added advantage of being domain- and language-independent, requiring little or no preprocessing of the document text, unlike word-level models that require some form of language-dependent tokenisation and stemming. Character-level n-grams capture word-internal features that word-level n-gram models ignore, which provides greater flexibility in addressing the information need of the user through tolerant search, and compensates for erroneous query specification or spelling errors in the document text. Furthermore, the SLM provides a unified approach to information retrieval and text mining, where traditional approaches have tended to adopt separate data models that are often ad hoc or based on heuristic assumptions. The performance of the character-level n-gram SLM was formally evaluated through crowdsourcing, which demonstrated that its retrieval performance is close to human-level performance.

    The proposed infrastructure supports Samtla (Search And Mining Tools for Language Archives), which provides humanities researchers with digital tools for search, browsing, and text mining of digital archives in any domain or language, within a single system. Samtla supersedes many of the existing tools for humanities researchers by supporting the same or similar functionality, but with a domain-independent and language-independent approach. The functionality includes a browsing tool constructed from the metadata and named entities extracted from the document text, and a hybrid recommendation system for suggesting related queries and documents. Some tools are novel and were developed in response to the specific needs of the researchers, such as the document comparison tool for visualising shared sequences between groups of related documents. Furthermore, Samtla is the first practical example of a system with an SLM as its primary data model that supports the real research needs of several case studies covering different areas of research in the humanities.
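    The tolerant-search property of a character-level n-gram SLM can be shown with a minimal sketch; the add-one smoothing and toy corpus below are simplifications, not Samtla's suffix-tree-backed model. Documents are ranked by the probability their character n-gram model assigns to the query, which degrades gracefully under spelling variation instead of failing an exact match:

        # Sketch: rank documents by the log-probability their character trigram
        # model assigns to the query (add-one smoothing; illustrative only).
        from collections import Counter
        from math import log

        N = 3  # character trigrams

        def ngrams(text: str):
            padded = "_" * (N - 1) + text.lower()
            return [padded[i:i + N] for i in range(len(padded) - N + 1)]

        def score(query: str, counts: Counter) -> float:
            total = sum(counts.values())
            vocab = len(counts) + 1  # crude vocabulary estimate for smoothing
            return sum(log((counts[g] + 1) / (total + vocab)) for g in ngrams(query))

        docs = {"d1": "the anatomie of melancholy", "d2": "modern search engines"}
        models = {d: Counter(ngrams(t)) for d, t in docs.items()}

        # 'anatomy' is misspelled relative to d1's 'anatomie', yet shared
        # trigrams ('ana', 'nat', 'ato', ...) still rank d1 first.
        print(sorted(models, key=lambda d: score("anatomy", models[d]), reverse=True))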
