
    Comparing “parallel passages” in digital archives

    Purpose: The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of “parallel passages” stored in historic and cultural heritage digital archives. Design/methodology/approach: The authors explore a novel and relatively simple approach, using a character-based statistical language model combined with a tailored version of the Basic Local Alignment Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents. Findings: The approach is applicable to a wide range of languages, and compensates for variability in the text of the documents as a result of differences in dialect, authorship, language change over time, inaccurate transcriptions, and optical character recognition errors introduced by the digitisation process. Research limitations/implications: A number of case studies demonstrate that the approach is practical and generalisable to a wide range of archives with documents in different languages, domains and of varying quality. Practical implications: The approach described can be applied to any digital archive of modern and contemporary texts. This makes the approach applicable to digital archives recording historic texts, but also to those composed of more recent news articles, for example. Social implications: The analysis of “parallel passages” enables researchers to quantify the presence and extent of text reuse in a collection of documents, which can provide useful data on author style, text genres and cultural contexts. Originality/value: The approach is novel and addresses a need by humanities researchers for tools that can identify similar documents, and local similarities represented by shared text sequences, in a potentially vast archive of documents. As far as the authors are aware, no existing tools provide the same level of tolerance to the language of the documents.
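    As a rough illustration of the underlying idea, the sketch below (not the authors' implementation; the thresholds, alignment parameters and toy documents are assumptions) uses shared character n-grams to find candidate document pairs and a Smith-Waterman-style local alignment over characters to score a shared passage, which tolerates spelling, dialect and OCR differences.

```python
# Minimal sketch: character n-gram shingling to find candidate document
# pairs, followed by a simple Smith-Waterman-style local alignment score
# over characters (parameters and data are illustrative assumptions).

def char_ngrams(text, n=5):
    """Return the set of overlapping character n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def candidate_pairs(docs, n=5, min_shared=5):
    """Return document pairs sharing at least `min_shared` character n-grams."""
    pairs = []
    ids = list(docs)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = char_ngrams(docs[a], n) & char_ngrams(docs[b], n)
            if len(shared) >= min_shared:
                pairs.append((a, b, len(shared)))
    return pairs

def local_align(s, t, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score, tolerant of OCR/spelling noise."""
    score = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = score[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch)
            score[i][j] = max(0, diag, score[i-1][j] + gap, score[i][j-1] + gap)
            best = max(best, score[i][j])
    return best

docs = {"A": "the quick brovvn fox jumps ouer the lazy dog",
        "B": "a quick brown fox jumps over a lazy hound"}
print(candidate_pairs(docs))          # candidate pairs with shared n-gram counts
print(local_align(docs["A"], docs["B"]))  # alignment score despite OCR-like noise
```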

    Recognising Intertextuality in the Digital Corpus of Finnic Oral Poetry : Experiment with the Sampo Cycle

    While digital corpora have enabled new perspectives on the variation and continuums of human communication, they often pose problems related to implicit biases in the data and the limited reach of current methods in recognising similarity in linguistically complex data, especially in small languages. The digital corpus of historical Finnic oral poetry in alliterative tetrametre is characterised by significant poetic, linguistic and orthographic variation. At the extreme, a word may be written in hundreds of different ways. The current corpus comprises 189,189 poetic texts in six Finnic languages (Karelian, Ingrian, Votic, Estonian, Seto and Finnish) recorded between 1564 and 1957 by 5,287 recorders. It has a long curation history and a significant bias towards the genres, poetic forms and regions that collectors have preferred. In this poetic tradition, an idea is typically expressed with several parallel, partly alternative poetic lines or motifs, and similar verse types may be used in different contexts. A manual attempt to find all the occurrences of widely used expressions or motifs in the corpus is an unattainable task. While digital tools, ranging from simple queries to more advanced methods, make it possible to aim at wider intertextual analyses, some of the relevant material is typically not reached. Thus, it becomes central to estimate the amount and quality of the relevant data that is not recognised by different methods. Here, we discuss two strategies for mapping intertextuality in the corpus: (1) proceeding with text queries and (2) recognising similar poetic lines computationally, based on string similarity. We compare these approaches with one another, and then compare the results they yield with the existing type index and the results of manual early 20th-century research. While the methodological and theoretical foundations of that type of research no longer hold, and while our further interest lies in intertextuality and variation rather than in the problematic concept of poem types, parts of the earlier analyses may be used in evaluating the performance of digital approaches.
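    The second strategy can be illustrated with a minimal sketch of string-similarity matching over verse lines. The similarity measure, threshold and verse lines below are assumptions for illustration only, not the measure used in the study; difflib's ratio stands in for whichever similarity the authors apply.

```python
# Illustrative sketch: grouping orthographically variable poetic lines by
# string similarity, using difflib's ratio as a stand-in similarity score.
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two verse lines, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_variants(query_line, corpus_lines, threshold=0.75):
    """Return corpus lines similar enough to the query to count as variants."""
    return [(line, round(similarity(query_line, line), 2))
            for line in corpus_lines
            if similarity(query_line, line) >= threshold]

# Invented orthographic variants of a poetic line, plus an unrelated line.
lines = ["Vaka vanha Väinämöinen",
         "vaka wanha Wäinämöinen",
         "Itse seppo Ilmarinen"]
print(find_variants("Vaka vanha Väinämöinen", lines))
```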

    The anatomy of a search and mining system for digital humanities : Search And Mining Tools for Language Archives (SAMTLA)

    Humanities researchers are faced with an overwhelming volume of digitised primary source material, and "born digital" information, of relevance to their research as a result of large-scale digitisation projects. Current digital tools do not provide consistent support for analysing the content of digital archives that are potentially large in scale, multilingual, and come in a range of data formats. The current language-dependent, or project-specific, approach to tool development often puts the tools out of reach for many research disciplines in the humanities. In addition, the tools can be incompatible with the way researchers locate and compare the relevant sources. For instance, researchers are interested in shared structural text patterns, known as "parallel passages", that describe a specific cultural, social, or historical context relevant to their research topic. Identifying these shared structural text patterns is challenging due to their repeated yet highly variable nature, a result of differences in domain, author, language, time period, and orthography. The contribution of the thesis is a novel infrastructure that directly addresses the need for generic, flexible, extendable, and sustainable digital tools applicable to a wide range of digital archives and research in the humanities. The infrastructure adopts a character-level n-gram Statistical Language Model (SLM), stored in a space-optimised k-truncated suffix tree, as its underlying data model. A character-level n-gram model is a relatively new approach that is competitive with word-level n-gram models, but has the added advantage that it is domain- and language-independent, requiring little or no preprocessing of the document text, unlike word-level models, which require some form of language-dependent tokenisation and stemming. Character-level n-grams capture word-internal features that are ignored by word-level n-gram models, which provides greater flexibility in addressing the information need of the user through tolerant search and compensation for erroneous query specification or spelling errors in the document text. Furthermore, the SLM provides a unified approach to information retrieval and text mining, where traditional approaches have tended to adopt separate data models that are often ad hoc or based on heuristic assumptions. The performance of the character-level n-gram SLM was formally evaluated through crowdsourcing, which demonstrates that its retrieval performance is close to human-level performance. The proposed infrastructure supports the development of Samtla (Search And Mining Tools for Language Archives), which provides humanities researchers with digital tools for search, browsing, and text mining of digital archives in any domain or language, within a single system. Samtla supersedes many of the existing tools for humanities researchers by supporting the same or similar functionality, but with a domain-independent and language-independent approach. The functionality includes a browsing tool constructed from the metadata and named entities extracted from the document text, and a hybrid recommendation system for recommending related queries and documents. Other tools are novel and were developed in response to the specific needs of the researchers, such as the document comparison tool for visualising shared sequences between groups of related documents. Furthermore, Samtla is the first practical example of a system with an SLM as its primary data model that supports the real research needs of several case studies covering different areas of research in the humanities.
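    To make the data model concrete, the sketch below shows a character-level n-gram language model in miniature. It is not the thesis implementation: a plain dictionary of n-gram counts stands in for the k-truncated suffix tree, and the add-one smoothing and class name are assumptions. It does show why character-level modelling tolerates misspelled queries without tokenisation or stemming.

```python
# A minimal character-level n-gram language model. A Counter of n-gram
# counts stands in for the space-optimised k-truncated suffix tree; the
# smoothing scheme is an illustrative assumption.
from collections import Counter
from math import log

class CharNgramLM:
    def __init__(self, text, n=3):
        self.n = n
        padded = " " * (n - 1) + text
        self.context = Counter(padded[i:i + n - 1] for i in range(len(padded) - n + 2))
        self.ngram = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
        self.vocab = len(set(text)) or 1

    def logprob(self, query):
        """Add-one smoothed log-probability of a query string under the model."""
        padded = " " * (self.n - 1) + query
        lp = 0.0
        for i in range(len(query)):
            gram = padded[i:i + self.n]
            ctx = gram[:-1]
            lp += log((self.ngram[gram] + 1) / (self.context[ctx] + self.vocab))
        return lp

doc = CharNgramLM("parallel passages shared between groups of documents")
# Tolerant matching: a misspelled query still scores far closer to the
# source document than an unrelated query does.
print(doc.logprob("paralel pasages"))
print(doc.logprob("musical annotation"))
```

    In a retrieval setting, each document would hold such a model and documents would be ranked by the query's log-probability, which is how an SLM unifies search and mining over the same counts.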

    A model for annotating musical versions and arrangements across multiple documents and media

    We present a model for the annotation of musical works, where the annotations are created with respect to a conceptual abstraction of the music instead of directly to concrete encodings. This supports musicologists in constructing arguments about musical elements that occur in multiple digital library sources (or other web resources), that recur across a work, or that appear in different forms in different arrangements. It provides a way of discussing musical content without tying that discourse to the location, notation or medium of the content, allowing evidence from multiple libraries and in different formats to be brought together to support musicological assertions. The model is implemented in Linked Data and illustrated in a prototype application in which musicologists annotate vocal arrangements of the Allegretto from Beethoven's Seventh Symphony drawn from multiple sources.
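    A hedged sketch of the idea in RDF is given below, using the W3C Web Annotation vocabulary and the rdflib library. The namespaces, property names (e.g. ex:manifestedIn) and resources are illustrative assumptions, not the paper's actual ontology: the annotation targets an abstract musical element rather than any one encoding, and the element is separately linked to concrete manifestations in different libraries.

```python
# Illustrative Linked Data sketch: annotate an abstract musical element,
# then link that element to concrete sources (URIs here are hypothetical).
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

OA = Namespace("http://www.w3.org/ns/oa#")   # Web Annotation vocabulary
EX = Namespace("http://example.org/")        # hypothetical namespace

g = Graph()
element = EX["allegretto-theme"]             # conceptual abstraction of the music
annotation = EX["annotation-1"]

# The annotation is attached to the abstract element, not to an encoding.
g.add((annotation, RDF.type, OA.Annotation))
g.add((annotation, OA.hasTarget, element))
g.add((annotation, OA.hasBody, Literal("Theme recurs in the vocal arrangement")))

# The abstract element is linked to concrete sources in different libraries.
g.add((element, EX.manifestedIn, URIRef("http://library-a.example/score.mei")))
g.add((element, EX.manifestedIn, URIRef("http://library-b.example/score.xml")))

print(g.serialize(format="turtle"))
```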

    Using term clouds to represent segment-level semantic content of podcasts

    Spoken audio, like any time-continuous medium, is notoriously difficult to browse or skim without the support of an interface providing semantically annotated jump points to signal the user where to listen in. Creation of time-aligned metadata by human annotators is prohibitively expensive, motivating the investigation of representations of segment-level semantic content based on transcripts generated by automatic speech recognition (ASR). This paper examines the feasibility of using term clouds to provide users with a structured representation of the semantic content of podcast episodes. Podcast episodes are visualized as a series of sub-episode segments, each represented by a term cloud derived from its ASR transcript. The quality of segment-level term clouds is measured quantitatively, and their utility is investigated in a small-scale user study based on human-labeled segment boundaries. Since the segment-level clouds generated from ASR transcripts prove useful, we examine an adaptation of text tiling techniques to speech, in order to generate segments as part of a completely automated indexing and structuring system for browsing spoken audio. Results demonstrate that the segments generated are comparable with human-selected segment boundaries.
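    A minimal sketch of segment-level term-cloud construction is shown below. It assumes simple whitespace tokenisation and raw TF-IDF weighting over an episode's segments; the weighting scheme and the invented transcript snippets are not taken from the paper.

```python
# Illustrative sketch: rank terms for a segment-level term cloud by TF-IDF
# computed over the segments of one episode (no stopword removal or
# stemming, to keep the example self-contained).
from collections import Counter
from math import log

def term_cloud(segments, segment_index, top_k=5):
    """Return the top-k TF-IDF weighted terms for one ASR-transcribed segment."""
    tokenized = [seg.lower().split() for seg in segments]
    df = Counter(term for seg in tokenized for term in set(seg))
    n = len(tokenized)
    tf = Counter(tokenized[segment_index])
    weights = {term: count * log(n / df[term]) for term, count in tf.items()}
    return sorted(weights, key=weights.get, reverse=True)[:top_k]

# Invented ASR-like transcript snippets for three sub-episode segments.
segments = [
    "today we talk about rust memory safety and the borrow checker",
    "the borrow checker enforces ownership rules at compile time",
    "next week we interview a guest about open source licensing",
]
print(term_cloud(segments, 1))   # terms most distinctive of the second segment
```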

    Jaina Tantra : SOAS Jaina Studies Workshop 2015


    DARIAH and the Benelux


    Generation of Oligodendrocyte Progenitor Cells From Mouse Bone Marrow Cells.

    Oligodendrocyte progenitor cells (OPCs) are a subtype of glial cells responsible for myelin regeneration. Oligodendrocytes (OLGs) originate from OPCs and are the myelinating cells in the central nervous system (CNS). OLGs play an important role in the context of lesions in which myelin loss occurs. Even though many protocols for isolating OPCs have been published, their cellular yield remains a limiting factor for clinical application. The protocol proposed here is novel and has practical value; in fact, OPCs can be generated from a source of autologous cells without gene manipulation. Our method is a rapid, high-efficiency differentiation protocol for generating mouse OLGs from bone marrow-derived cells using growth-factor-defined media. With this protocol, it is possible to obtain mature OLGs in 7-8 weeks. Neurospheres form within 2-3 weeks of bone marrow (BM) isolation, and the cells differentiate into Nestin+ Sox2+ neural stem cells (NSCs) by around day 30. OPC-specific markers start to be expressed around day 38, followed by RIP+ O4+ markers around day 42. CNPase+ mature OLGs are finally obtained around 7-8 weeks. Furthermore, bone marrow-derived OPCs exhibited a therapeutic effect in shiverer (Shi) mice, promoting myelin regeneration and reducing tremor. Here, we propose a method by which OLGs can be generated starting from BM cells and have abilities similar to those of subventricular zone (SVZ)-derived cells. This protocol significantly decreases the time and cost of OLG differentiation, yielding OLGs within 2 months of culture.

    Is searching full text more effective than searching abstracts?

    Background: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE® abstracts, full-text articles, and spans (paragraphs) within full-text articles, using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: bm25 and the ranking algorithm implemented in the open-source Lucene search engine. Results: Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that the highest overall effectiveness may be achieved by combining evidence from spans and full articles. Conclusion: Users searching full text are more likely to find relevant articles than those searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
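    The contrast between indexing units can be sketched with a compact BM25 scorer, applied once to a whole article and once to its paragraph-sized spans. This is not the paper's evaluation code (which used bm25 and Lucene on the TREC 2007 genomics data); the documents, query and parameters below are assumptions.

```python
# Hedged sketch: score the same query against a whole article and against
# its paragraph-sized spans with a standard BM25 weighting.
from collections import Counter
from math import log

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return a BM25 score for each document in `docs` given the query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Invented toy article: a relevant paragraph followed by an unrelated one.
article = ("gene expression profiling identifies tumour subtypes "
           "buffer preparation and centrifuge settings for the assay")
spans = ["gene expression profiling identifies tumour subtypes",
         "buffer preparation and centrifuge settings for the assay"]

print(bm25_scores("gene expression", [article]))  # whole article as one indexing unit
print(bm25_scores("gene expression", spans))       # paragraph-sized spans as units
```

    Span-level indexing lets the relevant paragraph rank on its own merits rather than being diluted by the rest of the article, which is the effect the experiments above measure at scale.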