18 research outputs found

    Reflections on Infrastructures for Mining Nineteenth-Century Newspaper Data

    In this study we compare and contrast our experiences (as historians and as digital humanities and information studies researchers) of seeking to mine large-scale historical datasets via university-based, high-performance computing infrastructures versus our experiences of using external, cloud-hosted platforms and tools to mine the same data. In particular, we reflect on our recent experiences in two large transnational digital humanities projects: Asymmetrical Encounters: E-Humanity Approaches to Reference Cultures in Europe, 1815–1992, which was funded by a Humanities in the European Research Area grant (2013–2016), and Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories 1840–1914, which was funded through the Transatlantic Partnership for Social Sciences and Humanities 2016 Digging into Data Challenge (2017–2019). As part of the research for both these projects we sought to mine the OCR text of nineteenth-century historical newspapers that had been mounted on UCL’s High Performance Computing Infrastructures from Gale’s TDM drives. We compare and contrast our experiences of this with our subsequent experiences of performing comparable tasks via Gale Digital Scholar Lab. We contextualise our experiences and observations within wider discourses and recommendations about infrastructural support for humanities-led analyses of large datasets and discuss the advantages and drawbacks of both approaches. We situate our discussions in the aforementioned infrastructural scenarios with reflections on the human experiences of undertaking this research, which represents a step change for many of those who work in the (digital) humanities. Finally, we conclude by discussing the public and private sector research investments that are needed to support further developments and to facilitate access to and critical interrogation of large-scale digital archive

    Of global reach yet of situated contexts: an examination of the implicit and explicit selection criteria that shape digital archives of historical newspapers

    A large literature addresses the processes, circumstances and motivations that have given rise to archives. These questions are increasingly being asked of digital archives, too. Here, we examine the complex interplay of institutional, intellectual, economic, technical, practical and social factors that have shaped decisions about the inclusion and exclusion of digitised newspapers in and from online archives. We do so by undertaking and analysing a series of semi-structured interviews conducted with public and private providers of major newspaper digitisation programmes. Our findings contribute to emerging understandings of factors that are rarely foregrounded or highlighted, yet fundamentally shape the depth and scope of digital cultural heritage archives and thus the questions that can be asked of them, now and in the future. Moreover, we draw attention to providers’ emphasis on meeting the needs of their end-users and how this is shaping the form and function of digital archives. The end user is not often emphasised in the wider literature on archival studies and we thus draw attention to the potential merit of this vector in future studies of digital archives

    Species-level functional profiling of metagenomes and metatranscriptomes.

    Functional profiles of microbial communities are typically generated using comprehensive metagenomic or metatranscriptomic sequence read searches, which are time-consuming, prone to spurious mapping, and often limited to community-level quantification. We developed HUMAnN2, a tiered search strategy that enables fast, accurate, and species-resolved functional profiling of host-associated and environmental communities. HUMAnN2 identifies a community's known species, aligns reads to their pangenomes, performs translated search on unclassified reads, and finally quantifies gene families and pathways. Relative to pure translated search, HUMAnN2 is faster and produces more accurate gene family profiles. We applied HUMAnN2 to study clinal variation in marine metabolism, ecological contribution patterns among human microbiome pathways, variation in species' genomic versus transcriptional contributions, and strain profiling. Further, we introduce 'contributional diversity' to explain patterns of ecological assembly across different microbial community types

    The sequences of 150,119 genomes in the UK Biobank

    Detailed knowledge of how diversity in the sequence of the human genome affects phenotypic diversity depends on a comprehensive and reliable characterization of both sequences and phenotypic variation. Over the past decade, insights into this relationship have been obtained from whole-exome sequencing or whole-genome sequencing of large cohorts with rich phenotypic data(1,2). Here we describe the analysis of whole-genome sequencing of 150,119 individuals from the UK Biobank(3). This constitutes a set of high-quality variants, including 585,040,410 single-nucleotide polymorphisms, representing 7.0% of all possible human single-nucleotide polymorphisms, and 58,707,036 indels. This large set of variants allows us to characterize selection based on sequence variation within a population through a depletion rank score of windows along the genome. Depletion rank analysis shows that coding exons represent a small fraction of regions in the genome subject to strong sequence conservation. We define three cohorts within the UK Biobank: a large British Irish cohort, a smaller African cohort and a South Asian cohort. A haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals. We identified 895,055 structural variants and 2,536,688 microsatellites, groups of variants typically excluded from large-scale whole-genome sequencing studies. Using this formidable new resource, we provide several examples of trait associations for rare variants with large effects not found previously through studies based on whole-exome sequencing and/or imputation

    A new MRI rating scale for progressive supranuclear palsy and multiple system atrophy: validity and reliability

    AIM To evaluate a standardised MRI acquisition protocol and a new image rating scale for disease severity in patients with progressive supranuclear palsy (PSP) and multiple system atrophy (MSA) in a large multicentre study. METHODS The MRI protocol consisted of two-dimensional sagittal and axial T1, axial PD, and axial and coronal T2 weighted acquisitions. The 32 item ordinal scale evaluated abnormalities within the basal ganglia and posterior fossa, blind to diagnosis. Among 760 patients in the study population (PSP = 362, MSA = 398), 627 had per protocol images (PSP = 297, MSA = 330). Intra-rater (n = 60) and inter-rater (n = 555) reliability were assessed through Cohen's κ statistic, and scale structure through principal component analysis (PCA) (n = 441). Internal consistency and reliability were checked. Discriminant and predictive validity of extracted factors and total scores were tested for disease severity as per clinical diagnosis. RESULTS Intra-rater and inter-rater reliability were acceptable for 25 (78%) of the items scored (κ ≥ 0.41). PCA revealed four meaningful clusters of covarying parameters (factor (F) F1: brainstem and cerebellum; F2: midbrain; F3: putamen; F4: other basal ganglia) with good to excellent internal consistency (Cronbach α 0.75-0.93) and moderate to excellent reliability (intraclass coefficient: F1: 0.92; F2: 0.79; F3: 0.71; F4: 0.49). The total score significantly discriminated for disease severity or diagnosis; factorial scores differentially discriminated for disease severity according to diagnosis (PSP: F1-F2; MSA: F2-F3). The total score was significantly related to survival in PSP (p<0.0007) or MSA (p<0.0005), indicating good predictive validity. CONCLUSIONS The scale is suitable for use in the context of multicentre studies and can reliably and consistently measure MRI abnormalities in PSP and MSA. 
Clinical Trial Registration Number The study protocol was filed in the open clinical trial registry (http://www.clinicaltrials.gov) with ID No NCT00211224
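
The reliability statistic named in the abstract above is presumably Cohen's kappa, which corrects the observed agreement between two sets of ratings for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal sketch of the unweighted form. The abstract does not say whether a weighted variant was used for the ordinal items, and the function name and example ratings are illustrative assumptions, not taken from the study.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equal-length sequences of ratings."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)

    # Observed agreement: fraction of items given the same rating twice.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of the two marginal frequencies per category.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings of four images by two raters (0 = normal, 1 = abnormal).
rater_1 = [1, 1, 0, 0]
rater_2 = [1, 0, 0, 0]
kappa = cohens_kappa(rater_1, rater_2)
```

Under the threshold quoted in the abstract, an item with a value of 0.41 or above would count as acceptably reliable.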

    defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data.

    This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.

    EPR-Dictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices

    The unidirectional FM index was introduced by Ferragina and Manzini in 2000 and allows searching for a pattern in the index in one direction. The bidirectional FM index (2FM) was introduced by Lam et al. in 2009. It allows searching for a pattern by extending an infix of the pattern arbitrarily to the left or right. If σ is the size of the alphabet then the method of Lam et al. can conduct one step in time O(σ) while needing space O(σ⋅n) using constant time rank queries on bit vectors. Schnattinger and colleagues improved this time to O(logσ) while using O(logσ⋅n) bits of space for both the FM and 2FM index. This is achieved by the use of binary wavelet trees. In this paper we introduce a new, practical method for conducting an exact search in a uni- and bidirectional FM index in O(1) time per step while using O(logσ⋅n)+o(logσ⋅σ⋅n) bits of space. This is done by replacing the binary wavelet tree by a new data structure, the Enhanced Prefixsum Rank dictionary (EPR-dictionary). We implemented this method in the SeqAn C++ library and experimentally validated our theoretical results. In addition, we compared our implementation with other freely available implementations of bidirectional indices and show that we are ≈2.2−4.2 times faster. This will have a large impact for many bioinformatics applications that rely on practical implementations of (2)FM indices, e.g. for read mapping. To our knowledge this is the first implementation of a constant time method for a search step in 2FM indices
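
To make the notion of a "search step" concrete, the following is a minimal sketch of unidirectional backward search in an FM index. It is not the EPR-dictionary or the SeqAn implementation: rank queries are answered from a fully precomputed occurrence table, which gives constant time per step but uses space proportional to σ⋅n, the cost the paper aims to avoid. The class and function names are illustrative assumptions, and the text is assumed not to contain the sentinel "$" or any character that sorts below it.

```python
from collections import Counter


def bwt_from_text(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # sentinel, assumed smaller than every text character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


class NaiveFMIndex:
    """Unidirectional FM index with a precomputed rank (occurrence) table."""

    def __init__(self, text):
        self.bwt = bwt_from_text(text)
        counts = Counter(self.bwt)

        # C[c] = number of characters in the BWT strictly smaller than c.
        self.C = {}
        total = 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]

        # occ[c][i] = number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def count(self, pattern):
        """Count occurrences of pattern, extending the match to the left."""
        lo, hi = 0, len(self.bwt)  # half-open suffix-array interval
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            # One search step: two rank lookups plus the C offset.
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = NaiveFMIndex("abracadabra")
matches = index.count("abra")
```

Each iteration of the loop in `count` is one search step. A bidirectional index maintains a second interval over the reversed text so that the pattern can also be extended to the right.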