29 research outputs found

    Assessing the impact of OCR quality on downstream NLP tasks

    A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks (sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning) using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks, with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.

    Population genetic structure of Streptococcus pneumoniae in Kilifi, Kenya, prior to the introduction of pneumococcal conjugate vaccine.

    BACKGROUND: The 10-valent pneumococcal conjugate vaccine (PCV10) was introduced in Kenya in 2011. Introduction of any PCV will perturb the existing pneumococcal population structure; thus, the aim was to genotype pneumococci collected in Kilifi before PCV10. METHODS AND FINDINGS: Using multilocus sequence typing (MLST), we genotyped >1100 invasive and carriage pneumococci from children, the largest collection genotyped from a single resource-poor country and reported to date. Serotype 1 was the most common serotype causing invasive disease and was rarely detected in carriage; all serotype 1 isolates were members of clonal complex (CC) 217. There were temporal fluctuations in the major circulating sequence types (STs), and although 1–3 major serotype 1, 14 or 23F STs co-circulated annually, the two major serotype 5 STs mainly circulated independently. Major STs/CCs also included isolates of serotypes 3, 12F, 18C and 19A, and each shared ≤ 2 MLST alleles with STs that circulate widely elsewhere. Major CCs associated with non-PCV10 serotypes were predominantly represented by carriage isolates, although serotype 19A and 12F CCs were largely invasive and a serotype 10A CC was equally represented by invasive and carriage isolates. CONCLUSIONS: Understanding the pre-PCV10 population genetic structure in Kilifi will allow for the detection of changes in prevalence of the circulating genotypes and evidence for capsular switching post-vaccine implementation.

    Some pneumococcal serotypes are more frequently associated with relapses of acute exacerbations in COPD patients

    Objectives: To analyze the role of the capsular type in pneumococci causing relapse and reinfection episodes of acute exacerbation in COPD patients. Methods: A total of 79 patients with 116 recurrent episodes of acute exacerbations caused by S. pneumoniae were included in this study (1995–2010). An episode was considered a relapse when two consecutive episodes were caused by the same strain (identical serotype and genotype); otherwise it was considered a reinfection. Antimicrobial susceptibility testing (microdilution), serotyping (PCR, Quellung) and molecular typing (PFGE/MLST) were performed. Results: Among 116 recurrent episodes, 81 (69.8%) were reinfections, caused by the acquisition of a new pneumococcus, and 35 (30.2%) were relapses, caused by a pre-existing strain. Four serotypes (9V, 19F, 15A and 11A) caused the majority (60.0%) of relapses. When serotypes causing relapses and reinfection were compared, only two serotypes were associated with relapses: 9V (OR 8.0; 95% CI, 1.34–85.59) and 19F (OR 16.1; 95% CI, 1.84–767.20). Pneumococci isolated from relapses were more resistant to antimicrobials than those isolated from the reinfection episodes: penicillin (74.3% vs. 34.6%, p < 0.001), ciprofloxacin (25.7% vs. 9.9%, p < 0.027), levofloxacin (22.9% vs. 7.4%, p = 0.029), and co-trimoxazole (54.3% vs. 25.9%, p < 0.001). Conclusions: Although the acquisition of a new S. pneumoniae strain was the most frequent cause of recurrences, a third of the recurrent episodes were caused by a pre-existing strain. These relapse episodes were mainly caused by serotypes 9V and 19F, suggesting an important role for capsular type.
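The relapse/reinfection rule given in the Methods above (same serotype and genotype as the preceding episode means relapse, anything else means reinfection) can be sketched in Python. This is an illustrative sketch only, not the authors' code: the `Episode` record, its field names, the `classify_recurrences` helper, and the sample isolates are all hypothetical, and the genotype strings are placeholders rather than real PFGE/MLST types.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Episode:
    # Hypothetical record for one exacerbation episode of a single patient.
    serotype: str  # e.g. "9V"
    genotype: str  # placeholder label standing in for a PFGE/MLST type


def classify_recurrences(episodes: List[Episode]) -> List[str]:
    """Label every episode after the first by comparing it with its predecessor.

    `episodes` must be in chronological order for one patient.
    Identical serotype and genotype -> "relapse"; otherwise -> "reinfection".
    """
    labels = []
    for previous, current in zip(episodes, episodes[1:]):
        same_strain = (
            previous.serotype == current.serotype
            and previous.genotype == current.genotype
        )
        labels.append("relapse" if same_strain else "reinfection")
    return labels


# Made-up example data, not taken from the study.
history = [
    Episode("9V", "type-A"),
    Episode("9V", "type-A"),
    Episode("19F", "type-B"),
]
labels = classify_recurrences(history)
```

For the made-up history above, the second episode shares both serotype and genotype with the first and is labelled a relapse, while the third differs and is labelled a reinfection. How the study counted a patient's first episode is not stated in the abstract, so labelling only the episodes after the first is an assumption of this sketch.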

    defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data.

    This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.