
    Improving OCR Post Processing with Machine Learning Tools

    Optical Character Recognition (OCR) post-processing involves data-cleaning steps for documents that have been digitized, such as a book or a newspaper article. One step in this process is the identification and correction of spelling and grammar errors introduced by flaws in the OCR system. This work reports on our efforts to enhance post-processing for large repositories of documents. The main contributions of this work are:
    • Development of tools and methodologies to build both OCR and ground-truth text correspondence for training and testing the techniques proposed in our experiments. In particular, we explain the alignment problem and tackle it with our de novo algorithm, which has shown a high success rate.
    • Exploration of the Google Web 1T corpus to correct errors using context. We show that over half of the errors in the OCR text can be detected and corrected.
    • Application of machine learning tools to generalize past ad hoc approaches to OCR error correction. As an example, we investigate the use of logistic regression to select the correct replacement for misspellings in the OCR text.
    • Use of container technology to address the state of reproducible research in OCR and in Computer Science as a whole. Many past experiments in the field of OCR are not considered reproducible research, which raises the question of whether the original results were outliers or finessed.
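
The context-based correction idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the trigram counts below are made-up toy values standing in for a large n-gram corpus such as Google Web 1T, and the names `TRIGRAM_COUNTS`, `edit_distance`, and `best_replacement` are hypothetical. The sketch assumes candidates are generated by edit distance and ranked by how often they appear between the neighbouring words.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy counts; a real system would look these up in a large
# n-gram corpus rather than in a hard-coded dictionary.
TRIGRAM_COUNTS: Dict[Tuple[str, str, str], int] = {
    ("the", "quick", "brown"): 120,
    ("the", "quack", "brown"): 1,
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def best_replacement(left: str, error: str, right: str,
                     vocabulary: List[str],
                     max_distance: int = 2) -> Optional[str]:
    """Pick the vocabulary word closest to `error` that best fits the context.

    Candidates are words within `max_distance` edits of the OCR token; they
    are ranked by the count of the trigram (left, candidate, right).
    Returns None when no candidate has been seen in this context.
    """
    candidates = [w for w in vocabulary
                  if edit_distance(error, w) <= max_distance]
    scored = [(TRIGRAM_COUNTS.get((left, w, right), 0), w) for w in candidates]
    scored = [pair for pair in scored if pair[0] > 0]
    if not scored:
        return None
    return max(scored)[1]


if __name__ == "__main__":
    print(best_replacement("the", "qu1ck", "brown", ["quick", "quack", "slow"]))
```

A learned model such as the logistic regression mentioned above could replace the simple count-based ranking by combining several features per candidate (for example edit distance and context frequency); that extension is not shown here.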

    Rapid, label-free disease diagnostics by surface enhanced Raman spectroscopy

    Surface-Enhanced Raman Scattering (SERS) has the potential to be a rapid disease diagnostic platform. SERS is a well-known ultrasensitive, label-free method for the detection and identification of molecules at low concentrations. The Raman cross-sections are primarily enhanced by plasmonic effects for molecules close to (< 5 nm) the surface of nanostructured metal substrates. Because unique Raman vibrational features provide molecular signatures, we have shown that SERS can provide a rapid (< one hour), label-free, sensitive and specific diagnosis for a number of diseases. This work demonstrates the capability of SERS as an effective optical diagnostic approach, in particular for bacterial infectious diseases such as urinary tract infections (UTI) and sexually transmitted diseases (STD), and for cancer cell identification. More specifically, this work demonstrates the ability of SERS to distinguish different vegetative bacterial cells with species and strain specificity based on their intrinsic SERS molecular signatures. With the exception of C. trachomatis, the causative agent of chlamydia, whose SERS molecular signatures are found to be aggregated proteins on the cell membrane, all bacterial SERS molecular signatures are due to purine molecules resulting from nucleic acid metabolism as part of the rapid onset of the starvation response of these pathogens. The differences in the relative contributions of different purine metabolites for each bacterium give rise to the SERS strain and species specificity. The ability of SERS to distinguish cancer cells from normal cells grown in vitro, based on changes in SERS spectral features as a function of time after sample processing, is also demonstrated. Furthermore, the difference between the spectral features of the same bacteria on gold and silver SERS substrates can be used as an additional attribute for identification.
    This work demonstrates the potential of the SERS platform to provide antibiotic-specific diagnostics in clinical settings within one hour when combined with a portable Raman microscopy instrument, an effective enrichment procedure, multivariate data analysis and an expandable SERS reference library with a drug-susceptibility profile for each bacterial strain determined a priori. It also demonstrates the value of the SERS platform as a powerful bioanalytical probe for learning about biochemical processes near the cell membrane.