Impact Analysis of OCR Quality on Research Tasks in Digital Archives
Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has a hidden cost: working with digitally processed historical documents raises the question of how much trust a scholar can place in noisy representations of source texts. In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it, even though such an estimate is important for assessing whether results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that accounts for their susceptibility to specific OCR-induced biases and for the data required for uncertainty estimation. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current state of knowledge, on the users' side as well as on the tool makers' and data providers' side, is insufficient and needs to be improved.
MBA: a literature mining system for extracting biomedical abbreviations
Background: The explosive growth of the biomedical literature presents many challenges for biological researchers. One such challenge arises from the heavy use of abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based, and statistically based. State-of-the-art methods either focus exclusively on acronym-type abbreviations or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method classifies abbreviations into acronym-type and non-acronym-type; their corresponding definitions are then identified by two different methods: a text alignment algorithm for the former and a statistical method for the latter. Results: A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, the Medstract gold-standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold-standard evaluation corpus. Conclusion: We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that MBA performs better than the other systems. It can identify the definitions not only of acronym-type abbreviations, including slightly irregular ones (e.g., <CNS1, cyclophilin seven suppressor>), but also of non-acronym-type abbreviations (e.g., <Fas, CD95>).
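As a reading aid, the following minimal sketch illustrates the general idea behind the text-alignment step for acronym-type abbreviations: the short form's characters are aligned right-to-left against the text preceding it. The function name, the heuristic (in the spirit of classic short-form/long-form matchers such as Schwartz & Hearst), and the example are illustrative assumptions, not MBA's actual scoring or alignment method.

```python
# Minimal illustrative sketch (not the MBA implementation): align an acronym-type
# short form against preceding text to recover its definition, in the spirit of
# classic short-form/long-form alignment heuristics.

def find_definition(short_form, candidate):
    """Align the short form's characters right-to-left against the candidate
    text (e.g., the words preceding "(ABBR)"); return the matched span or None."""
    s_idx = len(short_form) - 1
    c_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():                      # skip punctuation in the short form
            s_idx -= 1
            continue
        # move left until the character matches; the short form's first character
        # must additionally sit at the start of a word in the candidate
        while c_idx >= 0 and (candidate[c_idx].lower() != ch or
                              (s_idx == 0 and c_idx > 0 and candidate[c_idx - 1].isalnum())):
            c_idx -= 1
        if c_idx < 0:
            return None                           # alignment failed: no definition found
        s_idx -= 1
        c_idx -= 1
    return candidate[c_idx + 1:].strip()

print(find_definition("OCR", "optical character recognition"))
# -> "optical character recognition"
```

In practice such heuristics restrict the candidate to a small window of words immediately preceding the parenthesized abbreviation, which keeps the alignment from drifting onto earlier, unrelated words.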
The cosmological and alchemical path of the seven colors (haft rang) in the Haft paykar of Niẓāmī Ganǧawī (d. ca. 570-610/1174-1222)
Starting from a miniature belonging to a magnificent Iskandar Anthology (Persia, Šīrāz, 1410-11, folio 66v; Lisbon, Calouste Gulbenkian Foundation), this article analyzes the cosmology, the symbolism of the seven colors (haft rang), and the alchemical progression in the mystical tale Haft paykar (Seven Princesses) by the Persian poet Abū Md. Ilyās b. Yūsuf Niẓāmī Ganǧawī (d. ca. 570-610/1174-1222). This inner path of transmutation through colors (ch.:pp.: black: HP26:180-181; yellow: HP27:196-197; green: HP28:214; red: HP29:234; blue: HP30:266-267; sandalwood: HP31:291; and white: HP32:315) ends, in keeping with the Iranian tradition (Mazdeism, Zoroastrianism, išrāqī illuminative wisdom), with the color white, symbol of the purity of the soul and of enlightenment.
Automatic Text Summarization based on Word-Clusters and Ranking Algorithms
This paper investigates a new approach to single-document summarization based on a machine learning ranking algorithm. The use of machine learning techniques for this task makes it possible to adapt summaries to the user's needs and to the corpus characteristics. These desirable properties have motivated an increasing amount of work in this field over the last few years. Most approaches attempt to generate summaries by extracting text spans (sentences in our case) and adopt the classification framework, which consists in training a classifier to discriminate between relevant and irrelevant spans of a document. A set of features is first used to produce a vector of scores for each sentence in a given document, and a classifier is trained to make a global combination of these scores. We believe that the classification criterion is not well suited to single-document summarization and propose an original framework based on ranking for this task. A ranking algorithm also combines the scores of different features, but its criterion tends to reduce the relative misordering of sentences within a document. The features we use are either based on the state of the art or built upon word clusters. These clusters are groups of words that often co-occur with each other and can serve to expand a query or to enrich the representation of the sentences of the documents. We analyze the performance of our ranking algorithm on two data sets: the Computation and Language (cmp_lg) collection of TIPSTER SUMMAC and the WIPO collection. We perform comparisons with different baseline (non-learning) systems and with a reference trainable summarizer based on the classification framework. The experiments show that the learning algorithms perform better than the non-learning systems, while the ranking algorithm outperforms the classifier. The difference in performance between the two learning algorithms depends on the nature of the data sets; we explain this by the different separability assumptions the two learning algorithms make about the data.
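To make the distinction between the classification and ranking criteria concrete, here is a minimal sketch of a pairwise ranking scorer for extractive summarization. The feature matrix, the perceptron-style update, and all parameter names are illustrative assumptions, not the paper's algorithm or features.

```python
# Illustrative sketch only: a pairwise ranking scorer for extractive
# single-document summarization. The learning rule and features are assumptions.
import numpy as np

def train_ranker(X, relevant, epochs=50, lr=0.1):
    """Learn weights w so that relevant sentences score above irrelevant ones.

    X        : (n_sentences, n_features) feature matrix for a document collection
    relevant : boolean NumPy array marking sentences in the reference summaries
    The pairwise criterion penalises each misordered (relevant, irrelevant) pair,
    rather than classifying sentences independently.
    """
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    pos, neg = np.where(relevant)[0], np.where(~relevant)[0]
    for _ in range(epochs):
        for i in rng.permutation(pos):
            for j in neg:
                if X[i] @ w <= X[j] @ w:          # misordered pair
                    w += lr * (X[i] - X[j])       # perceptron-style update
    return w

def summarize(X, sentences, w, k=3):
    """Return the k highest-scoring sentences, kept in document order."""
    top = np.argsort(X @ w)[::-1][:k]
    return [sentences[i] for i in sorted(top)]
```

A classifier would instead label each sentence independently; the pairwise loss above only cares about the relative order of relevant and irrelevant sentences within a document, which is the property the ranking framework exploits.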
Results of Applying Probabilistic IR to OCR Text
Character accuracy of optically recognized text is considered a basic measure for evaluating OCR devices. In a broader sense, another fundamental measure of an OCR device's quality is whether the text it generates is usable for retrieving information. In this study, we evaluate retrieval effectiveness from OCR text databases using a probabilistic IR system and compare these retrieval results to their manually corrected equivalents. We show there is no statistical difference in precision and recall across the graded accuracy levels of three OCR devices. However, characteristics of the OCR data have side effects that could cause unstable results with this IR model; in particular, we found that individual queries can be greatly affected. Knowing the qualities of OCR text, we compensate for them by applying an automatic post-processing system that improves effectiveness.
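The evaluation described above amounts to comparing per-query effectiveness on OCR text against the manually corrected equivalent. The sketch below shows one hedged way to compute such per-query differences; the data structures (ranked result lists per query and relevance judgements) are placeholders for illustration, not the study's actual setup.

```python
# Illustrative sketch only: per-query comparison of retrieval effectiveness on
# OCR text vs. manually corrected text. Inputs are assumed dictionaries mapping
# query ids to retrieved document ids (runs) and to relevant document ids (qrels).
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def per_query_deltas(runs_ocr, runs_clean, qrels):
    """Per-query precision differences (OCR minus corrected), useful for spotting
    the individual queries that are greatly affected even when averages agree."""
    deltas = {}
    for qid, relevant in qrels.items():
        p_ocr, _ = precision_recall(runs_ocr[qid], relevant)
        p_clean, _ = precision_recall(runs_clean[qid], relevant)
        deltas[qid] = p_ocr - p_clean
    return deltas
```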
OCR correction based on document level knowledge
For over 10 years, the Information Science Research Institute (ISRI) at UNLV has worked on problems associated with the electronic conversion of archival document collections. Such collections typically have a large fraction of poor-quality images and present a special challenge to OCR systems. Frequently, because of the size of the collection, manual correction of the output is not affordable. Because the output text is used only to build the index for an information retrieval (IR) system, the accuracy of non-stopwords is the most important measure of output quality. For these reasons, ISRI has focused on using document-level knowledge as the best means of providing automatic correction of non-stopwords in OCR output. In 1998, we developed the MANICURE [1] post-processing system, which combined several document-level corrections. Because of the high cost of obtaining accurate ground-truth text at the document level, we have never been able to quantify the accuracy improvement achievable using document-level knowledge. In this report, we describe an experiment to measure the actual number (and percentage) of non-stopwords corrected by the MANICURE system. We believe this to be the first quantitative measure of the OCR conversion improvement that is possible using document-level knowledge.
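Although the report's exact measurement procedure is not reproduced here, the following minimal sketch illustrates the kind of counting it describes: comparing raw OCR output and post-processed output against ground truth and tallying the non-stopwords that were corrected. The stopword list, tokenisation, and position-wise token alignment are simplifying assumptions, not MANICURE's method.

```python
# Illustrative sketch only: count non-stopwords that a post-processing pass
# corrected, given token sequences for raw OCR output, post-processed output,
# and ground truth (assumed to be pre-aligned position by position).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def corrected_non_stopwords(ocr_tokens, fixed_tokens, truth_tokens):
    """Return (corrected, errors, rate): how many erroneous non-stopwords in the
    raw OCR output match the ground truth after post-processing, plus the rate."""
    corrected = errors = 0
    for ocr, fixed, truth in zip(ocr_tokens, fixed_tokens, truth_tokens):
        if truth.lower() in STOPWORDS:
            continue
        if ocr != truth:                    # OCR got this non-stopword wrong
            errors += 1
            if fixed == truth:              # post-processing recovered it
                corrected += 1
    rate = corrected / errors if errors else 0.0
    return corrected, errors, rate
```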