    Metrics for Complete Evaluation of OCR Performance

    In this paper, we study metrics for evaluating OCR performance both in terms of physical segmentation and in terms of textual content recognition. These metrics rely on the OCR output (hypothesis) and the reference (also called ground truth) input format. Two evaluation criteria are considered: the quality of segmentation and the character recognition rate. Three pairs of input formats are selected among two types of inputs: text only (text) and text with spatial information (xml). These reference-to-hypothesis input pairs are: 1) text-to-text, 2) xml-to-xml and 3) text-to-xml. For the text-to-text pair, we selected the RETAS method to perform experiments and show its limits. Regarding text-to-xml, a new method based on unique word anchors is proposed to solve the problem of aligning texts with different information. We define the ZoneMapAltCnt metric for the xml-to-xml approach and show that it offers the most reliable and complete evaluation compared to the other two. Open source OCRs like Tesseract and OCRopus are selected to perform experiments. The datasets used are a collection of documents from the ISTEX document database, from the French newspaper "Le Nouvel Observateur", as well as invoices and administrative documents gathered from different collaborations.
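    The abstract names the character recognition rate as one of its two evaluation criteria but does not define it. As a rough illustration only, the sketch below uses a common convention, one minus the Levenshtein edit distance divided by the reference length, for the text-to-text case. This is an assumption, not the paper's RETAS or ZoneMapAltCnt procedure, and the function names `levenshtein` and `char_recognition_rate` are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def char_recognition_rate(ref: str, hyp: str) -> float:
    """Assumed definition: 1 - edit_distance / len(ref), clamped to [0, 1]."""
    if not ref:
        # Convention chosen here: an empty reference is matched only by an empty output.
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - levenshtein(ref, hyp) / len(ref))


if __name__ == "__main__":
    reference = "Le Nouvel Observateur"
    hypothesis = "Le Nouve1 0bservateur"  # made-up OCR output for illustration
    print(char_recognition_rate(reference, hypothesis))
```

    A score of this kind only compares character sequences; it says nothing about segmentation quality, which the abstract treats as a separate criterion.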

    Copisti Digitali e Filologi Computazionali

    The volume consists of ten chapters and brings together, reworked and updated, material from the author's two doctoral theses, one in Classical Philology (2005) and the other in Computational Linguistics (2010), both defended at the University of Trento. After a brief introduction to the concept of collaborative and cooperative philology, the first chapters are devoted to digital ecdotics, and thus to the acquisition of the text of critical editions through OCR and to the computational processing of critical apparatuses and repertories of conjectures. The following chapters are devoted to salient aspects of digital hermeneutics, such as syntactic analysis through the creation of treebanks and lexico-semantic analysis through the creation of wordnets and the exploration of word spaces with statistical methods. The volume closes with a chapter discussing critical points of the text used as a case study (Aeschylus' Persians) and a chapter of conclusions and research perspectives.

    Collaborative Research Practices and Shared Infrastructures for Humanities Computing

    The volume collects the proceedings of the 2nd Annual Conference of the Italian Association for Digital Humanities (Aiucd 2013), which took place at the Department of Information Engineering of the University of Padua, 11-12 December 2013. The general theme of Aiucd 2013 was “Collaborative Research Practices and Shared Infrastructures for Humanities Computing”, so we particularly welcomed submissions on interdisciplinary work and new developments in the field, encouraging proposals relating to the theme of the conference, or more specifically: interdisciplinarity and multidisciplinarity, legal and economic issues, tools and collaborative methodologies, measurement and impact of collaborative methodologies, sharing and collaboration methods and approaches, cultural institutions and collaborative facilities, infrastructures and digital libraries as collaborative environments, and the sharing of data resources and technologies.
