Optical Character Recognition of Printed Persian/Arabic Documents
Texts are an important representation of language. Given the volume of texts generated and the historical value of some documents, it is imperative to use computers to read them and make them editable and searchable. This task, however, is not trivial. Recreating human perception capabilities, such as reading documents, in artificial systems is one of the major goals of pattern recognition research. After decades of research and improvements in computing capabilities, humans' ability to read typed or handwritten text is still hardly matched by machine intelligence. Although classical applications of Optical Character Recognition (OCR), such as reading machine-printed addresses in a mail-sorting machine, are considered solved, more complex scripts and handwritten texts push the limits of the existing technology. Moreover, many existing OCR systems are language-dependent, so improvements in OCR technology have been uneven across languages. For Persian in particular, research has been limited: despite the need to process many Persian historical documents and to use OCR in a variety of applications, few Persian OCR systems achieve a good recognition rate. Consequently, automatically reading Persian typed documents with close-to-human performance is still an open problem and the main focus of this dissertation. After a literature survey of the existing technology, we propose new techniques for two important preprocessing steps of any OCR system: skew detection and page segmentation. Then, rather than the usual practice of character segmentation, we propose segmenting Persian documents into sub-words, a choice that avoids the challenges of segmenting highly cursive Persian text into isolated characters. For feature extraction, we propose a hybrid scheme combining three commonly used methods, and we finally use a nonparametric classification method.
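The abstract does not detail the sub-word segmenter; purely as an illustration (all names are hypothetical, and this is not the dissertation's code), sub-words in a binarized Persian page roughly correspond to connected ink strokes, so they can be approximated by labeling connected components:

```python
from collections import deque

def connected_components(img):
    """Label 4-connected ink components in a binary raster.

    img: list of rows, each a list of 0 (background) / 1 (ink).
    Returns a list of bounding boxes (top, left, bottom, right),
    one per component -- a rough proxy for Persian sub-words.
    """
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1 and not seen[y][x]:
                # Flood fill one component, tracking its extent.
                top, left, bottom, right = y, x, y, x
                q = deque([(y, x)])
                seen[y][x] = True
                while q:
                    cy, cx = q.popleft()
                    top, bottom = min(top, cy), max(bottom, cy)
                    left, right = min(left, cx), max(right, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((top, left, bottom, right))
    # Persian reads right-to-left, so order components accordingly.
    return sorted(boxes, key=lambda b: -b[3])
```

A real system would additionally need to merge detached dots and diacritics with their base component, which this sketch ignores.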
A large number of papers and patents advertise recognition rates near 100%. Such claims give the impression that the automation problem has been solved. Although OCR is widely used, its accuracy today is still far from a child's reading skills. The failure of some real applications shows that performance problems still exist on composite and degraded documents and that there is still room for progress.
Automated anonymization of legal contracts in Portuguese (Anonimização automatizada de contratos jurídicos em português)
With the introduction of the General Data Protection Regulation, many organizations
were left with a large number of documents containing public information
that should have been private. Given the quite large quantities of documents
involved, it would be a waste of resources to edit them manually. The
objective of this dissertation is the development of an autonomous system for the
anonymization of sensitive information in contracts written in Portuguese.
This system uses Google Cloud Vision, an OCR API, to extract any text present
in a document. Because these documents may be poorly readable, image
pre-processing is performed with the OpenCV library to increase the readability
of the text in the images. Among others, binarization, skew-correction, and
noise-removal algorithms were explored.
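The abstract names binarization among the explored pre-processing steps without giving details. As a minimal illustration only (not the project's actual code), Otsu's classic method picks the grayscale threshold that maximizes the between-class variance; this pure-Python sketch works on a flat list of 0-255 pixel values:

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0-255 gray values.

    Chooses t maximizing the between-class variance
    w0 * w1 * (mu0 - mu1)^2 over all candidate thresholds.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count at or below the candidate threshold
    sum0 = 0  # gray-level mass at or below the candidate threshold
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(pixels, t):
    """Map gray values to 0 (dark) / 1 (light) bits using threshold t."""
    return [1 if p > t else 0 for p in pixels]
```

In practice OpenCV implements the same idea natively via `cv2.threshold` with the `THRESH_OTSU` flag.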
Once the text has been extracted, it is interpreted by an NLP library. For this
project we chose spaCy, which provides a Portuguese pipeline trained on the
WikiNer and UD Portuguese Bosque datasets. This library not only allows fairly
complete part-of-speech identification but also supports four different
named-entity categories in its model. In addition to the processing carried
out with spaCy, and since Portuguese is not well supported, some rule-based
algorithms were implemented to identify other, more specific kinds of
information, such as identification numbers and postal codes. Finally, the
information considered confidential is covered with a black rectangle drawn by
OpenCV at the coordinates returned by Google
Cloud Vision OCR and a new PDF is generated.
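The rule-based part of such a pipeline can be sketched with plain regular expressions. The patterns below are illustrative assumptions, not the dissertation's actual rules: Portuguese postal codes follow the NNNN-NNN format, and the example ID pattern simply matches a bare 8-digit number (real citizen-card numbers carry check digits this sketch ignores):

```python
import re

# Assumed formats (illustrative): Portuguese postal code "1234-567",
# and a bare 8-digit identification number.
POSTAL_CODE = re.compile(r"\b\d{4}-\d{3}\b")
ID_NUMBER = re.compile(r"\b\d{8}\b")

def find_sensitive_spans(text):
    """Return (start, end, label) character spans for rule-matched entities.

    In a full pipeline these spans would be mapped to the word bounding
    boxes returned by the OCR step and blacked out in the page image.
    """
    spans = []
    for label, pattern in (("POSTAL_CODE", POSTAL_CODE), ("ID_NUMBER", ID_NUMBER)):
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)
```

spaCy's named-entity spans could be merged into the same list, so that redaction treats model-detected and rule-detected entities uniformly.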
The rectification and recognition of document images with perspective and geometric distortions
Ph.D. (Doctor of Philosophy)
Adaptive Methods for Robust Document Image Understanding
A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we turn our attention to each of the following processing stages: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, with special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both a computational-complexity and a threshold-selection point of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front-page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules can robustly process a wide variety of documents with good overall accuracy.
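The thesis does not reproduce its skew detector here; one classical family of skew detectors (shown purely as an illustration, not necessarily the proposed method) evaluates candidate angles and picks the one whose horizontal projection profile is peakiest, since correctly deskewed text concentrates ink into few rows. This pure-Python sketch approximates small rotations by a vertical shear:

```python
import math

def profile_energy(img, angle_deg):
    """Sum of squared row ink counts after shearing by angle_deg.

    img is a list of rows of 0/1 pixels. For small angles, the shear
    y -> y + round(x * tan(angle)) approximates rotation; the energy
    peaks when text lines align with pixel rows.
    """
    t = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(img):
        for x, pix in enumerate(row):
            if pix:
                r = y + round(x * t)
                counts[r] = counts.get(r, 0) + 1
    return sum(c * c for c in counts.values())

def detect_skew(img, max_angle=5):
    """Candidate angle (whole degrees) with the peakiest profile.

    Ties are broken toward the smallest absolute angle, so an
    already-straight page reports zero skew.
    """
    angles = range(-max_angle, max_angle + 1)
    return max(angles, key=lambda a: (profile_energy(img, a), -abs(a)))
```

Production systems refine this with sub-degree search, true rotation, and downsampling for speed; the thesis's own method is additionally layout-independent, which this sketch is not.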
How WEIRD is Usable Privacy and Security Research? (Extended Version)
In human factor fields such as human-computer interaction (HCI) and
psychology, researchers have been concerned that participants mostly come from
WEIRD (Western, Educated, Industrialized, Rich, and Democratic) countries. This
WEIRD skew may hinder understanding of diverse populations and their cultural
differences. The usable privacy and security (UPS) field has inherited many
research methodologies from research on human factor fields. We conducted a
literature review to understand the extent to which participant samples in UPS
papers were from WEIRD countries and the characteristics of the methodologies
and research topics in each user study recruiting Western or non-Western
participants. We found that the skew toward WEIRD countries in UPS is greater
than that in HCI. Geographic and linguistic barriers in the study methods and
recruitment methods may cause researchers to conduct user studies locally. In
addition, many papers did not report participant demographics, which could
hinder the replication of the reported studies, leading to low reproducibility.
To improve geographic diversity, we offer suggestions, including facilitating
replication studies, addressing geographic and linguistic issues in study and
recruitment methods, and encouraging research on topics relevant to non-WEIRD
populations.
Comment: This paper is the extended version of the paper presented at USENIX
SECURITY 202
Exploring the Differences Between Pre-Service Teachers' Analyses of Various Informal Reading Inventory Results in the Elementary Grades
Reading is a fundamental skill in our modern society; being able to read with comprehension and fluency is important in all core academic subjects. Reading teachers are charged with analyzing student data in order to drive their instructional decisions. Informal Reading Inventories (IRIs) are one type of informal reading assessment that teachers can use in the classroom to learn about student reading behaviors and guide instruction; they assess fluency and comprehension. Research suggests that fluency and comprehension have a reciprocal relationship: improving one skill improves the other simultaneously (DeVries, 2011). This study explored how pre-service teachers (college students in an education program) and in-service teachers (veteran classroom teachers) analyzed data from various IRIs. It also explored how three separate IRIs, the Qualitative Reading Inventory (QRI), the Basic Reading Inventory (BRI), and the Analytical Reading Inventory (ARI), compare to one another. There were four participants: two undergraduate students in an elementary education program reading class and two veteran classroom teachers. The study found that the grade-level readability of the passages is inconsistent with the reading levels they claim, which is worth noting because many teachers only use the resources on which they were trained during their college education. The study also found that the length of the IRI passages affected students' words correct per minute (WCPM): the longer the passage, the lower the WCPM, probably because students need more time to process a longer passage for the sake of comprehension.
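For readers outside the field, WCPM is conventionally computed as the number of words read correctly, normalized to one minute; a one-line sketch under that standard definition:

```python
def words_correct_per_minute(total_words, errors, seconds):
    """Standard WCPM: words read correctly, normalized to one minute."""
    return (total_words - errors) / (seconds / 60)
```

For example, a student reading 120 words with 6 errors in 90 seconds scores 76 WCPM, which is why a longer, slower-going passage can depress the measure.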
Automated Fixing of Programs with Contracts
This paper describes AutoFix, an automatic debugging technique that can fix
faults in general-purpose software. To provide high-quality fix suggestions and
to enable automation of the whole debugging process, AutoFix relies on the
presence of simple specification elements in the form of contracts (such as
pre- and postconditions). Using contracts enhances the precision of dynamic
analysis techniques for fault detection and localization, and for validating
fixes. The only required user input to the AutoFix supporting tool is then a
faulty program annotated with contracts; the tool produces a collection of
validated fixes for the fault ranked according to an estimate of their
suitability.
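AutoFix itself targets programs whose language has native contract support; purely as an illustration of the contract elements the abstract describes (all names here are hypothetical), pre- and postconditions can be emulated in Python with a decorator:

```python
import functools

def contract(pre=None, post=None):
    """Attach a precondition and a postcondition to a function.

    pre receives the call arguments; post receives the result.
    A violated precondition blames the caller, while a violated
    postcondition blames the implementation -- the asymmetry that
    lets contract-based tools localize faults.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if pre is not None:
                assert pre(*args, **kwargs), f"precondition of {fn.__name__} violated"
            result = fn(*args, **kwargs)
            if post is not None:
                assert post(result), f"postcondition of {fn.__name__} violated"
            return result
        return inner
    return wrap

@contract(pre=lambda xs: len(xs) > 0, post=lambda r: r >= 0)
def count_positives(xs):
    """Number of strictly positive elements in a non-empty list."""
    return sum(1 for x in xs if x > 0)
```

A dynamic-analysis tool can then run a test suite and use which assertion fired, and where, as evidence for fault localization.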
In an extensive experimental evaluation, we applied AutoFix to over 200
faults in four code bases of different maturity and quality (of implementation
and of contracts). AutoFix successfully fixed 42% of the faults, producing, in
the majority of cases, corrections of quality comparable to those competent
programmers would write; the used computational resources were modest, with an
average time per fix below 20 minutes on commodity hardware. These figures
compare favorably to the state of the art in automated program fixing, and
demonstrate that the AutoFix approach is successfully applicable to reduce the
debugging burden in real-world scenarios.
Comment: Minor changes after proofreading