Optical Character Recognition (OCR) with a Structural Method Using Point and Vector Feature Extraction
ABSTRACT: Optical Character Recognition (OCR) is a computer system that automatically recognizes a sequence of characters originating from a typewriter, a printing press, or handwriting. In other words, OCR is the process of converting a text document into a computer file without re-editing: every character, word, and sentence can be recognized precisely and read by other software, without retyping or editing. This final project develops an application to identify the characters in an image file (*.bmp or *.jpg) containing characters from a scanned hardcopy or another source. Feature extraction uses a vector-and-region approach: the vectors composing a character's strokes are determined in each observation area (region), with each character divided into nine equal, symmetric regions. To evaluate the performance of OCR with this method, tests were run on several input samples, both from hardcopy documents and from other sources. The results show that this OCR system achieves a recognition rate of 91.88% for trained fonts and 52.26% for untrained fonts.
Keywords: automatic character recognition, feature extraction, vector, region
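The nine-region split described above can be sketched in a few lines. This is a simplified illustration: the project extracts stroke-vector features per region, while the sketch below uses plain ink density as a stand-in per-region feature.

```python
import numpy as np

def region_features(glyph: np.ndarray) -> np.ndarray:
    """Split a binary glyph image into a 3x3 grid of equal regions
    and return the ink density of each region as a 9-dim vector."""
    h, w = glyph.shape
    rows = np.array_split(np.arange(h), 3)
    cols = np.array_split(np.arange(w), 3)
    feats = []
    for r in rows:
        for c in cols:
            block = glyph[np.ix_(r, c)]
            feats.append(block.mean())  # fraction of "on" pixels in this region
    return np.array(feats)

# A toy 6x6 glyph: a filled vertical bar in the middle column band.
glyph = np.zeros((6, 6), dtype=float)
glyph[:, 2:4] = 1.0
print(region_features(glyph))  # only the middle column of regions is inked
```

Each character image thus maps to a fixed-length feature vector, which is what makes the region-wise comparison between trained and test fonts possible.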
Administrative Document Analysis and Structure
This chapter reports our knowledge about the analysis and recognition of scanned administrative documents. Because the administrative paper flow brings new and continuous arrivals, conventional techniques designed for modeling and recognizing static databases are doomed to failure. For this purpose, a new technique based on experience was investigated, giving very promising results. This technique is related to case-based reasoning, already used in data mining and various machine learning problems. After presenting the context of the administrative document flow and its requirements for real-time processing, we present case-based reasoning for invoice processing. A case corresponds to the co-existence of a problem and its solution. The problem in an invoice corresponds to a local structure, such as the keywords of an address or the line patterns in the amounts table, while the solution is related to their content. This problem is then compared to a document case base using graph probing. For this purpose, we proposed an improvement of an existing neural network called Incremental Growing Neural Gas.
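The graph-probing comparison mentioned above can be illustrated with a minimal sketch. One common form of graph probing summarizes each graph as a frequency vector of node "probes" (here, a node's label paired with its out-degree) and compares the vectors with an L1 distance; the chapter's actual probe definition and case representation are richer, and the toy invoice graphs below are invented for illustration.

```python
from collections import Counter

def probe_counts(edges, labels):
    """Probe each node by (label, out-degree): a cheap structural signature."""
    deg = Counter(u for u, _ in edges)
    return Counter((labels[n], deg.get(n, 0)) for n in labels)

def graph_probe_distance(g1, g2):
    """L1 distance between the two probe-frequency vectors (graph probing)."""
    c1, c2 = probe_counts(*g1), probe_counts(*g2)
    keys = set(c1) | set(c2)
    return sum(abs(c1[k] - c2[k]) for k in keys)

# Toy invoice-structure graphs: keyword nodes (KW) linked to value nodes (VAL).
case = (
    [("addr", "v1"), ("total", "v2")],
    {"addr": "KW", "total": "KW", "v1": "VAL", "v2": "VAL"},
)
query = (
    [("addr", "v1")],
    {"addr": "KW", "v1": "VAL"},
)
print(graph_probe_distance(case, query))  # small distance -> similar structure
```

The case with the smallest probe distance to the incoming document is retrieved, and its stored solution (how to read the matched structure's content) is reused.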
Semantic framework for regulatory compliance support
Regulatory Compliance Management (RCM) is a management process that an organization
implements to conform to regulatory guidelines. Two processes that contribute towards
automating RCM are: (i) extraction of meaningful entities from regulatory text and (ii)
mapping of regulatory guidelines to organisational processes. These processes help keep
the RCM up to date as regulatory guidelines change. The update process is still manual,
since there is comparatively little research in this direction. Semantic Web technologies
are potential candidates for making the update process automatic. There are stand-alone
frameworks that use Semantic Web technologies such as information extraction, ontology
population, similarity measures and ontology mapping. However, the integration of these
approaches into semantic compliance management has not been explored yet. Considering
these two processes as crucial constituents, the aim of this thesis is to automate the
processes of RCM. It proposes a framework called RegCMantic.
The proposed framework is designed and developed in two main parts. The first part
extracts regulatory entities from regulatory guidelines; extracting meaningful entities
helps relate the guidelines to organisational processes. The framework identifies the
document components and extracts the entities from them, using four components: (i) a
parser, (ii) definition terms, (iii) ontological concepts and (iv) rules. The parser
breaks a sentence down into useful segments, and extraction is carried out over those
segments using the definition terms, ontological concepts and rules. The extracted
entities are core-entities, such as subject, action and obligation, and aux-entities,
such as time, place, purpose, procedure and condition.
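A heavily simplified, hypothetical version of the rule stage can be sketched as follows. The single regex rule, the modal-verb list and the example sentence are all invented for illustration; the thesis's parser, definition terms and ontological lookups are far richer than this.

```python
import re

# Hypothetical rule: a modal verb marks the obligation; the text before the
# modal is treated as the subject, the text after it as the action.
OBLIGATION_RULE = re.compile(
    r"^(?P<subject>.+?)\s+(?P<obligation>shall|must|should)\s+(?P<action>.+)$",
    re.IGNORECASE,
)

def extract_core_entities(guideline: str) -> dict:
    """Return {subject, obligation, action} for a guideline sentence, or {}."""
    m = OBLIGATION_RULE.match(guideline.strip().rstrip("."))
    return m.groupdict() if m else {}

print(extract_core_entities(
    "The manufacturer shall retain batch records for five years."
))
```

Aux-entities such as time or place would need further segment-level rules of the same shape, one per entity type.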
The second part of the framework relates the regulatory guidelines to organisational
processes. The proposed framework uses a mapping algorithm that considers three types of
entities in the regulatory domain and two types of entities in the process domain. In the
regulatory domain, the considered entities are the regulation-topic, core-entities and
aux-entities, whereas in the process domain, they are subject and action. Using these
entities, the algorithm computes an aggregate of three similarity scores: a topic-score,
a core-score and an aux-score. The aggregate similarity score determines whether a
regulatory guideline is related to an organisational process.
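The aggregation step above amounts to a weighted combination and a threshold test. The weights and threshold in this sketch are illustrative placeholders, not values taken from the thesis.

```python
def aggregate_score(topic: float, core: float, aux: float,
                    weights=(0.3, 0.5, 0.2)) -> float:
    """Weighted aggregation of the three similarity scores (illustrative weights)."""
    wt, wc, wa = weights
    return wt * topic + wc * core + wa * aux

def is_related(topic: float, core: float, aux: float,
               threshold: float = 0.6) -> bool:
    """A guideline maps to a process if the aggregate score clears the threshold."""
    return aggregate_score(topic, core, aux) >= threshold

# 0.3*0.8 + 0.5*0.7 + 0.2*0.4 = 0.67 >= 0.6
print(is_related(topic=0.8, core=0.7, aux=0.4))  # True
```

Weighting the core-score highest reflects the intuition that matching subject and action matters more than matching the broad topic, but the actual balance is an empirical choice.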
The RegCMantic framework is validated through the development of a prototype system. The
prototype implements a case study involving regulatory guidelines governing the
pharmaceutical industry in the UK. Evaluation of the case-study results shows improved
accuracy in extracting regulatory entities and in relating regulatory guidelines to
organisational processes. This research contributes by extracting meaningful entities
from regulatory guidelines, which are provided as unstructured text, and by semantically
mapping the regulatory guidelines to organisational processes.
Graphical tools for ground truth generation in HTR tasks
This report covers the development of several graphical tools for ground
truth generation in HTR tasks, specifically for layout analysis, line segmentation
and transcription, as well as one ad hoc tool needed for point classification
in an implemented line-size normalization method. It shows the
design process behind the tools, giving an overview of their internal structure
through class diagrams. It also explains the mentioned phases of the HTR
process, with the aim of clarifying each tool's context and utility. Finally, the report
closes with brief conclusions and considerations about the future of the tools.
Martínez Vargas, J. (2014). Graphical tools for ground truth generation in HTR tasks. http://hdl.handle.net/10251/36156.
Adaptive Analysis and Processing of Structured Multilingual Documents
Digital document processing is becoming popular in applications such as office and library automation, bank and postal services, publishing houses and communication management. In recent years, the demand for tools capable of searching written and spoken sources of multilingual information has increased tremendously, and the bilingual dictionary is one of the important resources for providing the required information. Processing and analysis of bilingual dictionaries raises the challenge of dealing with many different scripts, some of which are unknown to the designer.
A framework is presented to adaptively analyze and process structured multilingual documents, where adaptability is applied to every step. The proposed framework involves:
(1) General word-level script identification using Gabor filter.
(2) Font classification using the grating cell operator.
(3) General word-level style identification using Gaussian mixture model.
(4) An adaptable Hindi OCR based on generalized Hausdorff image comparison.
(5) Retargetable OCR with automatic training sample creation and its applications to different scripts.
(6) Bootstrapping entry segmentation, which segments each page into functional entries for parsing.
Experimental results on different scripts, such as Chinese, Korean, Arabic, Devanagari, and Khmer, demonstrate that the proposed framework can save human effort significantly by making each phase adaptive.
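Step (1), word-level script identification with Gabor filters, can be sketched as follows. This is a minimal stand-in, not the thesis's implementation: it builds a small bank of oriented Gabor kernels, takes the mean absolute filter response per orientation as a texture feature, and relies on the fact that scripts with different dominant stroke directions produce different feature vectors. The kernel parameters and toy stroke images are invented for illustration.

```python
import numpy as np

def gabor_kernel(theta: float, size: int = 9, sigma: float = 2.0,
                 lam: float = 4.0) -> np.ndarray:
    """Real part of a Gabor filter at orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def gabor_features(img: np.ndarray, n_orient: int = 4) -> np.ndarray:
    """Mean absolute response at several orientations: a crude script signature."""
    feats = []
    h, w = img.shape
    for k in range(n_orient):
        kern = gabor_kernel(np.pi * k / n_orient)
        kh, kw = kern.shape
        # full 2-D convolution via zero-padded FFT
        shape = (h + kh - 1, w + kw - 1)
        resp = np.fft.irfft2(np.fft.rfft2(img, s=shape) *
                             np.fft.rfft2(kern, s=shape), s=shape)
        feats.append(np.abs(resp).mean())
    return np.array(feats)

# Toy "word images": horizontal strokes vs. vertical strokes excite
# different orientations, so their feature vectors differ.
horiz = np.zeros((16, 16)); horiz[4::4, :] = 1.0
vert = np.zeros((16, 16)); vert[:, 4::4] = 1.0
print(gabor_features(horiz), gabor_features(vert))
```

In a full system, each word's feature vector would be classified (e.g. by nearest centroid) against per-script reference statistics learned from labeled samples.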
Automated labeling in document images
The National Library of Medicine (NLM) is developing an automated system to produce bibliographic records for its MEDLINE® database. This system, named the Medical Article Record System (MARS), employs document image analysis and understanding techniques and optical character recognition (OCR). This paper describes a key module in MARS called the Automated Labeling (AL) module, which labels all zones of interest (title, author, affiliation, and abstract) automatically. The AL algorithm is based on 120 rules derived from an analysis of journal page layouts and from features extracted from OCR output. Experiments carried out on more than 11,000 articles in over 1,000 biomedical journals show that the accuracy of this rule-based algorithm exceeds 96%.
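The rule-based labeling idea can be sketched with a few hypothetical rules. The feature names, thresholds and rule order below are invented for illustration; the actual AL module applies roughly 120 rules derived from real journal layouts and OCR features.

```python
# Each zone carries geometry/typography features from OCR output
# ("top" is the zone's vertical position as a fraction of page height).
def label_zone(zone: dict) -> str:
    """Assign a bibliographic label to a page zone via ordered rules."""
    if zone["top"] < 0.15 and zone["font_size"] >= 14:
        return "title"        # large type near the top of the page
    if 0.10 <= zone["top"] < 0.30 and zone["italic"]:
        return "affiliation"  # italicized block under the byline
    if 0.10 <= zone["top"] < 0.30:
        return "author"       # remaining upper-page text block
    if zone["word_count"] > 80:
        return "abstract"     # long text block below the header area
    return "other"

zones = [
    {"top": 0.05, "font_size": 18, "italic": False, "word_count": 9},
    {"top": 0.15, "font_size": 10, "italic": False, "word_count": 4},
    {"top": 0.20, "font_size": 9,  "italic": True,  "word_count": 6},
    {"top": 0.35, "font_size": 9,  "italic": False, "word_count": 150},
]
print([label_zone(z) for z in zones])
# ['title', 'author', 'affiliation', 'abstract']
```

Ordering matters: more specific rules (italic affiliation) must fire before the broader fallback (author) over the same page region.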