7 research outputs found

    Optical Character Recognition (OCR) with a Structural Method Using Point and Vector Feature Extraction

    Get PDF
    ABSTRACT: Optical Character Recognition (OCR) is a computer system that automatically recognizes characters originating from a typewriter, a printing press, or handwriting. In other words, OCR converts a text document into a computer file without re-typing or re-editing: every character, word, and sentence can be recognized precisely and read by other software. This final project develops an application to identify the characters in an image file (*.bmp or *.jpg) containing characters obtained by scanning a hardcopy document or from other sources. Feature extraction uses a vector and region approach: the vectors composing the strokes of a character are determined in each observation area (region), where each character is divided into 9 equal, symmetric regions. To evaluate the performance of OCR with this method, tests were run on several input samples, both from hardcopy documents and from other sources. The analysis shows that this OCR system achieves an accuracy of 91.88% for trained fonts and 52.26% for untrained fonts. Keywords: automatic character recognition, feature extraction, vector, region
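The 3x3 region scheme described above can be sketched in a few lines. This is an illustrative assumption, not the thesis's exact procedure: the per-region feature here is simply the fraction of ink pixels, whereas the thesis extracts stroke vectors per region.

```python
# Sketch: split a binary character bitmap into a 3x3 grid of equal regions
# and compute one simple feature per region (ink-pixel density). The feature
# choice is an assumption; the thesis extracts stroke vectors per region.

def region_features(bitmap):
    """bitmap: list of equal-length rows of 0/1 pixels (1 = ink).
    Returns 9 densities, one per region, row-major order."""
    rows, cols = len(bitmap), len(bitmap[0])
    features = []
    for ri in range(3):
        for ci in range(3):
            r0, r1 = rows * ri // 3, rows * (ri + 1) // 3
            c0, c1 = cols * ci // 3, cols * (ci + 1) // 3
            region = [bitmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            features.append(sum(region) / len(region))
    return features
```

Comparing the 9-dimensional feature vectors of an unknown character against those of trained templates then reduces recognition to nearest-neighbour matching.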

    Administrative Document Analysis and Structure

    Get PDF
    This chapter reports our knowledge about the analysis and recognition of scanned administrative documents. Because the administrative paper flow brings new and continuous arrivals, all the conventional techniques reserved for static database modeling and recognition are doomed to failure. For this purpose, a new technique based on experience was investigated, giving very promising results. This technique is related to case-based reasoning, already used in data mining and various machine-learning problems. After presenting the context of the administrative document flow and its real-time processing requirements, we present case-based reasoning for invoice processing. A case corresponds to the co-existence of a problem and its solution. The problem in an invoice corresponds to a local structure such as the keywords of an address or the line patterns in the amounts table, while the solution is related to their content. This problem is then compared to a document case base using graph probing. For this purpose, we propose an improvement of an existing neural network called the Incremental Growing Neural Gas.
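The retrieval step of case-based reasoning can be sketched minimally. This is a stand-in, not the chapter's system: the "problem" is summarised as a bag of keyword labels and graph probing is approximated by comparing label histograms, whereas the real system compares document graphs and uses an Incremental Growing Neural Gas.

```python
# Sketch: case-based retrieval. Each case pairs a "problem" (bag of labels
# standing in for a local document structure) with a "solution". Graph
# probing is approximated here by an L1 distance between label histograms.

from collections import Counter

def probe_distance(labels_a, labels_b):
    """L1 distance between label histograms (crude graph-probing stand-in)."""
    ca, cb = Counter(labels_a), Counter(labels_b)
    return sum(abs(ca[k] - cb[k]) for k in set(ca) | set(cb))

def retrieve(case_base, problem):
    """Return the solution of the stored case closest to the new problem."""
    best = min(case_base, key=lambda case: probe_distance(case[0], problem))
    return best[1]

# Hypothetical case base for illustration only.
cases = [
    (["invoice", "total", "amount"], "amount-table parser"),
    (["address", "street", "zip"], "address extractor"),
]
```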

    Semantic framework for regulatory compliance support

    Get PDF
    Regulatory Compliance Management (RCM) is a management process that an organization implements to conform to regulatory guidelines. Processes that contribute towards automating RCM include: (i) extraction of meaningful entities from regulatory text and (ii) mapping regulatory guidelines to organisational processes. These processes help in updating the RCM when regulatory guidelines change. The update process is still manual, since there is comparatively little research in this direction. Semantic Web technologies are potential candidates for making the update process automatic. There are stand-alone frameworks that use Semantic Web technologies such as information extraction, ontology population, similarity measures and ontology mapping. However, the integration of these approaches into semantic compliance management has not been explored yet. Considering these two processes as crucial constituents, the aim of this thesis is to automate the processes of RCM. It proposes a framework called RegCMantic, designed and developed in two main phases. The first part of the framework extracts regulatory entities from regulatory guidelines; extracting meaningful entities helps in relating the guidelines to organisational processes. The framework identifies the document components and extracts entities from them using four components: (i) a parser, (ii) definition terms, (iii) ontological concepts and (iv) rules. The parser breaks a sentence down into useful segments, from which the entities are extracted using the definition terms, ontological concepts and rules. The extracted entities are core entities such as subject, action and obligation, and aux entities such as time, place, purpose, procedure and condition.
    The second part of the framework relates the regulatory guidelines to organisational processes. It uses a mapping algorithm that considers three types of entities in the regulatory domain and two types in the process domain. In the regulatory domain, the considered entities are the regulation topic, core entities and aux entities; in the process domain, they are subject and action. Using these entities, it computes an aggregation of three similarity scores: a topic-score, a core-score and an aux-score. The aggregate similarity score determines whether a regulatory guideline is related to an organisational process. The RegCMantic framework is validated through a prototype system implementing a case study that involves regulatory guidelines governing the pharmaceutical industry in the UK. The evaluation of the case-study results shows improved accuracy in extracting regulatory entities and in relating regulatory guidelines to organisational processes. This research contributes to extracting meaningful entities from regulatory guidelines provided as unstructured text and to mapping those guidelines to organisational processes semantically.
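The score aggregation described above can be sketched as a weighted sum. The weights and decision threshold below are illustrative assumptions; the RegCMantic thesis defines its own aggregation and cut-off.

```python
# Sketch: combine topic-, core- and aux-similarity scores (each in [0, 1])
# into one aggregate score and a yes/no mapping decision. Weights and
# threshold are assumed values for illustration, not the thesis's.

def aggregate_score(topic, core, aux, weights=(0.3, 0.5, 0.2)):
    """Weighted aggregation of the three similarity scores."""
    wt, wc, wa = weights
    return wt * topic + wc * core + wa * aux

def is_related(topic, core, aux, threshold=0.6):
    """Decide whether a guideline maps to a process (assumed threshold)."""
    return aggregate_score(topic, core, aux) >= threshold
```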

    Graphical tools for ground truth generation in HTR tasks

    Full text link
    This report covers the development of several graphical tools for ground truth generation in HTR tasks, specifically for layout analysis, line segmentation and transcription, as well as one ad hoc tool needed for point classification in an implemented line-size normalization method. It shows the design process behind the tools, giving an overview of their internal structure through class diagrams. It also explains the mentioned phases of HTR with the aim of clarifying each tool's context and utility. Finally, the report closes with brief conclusions and considerations about the future of the tools. Martínez Vargas, J. (2014). Graphical tools for ground truth generation in HTR tasks. http://hdl.handle.net/10251/36156

    Adaptive Analysis and Processing of Structured Multilingual Documents

    Get PDF
    Digital document processing is becoming popular in applications for office and library automation, bank and postal services, publishing houses and communication management. In recent years, the demand for tools capable of searching written and spoken sources of multilingual information has increased tremendously, and the bilingual dictionary is one of the important resources for providing the required information. Processing and analysis of bilingual dictionaries brings up the challenge of dealing with many different scripts, some of which are unknown to the designer. A framework is presented to adaptively analyze and process structured multilingual documents, where adaptability is applied to every step. The proposed framework involves: (1) general word-level script identification using a Gabor filter; (2) font classification using the grating cell operator; (3) general word-level style identification using a Gaussian mixture model; (4) an adaptable Hindi OCR based on generalized Hausdorff image comparison; (5) a retargetable OCR with automatic training-sample creation and its applications to different scripts; and (6) bootstrapping entry segmentation, which segments each page into functional entries for parsing. Experimental results on different scripts, such as Chinese, Korean, Arabic, Devanagari and Khmer, demonstrate that the proposed framework can save significant human effort by making each phase adaptive.
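The Hausdorff-style glyph matching named in item (4) can be sketched on point sets. This shows the classic directed and symmetric Hausdorff distance as a minimal stand-in; the dissertation's generalized form (which ranks distances rather than taking the maximum, to tolerate noise) is not reproduced here.

```python
# Sketch: Hausdorff distance between two binary glyphs represented as sets
# of (x, y) ink-pixel coordinates. Classic form only; the generalized
# variant used by the adaptable Hindi OCR replaces max with a rank statistic.

import math

def directed_hausdorff(points_a, points_b):
    """max over a in A of the distance from a to its nearest point in B."""
    return max(
        min(math.dist(a, b) for b in points_b)
        for a in points_a
    )

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))
```

A small distance means the unknown glyph and the template nearly overlap, which is what makes the comparison usable for template-based OCR.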

    Automated labeling in document images

    No full text
    The National Library of Medicine (NLM) is developing an automated system to produce bibliographic records for its MEDLINE® database. This system, named the Medical Article Record System (MARS), employs document image analysis and understanding techniques and optical character recognition (OCR). This paper describes a key module in MARS called the Automated Labeling (AL) module, which automatically labels all zones of interest (title, author, affiliation, and abstract). The AL algorithm is based on 120 rules derived from an analysis of journal page layouts and features extracted from OCR output. Experiments carried out on more than 11,000 articles in over 1,000 biomedical journals show that the accuracy of this rule-based algorithm exceeds 96%.
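A rule-based zone labeller of the kind described above can be sketched as a cascade of feature tests. The zone features and the two toy rules below are illustrative assumptions; MARS derives its roughly 120 rules from real journal layouts and OCR output.

```python
# Sketch: toy rule-based zone labelling. Each zone carries assumed features
# ('y' position from page top, 'font_size', OCR 'text'); rules fire in
# priority order. Thresholds and rules are invented for illustration.

def label_zone(zone):
    """zone: dict with assumed keys 'y', 'font_size', 'text'."""
    if zone["y"] < 100 and zone["font_size"] >= 14:
        return "title"          # large text near the top of the page
    if "abstract" in zone["text"].lower():
        return "abstract"       # keyword cue from OCR output
    if zone["y"] < 200:
        return "author"         # remaining text in the header area
    return "other"
```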

    Automated labeling in document images

    No full text