Normalization of Image Display and Layout Using an Adaptive Splitter Algorithm
Abstract: When large document images are displayed on a monitor over the World Wide Web, their RGB resolution degrades and read/write times become very long. This research develops a system that automatically splits an image using an adaptive hyperdocument method. The resulting image tiles can be accessed quickly over the web, since the data is delivered in formats such as HTML and SGML/XML. Nevertheless, little work has been done on converting documents into hyperdocuments, and most of the images processed contain both text and graphical objects. The adaptive method converts complex multi-column document images into HTML documents, and an adaptive normalization method generates a structured table of contents page, based on logical-structure analysis of the document image. Experiments with a variety of multi-column document images show that, using the proposed method, with splitter tile sizes from 8x8 and 16x16 up to 200x200, HTML documents can be produced whose visual layout matches that of the source document image, displayed in 0.1-0.17 seconds on average. A structured table of contents page can also be produced, with hierarchically ordered section headings hyperlinked to the content.

Keywords: image splitter, hyperdocument, multi-column documents, document conversion, splitter zoom, logical document structure analysis
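The tiling step described above (splitting a large page image into fixed-size blocks and reassembling them as an HTML table so the visual layout is preserved) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and tile file-naming scheme are assumptions:

```python
def tile_grid(width, height, tile_w, tile_h):
    """Return (left, top, right, bottom) boxes covering the whole image.

    Edge tiles are clipped to the image bounds, so any tile size
    (8x8, 16x16, ..., 200x200) covers the page without overlap.
    """
    boxes = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            boxes.append((left, top,
                          min(left + tile_w, width),
                          min(top + tile_h, height)))
    return boxes


def tiles_to_html(boxes, cols, name="page"):
    """Emit an HTML table whose cells reference the tile images.

    Zero spacing/padding keeps the reassembled tiles visually
    identical to the original page layout.
    """
    rows = [boxes[i:i + cols] for i in range(0, len(boxes), cols)]
    cells = "\n".join(
        "<tr>" + "".join(f'<td><img src="{name}_{r}_{c}.png"></td>'
                         for c in range(len(row))) + "</tr>"
        for r, row in enumerate(rows))
    return f'<table cellspacing="0" cellpadding="0">\n{cells}\n</table>'
```

In a full system, each box would be cropped from the source image and saved under the corresponding file name before the HTML is served.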
Eyes Wide Open: an interactive learning method for the design of rule-based systems
We present in this paper a new general method, the Eyes Wide Open (EWO) method, for the design of rule-based document recognition systems. Our contribution is to introduce a learning procedure, based on machine learning techniques and carried out in interaction with the user, to design the recognition system. Therefore, unlike many approaches that are manually designed, ours can easily adapt to a new type of document while taking advantage of the expressiveness of rule-based systems and their ability to convey the hierarchical structure of a document. The EWO method is independent of any existing recognition system. An automatic analysis of an annotated corpus, guided by the user, is performed to help adapt the recognition system to a new kind of document; the user then brings sense to the automatically extracted information. In this paper, we validate EWO by producing two rule-based systems: one for the Maurdor international competition, on a heterogeneous corpus containing handwritten and printed documents written in different languages, and another for the RIMES competition corpus, a homogeneous corpus of French handwritten business letters. On the RIMES corpus, our method allows an assisted design of a grammatical description that gives better results than all previously proposed statistical systems.
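The idea of replacing a hand-tuned parameter in a recognition rule with one learned from a user-annotated corpus can be illustrated with a minimal sketch. The heading/body rule, the line-height feature, and the learning step below are all illustrative assumptions, far simpler than EWO's grammatical descriptions:

```python
def learn_threshold(annotated):
    """Corpus analysis stand-in: pick the midpoint between the mean
    line heights of the two annotated classes as a decision threshold."""
    headings = [h for h, label in annotated if label == "heading"]
    body = [h for h, label in annotated if label == "body"]
    return (sum(headings) / len(headings) + sum(body) / len(body)) / 2


def classify(line_height, threshold):
    """Hand-written rule that uses the learned threshold: tall lines
    are treated as headings, the rest as body text."""
    return "heading" if line_height > threshold else "body"
```

The rule itself stays human-readable and editable; only its numeric parameter is fitted to the annotated examples, which is the spirit of combining rule-based expressiveness with corpus-driven adaptation.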
Graphical tools for ground truth generation in HTR tasks
This report covers the development of several graphical tools for ground truth generation in HTR tasks, specifically for layout analysis, line segmentation, and transcription, as well as one ad hoc tool needed for point classification in an implemented line size normalization method. It shows the design process behind the tools, giving an overview of their internal structure through class diagrams. It also explains the mentioned phases of the HTR process with the aim of clarifying each tool's context and utility. Finally, the report closes with brief conclusions and considerations about the future of the tools.

Martínez Vargas, J. (2014). Graphical tools for ground truth generation in HTR tasks. http://hdl.handle.net/10251/36156
Semantic framework for regulatory compliance support
Regulatory Compliance Management (RCM) is a management process which an organisation implements to conform to regulatory guidelines. Two processes that contribute towards automating RCM are: (i) extraction of meaningful entities from the regulatory text and (ii) mapping of regulatory guidelines to organisational processes. These processes help keep the RCM up to date with changes in the regulatory guidelines. The update process is still largely manual, since comparatively little research has addressed this direction. Semantic Web technologies are potential candidates for making the update process automatic. There are stand-alone frameworks that use Semantic Web techniques such as information extraction, ontology population, similarity measures and ontology mapping; however, the integration of these approaches into semantic compliance management has not yet been explored. Considering these two processes as crucial constituents, the aim of this thesis is to automate the processes of RCM. It proposes a framework called RegCMantic.
The proposed framework is designed and developed in two main phases. The first part of the framework extracts regulatory entities from the regulatory guidelines. Extracting meaningful entities from the guidelines helps in relating them to organisational processes. The framework identifies the document components and extracts the entities from them, using four components: (i) a parser, (ii) definition terms, (iii) ontological concepts and (iv) rules. The parser breaks a sentence down into useful segments, and the extraction is carried out by applying the definition terms, ontological concepts and rules to those segments. The extracted entities are core-entities, such as subject, action and obligation, and aux-entities, such as time, place, purpose, procedure and condition.
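A crude sketch of extracting core-entities from a regulatory sentence is shown below, assuming a toy "subject, modal, action" pattern. The single regular expression is an illustrative stand-in; the actual framework combines a parser, definition terms, ontological concepts and rules:

```python
import re

# Modal verbs signalling an obligation (illustrative list).
MODALS = r"(shall|must|should)"


def extract_entities(sentence):
    """Toy pattern <subject> <modal> <action>: split one regulatory
    sentence into subject, obligation and action core-entities."""
    m = re.match(
        rf"(?P<subject>.+?)\s+(?P<obligation>{MODALS})\s+(?P<action>.+)",
        sentence.rstrip("."))
    if not m:
        return None  # sentence does not state an obligation
    return {"subject": m.group("subject"),
            "obligation": m.group("obligation"),
            "action": m.group("action")}
```

Aux-entities such as time or condition would need further patterns (e.g. for "within 24 hours" or "if ... then" clauses) layered on top of this core split.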
The second part of the framework relates the regulatory guidelines with organisational
processes. The proposed framework uses a mapping algorithm, which considers three types of
entities in the regulatory domain and two types of entities in the process domain. In the regulatory domain, the considered entities are regulation-topic, core-entities and aux-entities, whereas in the process domain the considered entities are subject and action. Using these entities, it computes an aggregate of three similarity scores: topic-score, core-score and aux-score. The aggregate similarity score determines whether a regulatory guideline is related to an organisational process.
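The aggregation step can be sketched as a weighted sum of the three scores compared against a decision threshold. The weights and threshold below are illustrative assumptions, not values taken from the thesis:

```python
def aggregate_similarity(topic_score, core_score, aux_score,
                         weights=(0.4, 0.4, 0.2), threshold=0.5):
    """Combine topic-, core- and aux-scores into one aggregate score
    and decide whether guideline and process are related.

    All scores are assumed to lie in [0, 1]; weights sum to 1 so the
    aggregate stays in the same range.
    """
    w_topic, w_core, w_aux = weights
    score = w_topic * topic_score + w_core * core_score + w_aux * aux_score
    return score, score >= threshold
```

Giving aux-entities (time, place, condition, ...) a smaller weight than the topic and core matches reflects the intuition that they refine, rather than establish, the guideline-process relation; the real weighting scheme would need to come from the thesis itself.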
The RegCMantic framework is validated through the development of a prototype system. The prototype implements a case study involving regulatory guidelines governing the pharmaceutical industry in the UK. The evaluation of the case-study results has shown improved accuracy in extracting regulatory entities and in relating regulatory guidelines to organisational processes. This research has contributed methods for extracting meaningful entities from regulatory guidelines provided as unstructured text, and for semantically mapping the regulatory guidelines to organisational processes.
Contextual and assisted interpretation of digitized archival collections (application to 18th-century sales registers)
Fonds, also called historical document collections, are large quantities of digitized documents which are difficult to interpret automatically: usual approaches require a great deal of design effort, yet fail to avoid producing many errors which must be corrected after processing. To cope with these limitations, our work aimed at improving the interpretation process by making use of information extracted from the fonds, or provided by human operators, while keeping a page-by-page processing. We proposed a simple extension of the page description language which makes it possible to automatically generate information exchanges between the interpretation process and its environment. A global iterative mechanism progressively brings contextual information to this process and improves interpretation. Experiments and application of these new tools to the processing of documents from the 18th century showed that our propositions were easy to integrate into an existing system, that its design remained simple, and that the required manual corrections were reduced.
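The global iterative mechanism (re-interpreting pages as contextual information accumulates) can be sketched as a fixpoint loop. The `interpret` and `get_context` interfaces below are assumptions introduced for illustration; in the real system the context would come from the fonds itself or from human operators:

```python
def interpret_collection(pages, interpret, get_context):
    """Iterate page-by-page interpretation until results stabilise.

    interpret(page, context) -> (result, context_requests): interprets
    one page and lists the contextual information it still needs.
    get_context(request) -> value: answers a request, e.g. from
    collection-level statistics or a human operator.
    """
    context = {}
    results = {}
    changed = True
    while changed:
        changed = False
        for page in pages:
            result, requests = interpret(page, context)
            if results.get(page) != result:
                results[page] = result
                changed = True
            # Fulfil each new request so later passes can use it.
            for req in requests:
                context.setdefault(req, get_context(req))
    return results
```

Each pass keeps the page-by-page structure of the original process; only the shared `context` dictionary carries information between pages and passes, which is the role the paper assigns to the exchange mechanism.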