15 research outputs found

    JBIG2 Supported by OCR

    Digital mathematical libraries contain a large volume of PDF documents containing scanned text. In this paper, we describe how these documents can be compressed and thus delivered to users more efficiently. We introduce the JBIG2 standard for compressing bitonal images such as scanned text, and we discuss the issues that arise when OCR is used to improve the compression ratio of the open-source jbig2enc encoder. For this purpose, we have designed an API for using OCR in jbig2enc, which we describe in this paper together with the results achieved so far.

    EFFICIENT IMAGE COMPRESSION AND DECOMPRESSION ALGORITHMS FOR OCR SYSTEMS

    This paper presents efficient new image compression and decompression methods for document images, intended for use in the pre-processing stage of an OCR system designed for the needs of the “Nikola Tesla Museum” in Belgrade. The proposed compression methods exploit the Run-Length Encoding (RLE) algorithm and an algorithm based on document character contour extraction, while an iterative scanline fill algorithm is used for image decompression. The methods are compared with the JBIG2 and JPEG2000 image compression standards. Segmentation accuracy results for ground-truth documents are obtained in order to evaluate the proposed methods. Results show that the proposed methods outperform JBIG2 compression in time complexity, providing up to 25 times lower processing time at the expense of a worse compression ratio. They also outperform the JPEG2000 image compression standard, providing up to a 4-fold improvement in compression ratio. Finally, time complexity results show that the presented methods are sufficiently fast for a real-time character segmentation system.
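
The abstract above names Run-Length Encoding as one building block. As a rough illustration only, and not the paper's actual method (which also involves contour extraction and scanline fill), a generic RLE of one bitonal scanline might look like the sketch below. The function names `rle_encode_row` and `rle_decode_row` and the `(first_value, run_lengths)` representation are assumptions made for this example.

```python
from itertools import groupby


def rle_encode_row(row):
    """Encode one scanline of 0/1 pixels as (first_value, run_lengths).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # groupby yields maximal runs of equal pixels; count each run's length.
    runs = [sum(1 for _ in group) for _, group in groupby(row)]
    first = row[0] if row else 0
    return first, runs


def rle_decode_row(first, runs):
    """Rebuild the scanline from its first pixel value and run lengths."""
    row = []
    value = first
    for length in runs:
        row.extend([value] * length)
        value = 1 - value  # runs alternate between 0 and 1 in a bitonal row
    return row


if __name__ == "__main__":
    scanline = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
    first, runs = rle_encode_row(scanline)
    print(first, runs)
    assert rle_decode_row(first, runs) == scanline
```

Storing only the first pixel value and the run lengths works because runs in a bitonal row necessarily alternate between the two values.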

    Test Segmentation of MRC Document Compression and Decompression by Using MATLAB

    The mixed raster content (MRC) standard specifies a framework for document compression which can dramatically improve the compression/quality tradeoff compared to traditional lossy image compression algorithms. The key to MRC compression is the separation of the document into foreground and background layers, represented as a binary mask. Therefore, the resulting quality and compression ratio of an MRC document encoder is highly dependent upon the segmentation algorithm used to compute the binary mask. In this paper, we propose a novel multiscale segmentation scheme for MRC document encoding based on the sequential application of two algorithms. The first algorithm, cost optimized segmentation (COS), is a blockwise segmentation algorithm formulated in a global cost optimization framework. The second algorithm, connected component classification (CCC), refines the initial segmentation by classifying feature vectors of connected components using a Markov random field (MRF) model. The combined COS/CCC segmentation algorithms are then incorporated into a multiscale framework in order to improve the segmentation accuracy of text with varying size.
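
To make the role of the binary mask concrete, the sketch below shows how a decoder could recombine the layers once they are available: each output pixel comes from the foreground layer where the mask is 1 and from the background layer elsewhere. This is a minimal sketch under assumptions: the function name `mrc_compose` is hypothetical, all three layers are assumed to share the same resolution, and the COS/CCC segmentation that produces the mask is not shown.

```python
def mrc_compose(mask, foreground, background):
    """Recombine MRC layers: take foreground where mask is 1, else background.

    Hypothetical helper; all three inputs are lists of rows of equal size.
    """
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]


if __name__ == "__main__":
    mask = [[1, 0], [0, 1]]
    foreground = [[10, 10], [10, 10]]      # e.g. dark text colour
    background = [[200, 210], [220, 230]]  # e.g. paper or picture pixels
    print(mrc_compose(mask, foreground, background))
```

Because the mask alone decides which layer each pixel is drawn from, a segmentation error sends text pixels to the background layer (or the reverse), which is why the abstract ties quality and compression ratio to the segmentation algorithm.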

    CONTENT BASED INFORMATION RETRIEVAL FOR DIGITAL LIBRARY USING DOCUMENT IMAGE

    In recent years, the use of mobile devices has created an emerging need to improve the search experience of digital libraries, with applications such as education, location search, and product retrieval. Existing approaches simply compare the query with the database images and retrieve those that match; search and response time remain challenging issues in mobile document search. Much previous work on search engines retrieves documents from the database without analyzing the image. The proposed method is a mobile document information retrieval framework that handles image-based queries automatically and uses FP-growth to find frequent patterns in the retrieved documents in order to optimize the results.

    Method for Effective PDF Files Manipulation Detection

    The aim of this thesis is to ease the process of detecting manipulations in PDF files by examining the file's source code before resorting to other methods such as image processing or text-line examination. It is intended as a preliminary step that can save examiners a lot of time and provide them with more proof of manipulation of digital file evidence. The result is a solid and effective method for investigating and analysing PDF files to determine their integrity. To achieve this goal, the anatomy of the PDF file is studied first, in order to become familiar with the structure and composition of the format. Afterwards, a series of manipulations performed directly on the file source code explores its secrets and vulnerabilities and helps set the foundations for the method. Finally, a study of the most common types of file manipulation leads to a set of layouts against which the files under investigation are compared to test their veracity, complemented by a search for specialised open-source tools for this task. A set of validation experiments completes the work, evaluating the results obtained and stating future lines of work in this field.

    Scanned Document Compression Technique

    These days, different media files are used to communicate information: text files, images, audio, video, and so on. All these media files require a substantial amount of space when they are transferred. A typical five-page text document occupies 75 KB, whereas a single image can take up around 1.4 MB. In our paper, the main focus is on two compression techniques, the DjVu compression technique and the Block-based Hybrid Video Codec, and we concentrate mainly on DjVu. DjVu is an image compression technique specifically geared towards the compression of scanned documents in color at high resolution. Typical magazine pages in color scanned at 300 dpi are compressed to between 40 and 80 KB, or 5 to 10 times smaller than with JPEG for a similar level of subjective quality. The foreground layer, which contains the text and drawings and requires high spatial resolution, is separated from the background layer, which contains pictures and backgrounds and requires less resolution. The foreground is compressed with a bi-tonal image compression technique that takes advantage of character shape similarities. The background is compressed with a new progressive, wavelet-based compression method. A real-time, memory-efficient version of the decoder is available as a plug-in for popular web browsers. We also demonstrate that the proposed segmentation algorithm can improve the quality of decoded documents while simultaneously lowering the bit rate.

    From Pixels and Minds to the Mathematical Knowledge in a Digital Library

    Experience in setting up a workflow from scanned images of mathematical papers to a fully fledged mathematical library is described, using the Czech Digital Mathematics Library (DML-CZ) project as an example. An overview of the whole process is given, with a description of all main production steps. DML-CZ has recently been launched to the public with more than 100,000 digitized pages.

    Compression Of 2-Tone Manuscript For Multimedia Application [QA76.9.D33 B171 2008 f rb].

    Malaysia, like any other country, has old and rare documents that depict its history and culture.