    Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation

    Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, the receiver can open and view it using any of a wide range of PDF viewing applications, such as Adobe Reader and Apple Preview. Hence, the use of PDF has become pervasive. Since a scanned PDF is an image format, it is inaccessible to assistive technologies such as screen readers. Retrieving the information therefore requires Optical Character Recognition (OCR). OCR software processes the scanned PDF file and, through text extraction, generates an editable text document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF, using the open source document analysis and OCR system OCRopus together with geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order and the representation of common logical structure elements such as section headings, line breaks, paragraphs, captions, sidebars, foot-bars, running headers, embedded images, graphics, tables and mathematical expressions.

    Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

    In this paper, we show that the security threats posed by existing GPU memory management strategies have been overlooked. These strategies open a back door through which adversaries can freely break memory isolation: an adversary without any privilege on a computer can directly recover the raw memory data left behind by previous processes. More importantly, such attacks work not only on ordinary multi-user operating systems but also on cloud computing platforms. To demonstrate the seriousness of such attacks, we recovered original data directly from the GPU memory residues left by exited commodity applications, including Google Chrome, Adobe Reader, GIMP and Matlab. The results show that, because of the vulnerable memory management strategy, all commodity applications in our experiments are affected.

    Accessing Textual Information Embedded in Internet Images

    Indexing and searching of WWW pages relies on analysing text. Current technology cannot process text embedded in images on WWW pages. This paper argues that this is a significant problem, as text in image form is usually semantically important (e.g. headers, titles). The results of a recent study are presented to show that the majority (76%) of words embedded in images do not appear elsewhere in the main text and that the majority (56%) of ALT tag descriptions of images are incorrect or do not exist at all. Research under way to devise tools that extract text from images, based on the way humans perceive colour differences, is outlined and results are presented.

    PDF/A standard for long term archiving

    PDF/A is defined by ISO 19005-1 as a file format based on the PDF format. The standard provides a mechanism for representing electronic documents in a way that preserves their visual appearance over time, independent of the tools and systems used for creating or storing the files. Comment: 8 pages, presented at the 5th International Conference "Actualities and Perspectives on Hardware and Software" (APHS2009), Timisoara, Romania.