
    Extracting text from PostScript

    We show how to extract plain text from PostScript files. A textual scan is inadequate because PostScript interpreters can generate characters on the page that do not appear in the source file. Furthermore, word and line breaks are implicit in the graphical rendition, and must be inferred from the positioning of word fragments. We present a robust technique for extracting text and recognizing words and paragraphs. The method uses a standard PostScript interpreter but redefines several PostScript operators, and simple heuristics are employed to locate word and line breaks. The scheme has been used to create a full-text index, and plain-text versions, of 40,000 technical reports (34 Gbyte of PostScript). Other text-extraction systems are reviewed: none offer the same combination of robustness and simplicity.
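    A minimal sketch (not the authors' code) of the kind of heuristic the abstract describes: character fragments captured from redefined text-showing operators carry page coordinates, and are grouped into lines by baseline position and into words by horizontal gaps. The fragment tuple shape and both thresholds are illustrative assumptions.

    ```python
    def assemble_text(fragments, line_tol=2.0, word_gap=3.0):
        """fragments: list of (x, y, width, text) tuples from the interpreter."""
        lines = {}
        for x, y, w, s in fragments:
            # Bucket fragments into lines by rounded baseline position.
            key = round(y / line_tol)
            lines.setdefault(key, []).append((x, w, s))
        out = []
        for key in sorted(lines, reverse=True):  # top of page first
            frags = sorted(lines[key])
            words, prev_end = [], None
            for x, w, s in frags:
                if prev_end is not None and x - prev_end > word_gap:
                    words.append(" ")  # gap wide enough to be a word break
                words.append(s)
                prev_end = x + w
            out.append("".join(words))
        return "\n".join(out)
    ```

    Fragments such as "Hel" and "lo" placed back to back are joined into one word, while a wider gap yields a space; paragraph detection would sit on top of this, using vertical gaps between lines.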

    Non-perturbative calculation of the probability distribution of plane-wave transmission through a disordered waveguide

    A non-perturbative random-matrix theory is applied to the transmission of a monochromatic scalar wave through a disordered waveguide. The probability distributions of the transmittances $T_{mn}$ and $T_n = \sum_m T_{mn}$ of an incident mode $n$ are calculated in the thick-waveguide limit, for broken time-reversal symmetry. A crossover occurs from Rayleigh or Gaussian statistics in the diffusive regime to lognormal statistics in the localized regime. A qualitatively different crossover occurs if the disordered region is replaced by a chaotic cavity. Submitted to Physical Review E. Comment: 7 pages, REVTeX-3.0, 5 postscript figures appended as self-extracting archive. A complete postscript file with figures and text (4 pages) is available from http://rulgm4.LeidenUniv.nl/preprints.htm
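    For context, the textbook limiting forms of the statistics named in the abstract (standard results, not taken from this paper's calculation) can be written as:

    ```latex
    % Diffusive regime: Rayleigh law for the speckle transmittance T_{mn}
    P(T_{mn}) = \frac{1}{\langle T_{mn}\rangle}\,
                \exp\!\left(-\frac{T_{mn}}{\langle T_{mn}\rangle}\right),
    % Localized regime: lognormal statistics,
    P(\ln T) \propto \exp\!\left[-\frac{\bigl(\ln T - \langle \ln T\rangle\bigr)^{2}}
                                       {2\,\mathrm{var}(\ln T)}\right].
    ```

    The paper's contribution is the non-perturbative crossover between these limits, which the limiting forms alone do not capture.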

    Extraction of the $\pi NN$ coupling constant from NN scattering data

    We reexamine Chew's method for extracting the $\pi NN$ coupling constant from np differential cross section measurements. Values for this coupling are extracted below 350 MeV, in the potential-model region, and up to 1 GeV. The analyses to 1 GeV have utilized 55 data sets. We compare these results to those obtained via $\chi^2$ mapping techniques. We find that these two methods give consistent results, which are in agreement with previous Nijmegen determinations. Comment: 12 pages of text plus 2 figures. Revtex file and postscript figures available via anonymous FTP at ftp://clsaid.phys.vt.edu/pub/n

    Mapping and Displaying Structural Transformations between XML and PDF

    Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving. Until recently, PDF was a totally display-based document representation, relying on PDF's underlying PostScript semantics. Early versions of PDF had no mechanism for retaining any form of abstract document structure, but recent releases have introduced an internal structure tree to create the so-called 'Tagged PDF'. This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. One window shows an XML document original; the other shows its Tagged PDF counterpart, with an internal structure tree that, in some sense, matches the one seen in the XML. If a component is highlighted in either window, then the corresponding structured item, with any attendant text, is also highlighted in the other window. Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small-screen devices and enabling them to be read out in correct reading order, via speech-synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination, one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document.
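    A hypothetical sketch of the node pairing the two-window display implies: walk the XML structure tree and the Tagged-PDF structure tree in parallel and record corresponding nodes, so that highlighting an item on one side can select its counterpart on the other. The node representation and the tag names are assumptions, not the plugin's actual data model.

    ```python
    def pair_trees(xml_node, pdf_node, mapping=None):
        """Each node is (tag, [children]); returns a list of (xml_tag, pdf_tag) pairs."""
        if mapping is None:
            mapping = []
        mapping.append((xml_node[0], pdf_node[0]))
        # Pair children positionally; a real matcher would have to
        # tolerate missing, merged, or reordered structure elements.
        for xc, pc in zip(xml_node[1], pdf_node[1]):
            pair_trees(xc, pc, mapping)
        return mapping
    ```

    The resulting pair list is exactly the lookup table a cross-highlighting UI needs, in both directions.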

    Copyright protection for the electronic distribution of text documents

    Each copy of a text document can be made different in a nearly invisible way by repositioning or modifying the appearance of different elements of text, i.e., lines, words, or characters. A unique copy can be registered with its recipient, so that subsequent unauthorized copies that are retrieved can be traced back to the original owner. In this paper we describe and compare several mechanisms for marking documents, and several other mechanisms for decoding the marks after documents have been subjected to common types of distortion. The marks are intended to protect documents of limited value that are owned by individuals who would rather possess a legal copy than an illegal one, if the two can be distinguished. We describe attacks that remove the marks, and countermeasures to those attacks. An architecture is described for distributing a large number of copies without burdening the publisher with creating and transmitting the unique documents. The architecture also allows the publisher to determine the identity of a recipient who has illegally redistributed the document, without compromising the privacy of individuals who are not operating illegally. Two experimental systems are described. One was used to distribute an issue of the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, and the second was used to mark copies of company private memoranda.
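    An illustrative sketch (not the paper's system) of one such marking mechanism, line-shift coding: each marked line is moved up or down by a tiny offset to encode one bit, with its unshifted neighbours serving as reference anchors; decoding compares measured baseline positions against the original. The offset size and the every-other-line scheme are assumptions.

    ```python
    def encode(baselines, bits, shift=0.5):
        """Shift every other line up (bit 0) or down (bit 1) by `shift` points."""
        out = list(baselines)
        for i, b in enumerate(bits):
            line = 2 * i + 1            # mark odd lines; even lines stay as anchors
            out[line] += shift if b else -shift
        return out

    def decode(marked, original):
        """Recover the bits by comparing each marked line to its original position."""
        bits = []
        for i in range(len(marked)):
            if i % 2 == 1:
                bits.append(1 if marked[i] > original[i] else 0)
        return bits
    ```

    Sub-point shifts of this kind survive printing and photocopying far better than character-level marks, which is why the paper compares several mechanisms rather than relying on one.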

    Quark propagator in a covariant gauge

    Using mean-field improved gauge field configurations, we compare the results obtained for the quark propagator from Wilson fermions and overlap fermions on a \3 lattice at a spacing of $a = 0.125(2)$ fm. Comment: 5 pages, 8 figures, talk given by F.D.R. Bonnet at the LHP 2001 workshop, Cairns, Australia

    Automating Metadata Extraction: Genre Classification

    A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF ([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at the documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.

    Methodologies for the Automatic Location of Academic and Educational Texts on the Internet

    Traditionally, online databases of web resources have been compiled by a human editor, or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis. This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, this paper looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data is presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.