    Text Line Segmentation of Historical Documents: a Survey

    There is a vast number of historical documents in libraries and in various national archives that have not yet been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest.
    Comment: 25 pages, submitted version; to appear in the International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3
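    A classic baseline for this segmentation step, which surveys of the area typically cover, is horizontal projection-profile analysis. The sketch below is illustrative only (it is not taken from the paper, and the min_gap parameter is an assumption); it presumes a deskewed, binarized page with text pixels equal to 1:

```python
import numpy as np

def segment_lines(binary_page: np.ndarray, min_gap: int = 5):
    """Split a binarized page (text pixels = 1) into text line bands
    using a horizontal projection profile. A classic baseline; real
    historical documents need smoothing and skew handling on top."""
    profile = binary_page.sum(axis=1)           # ink count per image row
    is_text = profile > 0                       # rows containing any ink
    lines, start, gap = [], None, 0
    for y, text_row in enumerate(is_text):
        if text_row:
            if start is None:
                start = y                       # a new line band begins
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:                  # a wide enough gap ends the band
                lines.append((start, y - gap + 1))
                start, gap = None, 0
    if start is not None:                       # close a band running to the edge
        lines.append((start, len(is_text)))
    return lines                                # list of (top, bottom) row bands
```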

    A literature survey of methods for analysis of subjective language

    Subjective language is used to express attitudes and opinions towards things, ideas and people. While content- and topic-centred natural language processing is now part of everyday life, analysis of the subjective aspects of natural language has until recently been largely neglected by the research community. The explosive growth of personal blogs, consumer opinion sites and social network applications in recent years has, however, created increased interest in subjective language analysis. This paper provides an overview of recent research conducted in the area.

    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding represent a stringent necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, putting special focus on generality, computational efficiency and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot-metal typeset prints, a theoretically optimal solution to the document binarization problem from both a computational-complexity and a threshold-selection point of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
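    As an illustration of the threshold-selection view of binarization mentioned above, the following sketch implements Otsu's classical method, which exhaustively maximizes the between-class variance over all candidate thresholds. This is a standard baseline shown for orientation only, not the paper's own binarization algorithm:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Classical Otsu threshold selection for an 8-bit grayscale image:
    pick the threshold maximizing the between-class variance. Shown as
    a well-known baseline, not the paper's method."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()                    # gray-level probabilities
    means = np.cumsum(prob * np.arange(256))    # cumulative first moment
    weights = np.cumsum(prob)                   # cumulative class probability
    mu_total = means[-1]
    best_t, best_var = 0, -1.0
    for t in range(1, 256):                     # class 0 = pixels < t
        w0 = weights[t - 1]
        w1 = 1.0 - w0
        if w0 == 0.0 or w1 == 0.0:
            continue                            # one class empty: skip
        mu0 = means[t - 1] / w0
        mu1 = (mu_total - means[t - 1]) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t                               # pixels < best_t are foreground
```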

    Text Line Extraction in Handwritten Document with Kalman Filter Applied on Low Resolution Image

    In this paper we present a method to extract text lines in handwritten documents. Line extraction is an important first step in document structure recognition. Our method is based on a notion of perceptive vision: at a certain distance, the text lines of a document can be seen as line segments. We therefore propose to detect text lines by applying a line segment extractor to low-resolution images. We present our extractor, which is based on the theory of Kalman filtering. Our method makes it possible to deal with difficulties met in ancient damaged documents, such as skew, curved lines and overlapping text lines. We present results on archive documents from the 18th and 19th centuries.
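    To make the idea concrete, here is a minimal sketch of Kalman-filter-based line tracking on a low-resolution binary image: the state holds the line's vertical position and slope, the prediction advances one column at a time, and the correction uses the centroid of the ink found near the predicted position. The noise levels and the search window are illustrative assumptions, not the paper's exact filter:

```python
import numpy as np

def track_line(binary: np.ndarray, y0: float, window: int = 10):
    """Track one text line across a low-resolution binary image with a
    small Kalman filter. A sketch of the idea only."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])      # transition: y += slope per column
    H = np.array([[1.0, 0.0]])                  # we observe y only
    Q = np.diag([0.1, 0.01])                    # process noise (assumed)
    R = 4.0                                     # measurement noise (assumed)
    x = np.array([y0, 0.0])                     # initial state (position, slope)
    P = np.eye(2)                               # initial covariance
    path = []
    for col in range(binary.shape[1]):
        x = F @ x                               # predict
        P = F @ P @ F.T + Q
        center = int(round(x[0]))               # search for ink near prediction
        lo = max(0, center - window)
        hi = max(lo, min(binary.shape[0], center + window + 1))
        ys = np.nonzero(binary[lo:hi, col])[0]
        if ys.size:                             # correct only where ink exists
            z = lo + ys.mean()                  # ink centroid in the window
            S = (H @ P @ H.T).item() + R        # innovation variance
            K = (P @ H.T).ravel() / S           # Kalman gain, shape (2,)
            x = x + K * (z - x[0])
            P = (np.eye(2) - np.outer(K, H)) @ P
        path.append(x[0])
    return np.array(path)                       # estimated line ordinate per column
```

    Skipping the correction step in ink-free columns is what lets such a tracker coast across gaps and then recover, which is how overlapping and interrupted lines can be followed.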

    Structuring visual exploratory analysis of skill demand

    The analysis of increasingly large and diverse data for meaningful interpretation and question answering is handicapped by human cognitive limitations. Consequently, semi-automatic abstraction of complex data within structured information spaces becomes increasingly important if its knowledge content is to support intuitive, exploratory discovery. Exploration of skill demand is an area where regularly updated, multi-dimensional data may be exploited to assess capability within the workforce to manage the demands of the modern, technology- and data-driven economy. The knowledge derived may be employed by skilled practitioners in defining career pathways, to identify where, when and how to update their skillsets in line with advancing technology and changing work demands. This same knowledge may also be used to identify the combination of skills essential in recruiting for new roles. To address the challenges inherent in exploring the complex, heterogeneous, dynamic data that feeds into such applications, we investigate the use of an ontology to guide structuring of the information space, allowing individuals and institutions to interactively explore and interpret the dynamic skill demand landscape for their specific needs. As a test case we consider the relatively new and highly dynamic field of Data Science, where insightful, exploratory data analysis and knowledge discovery are critical. We employ context-driven and task-centred scenarios to explore our research questions and guide iterative design, development and formative evaluation of our ontology-driven, visual exploratory discovery and analysis approach, to measure where it adds value to users' analytical activity. Our findings reinforce the potential of our approach and point to future paths to build on.

    State space collapse and diffusion approximation for a network operating under a fair bandwidth sharing policy

    We consider a connection-level model of Internet congestion control, introduced by Massoulié and Roberts [Telecommunication Systems 15 (2000) 185-201], that represents the randomly varying number of flows present in a network. Here, bandwidth is shared fairly among elastic document transfers according to a weighted α-fair bandwidth sharing policy introduced by Mo and Walrand [IEEE/ACM Transactions on Networking 8 (2000) 556-567], with α ∈ (0, ∞). Assuming Poisson arrivals and exponentially distributed document sizes, we focus on the heavy traffic regime in which the average load placed on each resource is approximately equal to its capacity. A fluid model (or functional law of large numbers approximation) for this stochastic model was derived and analyzed in prior work [Ann. Appl. Probab. 14 (2004) 1055-1083] by two of the authors. Here, we use the long-time behavior of the solutions of the fluid model established in that paper to derive a property called multiplicative state space collapse, which, loosely speaking, shows that in diffusion scale, the flow count process for the stochastic model can be approximately recovered as a continuous lifting of the workload process.
    Comment: Published at http://dx.doi.org/10.1214/08-AAP591 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org)
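    For reference, the weighted α-fair allocation of Mo and Walrand can be stated as follows, in the notation standard for this model (not copied from the paper): with n_i flows on route i, each such flow receives bandwidth x_i solving

```latex
% Weighted alpha-fair bandwidth sharing (Mo--Walrand), alpha != 1:
\[
  \max_{x \ge 0} \; \sum_i w_i\, n_i\, \frac{x_i^{\,1-\alpha}}{1-\alpha}
  \quad \text{subject to} \quad
  \sum_{i:\, j \in \mathrm{route}(i)} n_i\, x_i \;\le\; C_j
  \quad \text{for every resource } j,
\]
% with the objective replaced by \sum_i w_i n_i \log x_i when \alpha = 1.
% Limiting cases: \alpha \to 0 maximizes total throughput, \alpha = 1
% gives proportional fairness, and \alpha \to \infty approaches
% max-min fairness.
```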

    Extraction of Scores and Average From Algerian High-School Degree Transcripts

    A system for extracting scores and averages from Algerian High School Degree Transcripts is proposed. The system extracts the scores and the average by localizing the tables that gather this information, and it consists of several stages. After preprocessing, the system locates the tables using ruling-line information as well as other textual information; the adopted localization approach can therefore work even when certain ruling lines are absent, erased or discontinuous. The localized tables are then segmented into columns, and the columns into information cells. Finally, cell labeling is done based on prior knowledge of the table structure, making it possible to identify the scores and the average. Experiments have been conducted on a local dataset in order to evaluate the performance of our system and compare it with three public systems at three levels, and the obtained results show the effectiveness of our system.
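    A common way to realize the ruling-line stage of such a pipeline is morphological extraction of long horizontal and vertical strokes, e.g. with OpenCV. The sketch below shows that generic baseline only; the kernel lengths and the area threshold are assumptions, and the paper's system additionally exploits textual information precisely so as to survive missing or broken ruling lines:

```python
import cv2
import numpy as np

def locate_tables(gray: np.ndarray, min_area: int = 5000):
    """Locate candidate table regions from ruling lines: binarize,
    extract long horizontal and vertical strokes with morphological
    opening, merge them into a grid mask, and return bounding boxes of
    the large connected regions. A generic baseline, not the paper's
    full localization method."""
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    grid = cv2.dilate(h_lines | v_lines,        # join strokes into one mask
                      np.ones((3, 3), np.uint8))
    contours, _ = cv2.findContours(grid, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= min_area]  # (x, y, w, h) per table
```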