7 research outputs found

    Logical segmentation for article extraction in digitized old newspapers

    Full text link
    Newspapers are documents made of news items and informative articles. They are not meant to be read sequentially: the reader can pick items in any order he fancies. Ignoring this structural property, most digitized newspaper archives only offer access to their content by issue or, at best, by page. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows indexing and retrieval of information at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level through a machine-learning-based method, then the logical structure of the page is built up from there by detecting structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated with the ALTO file produced by the system, which includes the OCRed text. Our front-end system provides high-definition web visualisation of images, textual indexing and retrieval facilities, and searching and reading at the article level. Article transcriptions can be collaboratively corrected, which in turn allows for better indexing. We are currently testing our system on the archives of the Journal de Rouen, one of France's oldest local newspapers. These 250 years of publication amount to 300 000 pages of highly variable image quality and layout complexity. The test year 1808 can be consulted at plair.univ-rouen.fr. Comment: ACM Document Engineering, France (2012)

    Segmentation logique d'images de journaux anciens

    No full text
    Old newspapers are rich and complex documents, a trove of information for the reader and a challenge for the document analysis research community. Their complex structure calls for advanced techniques to make better use of their documentary value. Beyond the many degradations and deformations of the physical medium, these documents show great variability in layout. We attempt to address these difficulties by presenting in this article a method for segmenting articles in old newspapers. The task is carried out using a Conditional Random Field model that labels the regions of interest with a logical attribute. These elements of interest are then analysed to determine the structure and logical order of the articles. The method relies on generating an inter-article separation grid that is applied to the document recursively, which makes it possible to handle any type of layout. The results are evaluated on a set of images from the Journal de Rouen collection. The method is integrated into a processing chain that can handle large quantities of documents and generate digital objects in METS/ALTO format describing their physical content and logical organisation. We thus hope to open up new ways of browsing old newspaper corpora.

    The Physical and Chemical Properties of Quinoline

    No full text

    Recent Work in the Field of High Pressures

    No full text

    Alkylquinolines and Arylquinolines

    No full text

    Bibliography

    No full text