
    Logical segmentation for article extraction in digitized old newspapers

    Newspapers are documents made of news items and informative articles. They are not meant to be read iteratively: the reader can pick items in any order he fancies. Ignoring this structural property, most digitized newspaper archives only offer access to their content by issue or, at best, by page. We have built a digitization workflow that automatically extracts newspaper articles from images, which allows indexing and retrieval of information at the article level. Our back-end system extracts the logical structure of the page to produce the informative units: the articles. Each image is labelled at the pixel level through a machine-learning-based method, then the logical structure of the page is built up from there by detecting structuring entities such as horizontal and vertical separators, titles and text lines. This logical structure is stored in a METS wrapper associated with the ALTO file produced by the system, which includes the OCRed text. Our front-end system provides web-based high-definition visualisation of images, textual indexing and retrieval facilities, and searching and reading at the article level. Article transcriptions can be collaboratively corrected, which in turn allows for better indexing. We are currently testing our system on the archives of the Journal de Rouen, one of France's oldest local newspapers. These 250 years of publication amount to 300 000 pages of very variable image quality and layout complexity. Test year 1808 can be consulted at plair.univ-rouen.fr. Comment: ACM Document Engineering, France (2012)
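As an illustration of the step from pixel labels to structuring entities, here is a minimal sketch of one way horizontal separators might be located in a pixel-labelled page. It is not the authors' method, which the abstract does not detail. The label name `"sep"`, the function name `find_horizontal_separators` and the `min_fraction` threshold are all assumptions made for this example.

```python
def find_horizontal_separators(labels, separator="sep", min_fraction=0.75):
    """Return (first_row, last_row) bands, inclusive, of mostly-separator rows.

    `labels` is a list of rows, each row a list of per-pixel class labels.
    The label vocabulary and the threshold are hypothetical.
    """
    bands = []
    band_start = None
    for y, row in enumerate(labels):
        # Fraction of pixels in this row carrying the separator label.
        fraction = sum(1 for p in row if p == separator) / len(row) if row else 0.0
        is_separator_row = fraction >= min_fraction
        if is_separator_row and band_start is None:
            band_start = y  # a new band of separator rows begins
        elif not is_separator_row and band_start is not None:
            bands.append((band_start, y - 1))  # the band ended on the previous row
            band_start = None
    if band_start is not None:
        bands.append((band_start, len(labels) - 1))  # band runs to the page bottom
    return bands


# Toy 7x5 "page": two text blocks, a thin rule, and a thicker, slightly broken rule.
page = [
    ["text"] * 5,
    ["text"] * 5,
    ["sep"] * 5,
    ["text", "text", "bg", "text", "text"],
    ["sep", "sep", "sep", "sep", "bg"],
    ["sep"] * 5,
    ["text"] * 5,
]
print(find_horizontal_separators(page))  # [(2, 2), (4, 5)]
```

The row ranges between consecutive bands would then be candidate blocks for title and text-line detection. Vertical separators could be found the same way on the transposed grid.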

    Developing information architecture through records management classification techniques

    Purpose – This work aims to draw attention to information retrieval philosophies and techniques allied to the records management profession, advocating a wider professional consideration of a functional approach to information management, in this instance in the development of information architecture. Design/methodology/approach – The paper draws from a hypothesis originally presented by the author that advocated a viewpoint whereby the application of records management techniques, traditionally applied to develop business classification schemes, was offered as an additional solution to organising information resources and services (within a university intranet), where earlier approaches, notably subject- and administrative-based arrangements, were found to be lacking. The hypothesis was tested via work-based action learning and is presented here as an extended case study. The paper also draws on evidence submitted to the Joint Information Systems Committee in support of Abertay University's application for consideration for the JISC award for innovation in records and information management. Findings – The original hypothesis has been tested in the workplace. Information retrieval techniques, allied to records management (functional classification), were the main influence in the development of pre- and post-coordinate information retrieval systems to support a wider information architecture, where the subject approach was found to be lacking. Their use within the workplace has since been extended. Originality/value – The paper advocates that the development of information retrieval as a discipline should include a wider consideration of functional classification, as this alternative to the subject approach is largely ignored in mainstream IR works.

    A review of the state of the art in Machine Learning on the Semantic Web: Technical Report CSTR-05-003


    Uncertainty Detection as Approximate Max-Margin Sequence Labelling

    This paper reports experiments for the CoNLL 2010 shared task on learning to detect hedges and their scope in natural language text. We have addressed the experimental tasks as supervised linear maximum margin prediction problems. For sentence-level hedge detection in the biological domain we use an L1-regularised binary support vector machine, while for sentence-level weasel detection in the Wikipedia domain we use an L2-regularised approach. We model the in-sentence uncertainty cue and scope detection task as an L2-regularised approximate maximum margin sequence labelling problem, using the BIO-encoding. In addition to surface-level features, we use a variety of linguistic features based on a functional dependency analysis. A greedy forward selection strategy is used in exploring the large set of potential features. Our official results for Task 1 are an F1-score of 85.2 for the biological domain and 55.4 for the Wikipedia set. For Task 2, our official results are 2.1 for the entire task with a score of 62.5 for cue detection. After resolving errors and final bugs, our final results are, for Task 1, biological: 86.0, Wikipedia: 58.2; for Task 2, scopes: 39.6 and cues: 78.5.
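The BIO-encoding mentioned in the abstract can be sketched as follows. This is a generic illustration of the tagging scheme, not the authors' code. The function names, the `CUE` label and the example sentence are assumptions made for this example.

```python
def bio_encode(tokens, spans):
    """Return one BIO tag per token.

    `spans` is a list of (start, end, label) token ranges, end exclusive.
    B- marks the first token of a span, I- the following ones, O everything else.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span ({start}, {end}) is outside the sentence")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_decode(tags):
    """Recover (start, end, label) spans from a BIO tag sequence."""
    spans = []
    start = label = None
    # A trailing "O" sentinel closes any span that reaches the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and tag != f"I-{label}":
            spans.append((start, i, label))  # the current tag does not continue the span
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical sentence with a two-token uncertainty cue.
tokens = ["These", "results", "may", "suggest", "a", "role"]
tags = bio_encode(tokens, [(2, 4, "CUE")])
print(tags)              # ['O', 'O', 'B-CUE', 'I-CUE', 'O', 'O']
print(bio_decode(tags))  # [(2, 4, 'CUE')]
```

Encoding spans this way turns cue and scope detection into assigning one tag per token, which is what lets a sequence labeller be trained on it.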

    Developing geometrical reasoning in the secondary school: outcomes of trialling teaching activities in classrooms, a report to the QCA

    This report presents the findings of the Southampton/Hampshire Group of mathematicians and mathematics educators sponsored by the Qualifications and Curriculum Authority (QCA) to develop and trial some teaching/learning materials for use in schools that focus on the development of geometrical reasoning at the secondary school level. The project ran from October 2002 to November 2003. An interim report was presented to the QCA in March 2003.
    1. The Southampton/Hampshire Group consisted of five University mathematicians and mathematics educators, a local authority inspector, and five secondary school teachers of mathematics. The remit of the group was to develop and report on teaching ideas that focus on the development of geometrical reasoning at the secondary school level.
    2. In reviewing the existing geometry curriculum, the group endorsed the RS/JMC working group conclusion (RS/JMC geometry report, 2001) that the current mathematics curriculum for England contains sufficient scope for the development of geometrical reasoning, but that it would benefit from some clarification in respect of this aspect of geometry education. Such clarification would be especially helpful in resolving the very odd separation, in the programme of study for mathematics, of ‘geometrical reasoning’ from ‘transformations and co-ordinates’, as if transformations, for example, cannot be used in geometrical reasoning.
    3. The group formulated a rationale for designing and developing suitable teaching materials that support the teaching and learning of geometrical reasoning. The group suggests the following as guiding principles:
       • Geometrical situations selected for use in the classroom should, as far as possible, be chosen to be useful, interesting and/or surprising to pupils;
       • Activities should expect pupils to explain, justify or reason and provide opportunities for pupils to be critical of their own, and their peers’, explanations;
       • Activities should provide opportunities for pupils to develop problem solving skills and to engage in problem posing;
       • The forms of reasoning expected should be examples of local deduction, where pupils can utilise any geometrical properties that they know to deduce or explain other facts or results;
       • To build on pupils’ prior experience, activities should involve the properties of 2D and 3D shapes, aspects of position and direction, and the use of transformation-based arguments that are about the geometrical situation being studied (rather than being about transformations per se);
       • The generating of data or the use of measurements, while playing important parts in mathematics, and sometimes assisting with the building of conjectures, should not be an end point to pupils’ mathematical activity. Indeed, where sensible, in order to build geometric reasoning and discourage over-reliance on empirical verification, many classroom activities might use contexts where measurements or other forms of data are not generated.
    4. In designing and trialling suitable classroom material, the group found that the issue of how much structure to provide in a task is an important factor in maximising the opportunity for geometrical reasoning to take place. The group also found that the role of the teacher is vital in helping pupils to progress beyond straightforward descriptions of geometrical observations to encompass the reasoning that justifies those observations. Teacher knowledge in the area of geometry is therefore important.
    5. The group found that pupils benefit from working collaboratively in groups with the kind of discussion and argumentation that has to be used to articulate their geometrical reasoning. This form of organisation creates both the need and the forum for argumentation that can lead to mathematical explanation. Such development to mathematical explanation, and the forms for collaborative working that support it, do not, however, necessarily occur spontaneously. Such things need careful planning and teaching.
    6. Whilst pupils can demonstrate their reasoning ability orally, either as part of group discussion or through presentation of group work to a class, the transition to individual recording of reasoned argument causes significant problems. Several methods have been used successfully in this project to support this transition, including 'fact cards' and 'writing frames', but more research is needed into ways of helping written communication of geometrical reasoning to develop.
    7. It was found possible in this study to enable pupils of all ages and attainments within the lower secondary (Key Stage 3) curriculum to participate in mathematical reasoning, given appropriate tasks, teaching and classroom culture. Given the finding of the project that many pupils know more about geometrical reasoning than they can demonstrate in writing, the emphasis in assessment on individual written response does not capture the reasoning skills which pupils are able to develop and exercise. Sufficient time is needed for pupils to engage in reasoning through a variety of activities; skills of reasoning and communication are unlikely to be absorbed quickly by many students.
    8. The study suggests that it is appropriate for all teachers to aim to develop the geometrical reasoning of all pupils, but equally that this is a non-trivial task. Obstacles that need to be overcome are likely to include: uncertainty among many teachers about the nature of mathematical reasoning and about what is expected to be taught in this area; lack of exemplars of good practice (although we have tried to address this by lesson descriptions in this report), especially in using transformational arguments; lack of time and freedom in the curriculum to properly develop work in this area; an assessment system which does not recognise students’ oral powers of reasoning; and a lack of appreciation of the value of geometry as a vehicle for broadening the curriculum for high attainers, as well as developing reasoning and communication skills for all students.
    9. Areas for further work in geometrical reasoning include: longitudinal studies of how geometrical reasoning develops through time given a sustained programme of activities (in this project we were conscious that the timescale on which we were working only enabled us to present 'snapshots'); studies and evaluation of published materials on geometrical reasoning; a study of 'critical experiences' which influence the development of geometrical reasoning; an analysis of the characteristics of successful and unsuccessful tasks for geometrical reasoning; a study of the transition from verbal reasoning to written reasoning; a study of how overall perceptions of geometrical figures ('gestalt') develop as a component of geometrical reasoning (including how to create the links which facilitate this); and the use of dynamic geometry software in any (or all) of the above.
    10. As this group was one of six which could form a model for part of the work of regional centres set up like the IREMs in France, it seems worth recording that the constitution of the group worked very well, especially after members had got to know each other by working in smaller groups on specific topics. The balance of differing expertise was right, and we all felt that we learned a great deal from other group members during the experience. Overall, being involved in this type of research and development project was a powerful form of professional development for all those concerned. In retrospect, the group could have benefited from some longer full-day meetings to jointly develop ideas and analyse the resulting classroom material and experience, rather than the pattern of after-school meetings that did not always allow sufficient time to do full justice to the complexity of many of the issues the group was tackling.

    Ground Truth for Layout Analysis Performance Evaluation

    Over the past two decades a significant number of layout analysis (page segmentation and region classification) approaches have been proposed in the literature. Each approach has been devised for and/or evaluated using (usually small) application-specific datasets. While the need for objective performance evaluation of layout analysis algorithms is evident, there does not exist a suitable dataset with ground truth that reflects the realities of everyday documents (widely varying layouts, complex entities, colour, noise, etc.). The most significant impediment is the creation of accurate and flexible (in representation) ground truth, a task that is costly and must be carefully designed. This paper discusses the issues related to the design, representation and creation of ground truth in the context of a realistic dataset developed by the authors. The effectiveness of the ground truth discussed in this paper has been successfully shown in its use for two international page segmentation competitions (ICDAR2003 and ICDAR2005).