Search CORE

6,151 research outputs found

A step towards understanding paper documents

Author: Dengel Andreas
Publication venue: Sonstige Einrichtungen. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz
Publication date: 01/01/1990

This report focuses on the analysis steps necessary for paper document processing. It is divided into three major parts: document image preprocessing, knowledge-based geometric classification of the image, and expectation-driven text recognition. It first illustrates several low-level image processing procedures that provide the physical structure of a scanned document image. It then describes a knowledge-based approach developed for identifying logical objects (e.g., the sender or the footnote of a letter) in a document image. The logical identifiers allow the contained text to be considered in a restricted context. Using specific logical dictionaries, expectation-driven text recognition can identify text parts of specific interest. The system has been implemented in Common Lisp on a SUN 3/60 workstation for the analysis of single-sided business letters. It runs on a large population of different letters. The report also illustrates and discusses examples of typical results obtained by the system.

Universaar


Southeast Asia Primary Learning Metrics (SEA-PLM) Assessment Framework

Author: Australian Council for Educational Research (ACER)
Southeast Asian Ministers of Education Organization - SEAMEO
UNICEF
Publication venue: ACEReSearch
Publication date: 01/01/2017

This assessment framework for the South-East Asia Primary Learning Metric (SEA-PLM) assessment program outlines an approach to assessing mathematical literacy (Chapter 2), reading literacy (Chapter 3) and writing literacy (Chapter 4). It also puts forward a conceptual framework for the context questionnaires (Chapter 5). The orientation implied by these labels is intended to emphasise a major purpose of the curriculum arrangements in participating countries, which are necessarily at the centre of a regional assessment program: preparing young people to participate effectively as members of society, so that they can use what they have learned at school – their reading, writing and mathematics skills, and their citizenship – to deal with the many challenges they will meet in life beyond school. The purpose of this assessment framework is to articulate the basic structure of the SEA-PLM. It describes the constructs to be measured, outlines the design and content of the measurement instruments, and describes how the measures generated by those instruments relate to the constructs.

Repositorio Institucional Universidad César Vallejo

ACEReSearch

Registro Nacional de Trabajos de Investigación y Proyectos

The FORTRAN static source code analyzer program (SAP) user's guide, revision 1

Author: Decker W.
Eslinger S.
Taylor W.

The FORTRAN Static Source Code Analyzer Program (SAP) User's Guide (Revision 1) is presented. SAP is a software tool designed to assist Software Engineering Laboratory (SEL) personnel in conducting studies of FORTRAN programs. SAP scans FORTRAN source code and produces reports that present statistics and measures of the statements and structures that make up a module. This document is a revision of the previous SAP user's guide, Computer Sciences Corporation document CSC/TM-78/6045. SAP Revision 1 is the result of program modifications to provide several new reports, additional complexity analysis, and recognition of all statements described in the FORTRAN 77 standard. This document provides instructions for operating SAP and contains information useful in interpreting SAP output.

NASA Technical Reports Server

Plagiarism detection for Indonesian texts

Author: Krisnawati Lucia Dwi
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 18/05/2016

As plagiarism becomes an increasing concern for Indonesian universities and research centers, the need for automatic plagiarism checkers is becoming more pressing. However, research on Plagiarism Detection Systems (PDS) for Indonesian documents is not well developed: most studies deal with detecting duplicate or near-duplicate documents, do not address the problem of retrieving source documents, or tend to measure document similarity globally. The resulting systems therefore cannot point to the exact locations of "similar passage" pairs. Moreover, no public, standard corpus has been available for evaluating PDS on Indonesian texts. To address these weaknesses, this thesis develops a plagiarism detection system that executes various methods for each plagiarism detection stage in a workflow system. In the retrieval stage, a novel document feature called the phraseword is introduced and used along with word unigrams and character n-grams to address the problem of retrieving source documents whose contents are partially copied or obfuscated in a suspicious document. The detection stage, which uses a two-step paragraph-based comparison, aims to detect and locate source-obfuscated passage pairs. The seeds for matching source-obfuscated passage pairs are based on locally weighted significant terms, in order to capture paraphrased and summarized passages. In addition to the system, an evaluation corpus was created through simulation by human writers and by algorithmic random generation. Using this corpus, the proposed methods were evaluated in three scenarios. In the first scenario, which evaluated source retrieval performance, some methods using phraseword and token features achieved the optimum recall rate of 1.
In the second scenario, which evaluated detection performance, our system was compared to Alvi's algorithm and evaluated at four levels of measurement: character, passage, document, and case. The results showed that methods using tokens as seeds score higher than Alvi's algorithm at all four levels, in both artificial and simulated plagiarism cases. In case detection, our system outperforms Alvi's algorithm in recognizing copied, shaken, and paraphrased passages. However, Alvi's recognition rate on summarized passages is higher than that of our system, though not significantly. The third scenario showed the same tendency, except that Alvi's algorithm has higher precision than our system at the character and paragraph levels. Some methods in our system produce higher Plagdet scores than Alvi's algorithm, which shows that this study has met its objective of implementing a competitive state-of-the-art algorithm for detecting plagiarism in Indonesian texts. When run on our test corpus, Alvi's algorithm's highest scores for recall, precision, Plagdet, and detection rate on no-plagiarism cases correspond to its scores on the PAN'14 corpus. Thus, this study contributes a standard evaluation corpus for assessing PDS for Indonesian documents. It also contributes a source retrieval algorithm that introduces phrasewords as document features, and a paragraph-based text alignment algorithm that relies on two different strategies. One of them applies local word weighting, as used in the field of text summarization, to select seeds both for discriminating paragraph pair candidates and for the matching process. The proposed detection algorithm produces almost no multiple detections, which adds to its strength.
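
The retrieval stage described in this abstract uses character n-grams as one of its document features. The following is a minimal sketch of how character n-gram overlap could rank candidate source documents. It is not the thesis's algorithm: the function names, the n-gram length of 4, the Jaccard similarity, and the sample sentences are all assumptions made for illustration, and the phraseword feature is not modelled because the abstract does not define it.

```python
def char_ngrams(text: str, n: int = 4) -> set[str]:
    """Return the set of character n-grams of a whitespace-normalized, lowercased text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_sources(suspicious: str, sources: dict[str, str], n: int = 4) -> list[tuple[str, float]]:
    """Rank hypothetical source documents by n-gram overlap with the suspicious text."""
    query = char_ngrams(suspicious, n)
    scored = [(name, jaccard(query, char_ngrams(text, n))) for name, text in sources.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative data only (hypothetical documents, not from the thesis corpus).
sources = {
    "doc_a": "saya pergi ke pasar pagi ini",
    "doc_b": "hujan turun sepanjang malam",
}
ranking = rank_sources("saya pergi ke pasar", sources)
```

Character n-grams tolerate small edits such as changed affixes, which is one reason they are a common choice when copied text has been lightly obfuscated.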

BlogForever D2.6: Data Extraction Methodology

Author: Banos V.
Davis R.
Gkotsis G.
Pincent E.
Stepanyan K.
Publication date: 25/10/2013

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Writing as a sociolinguistic object

Author: Agha
Barton
Basso
Blommaert
Burke
Canagarajah
Collins
Goffman
Gumperz
Heath
Hymes
Jakobson
Jørgensen
Kress
Miceli
Prinsloo
Rampton
Scollon
Silverstein
Smits-Engelman
Street
Stroud
Varis
Velghe
Verschueren
Wang
Publication venue: Wiley
Publication date: 01/01/2012

Crossref

Ghent University Academic Bibliography

Tilburg University Repository

Off-line Thai handwriting recognition in legal amount

Author: Chatwiriya Watchara
Publication venue: The Research Repository @ WVU
Publication date: 01/12/2002

Thai handwriting in legal amounts is a challenging problem and a new field in handwriting recognition research. The focus of this thesis is to implement a Thai handwriting recognition system. A preliminary data set of Thai handwriting in legal amounts is designed. The samples in the data set are characters and words of Thai legal amounts and a set of legal amount phrases collected from a number of native Thai volunteers. In the preprocessing and recognition stages, techniques are introduced to improve the character recognition rates. The characters are divided into two smaller subgroups according to their writing level, named the body and high groups. The recognition rates of both groups are increased by exploiting their distinguishing features. The writing level separation algorithms are implemented using the size and position of characters. Empirical experiments are set up to find the combination of features that best increases the recognition rates. Traditional recognition systems are modified to give the cumulative top-3 ranked answers to cover the possible character classes. In the postprocessing stage, lexicon matching algorithms are implemented to match the ranked characters with the legal amount words. These matched words are joined together to form possible choices of amounts, whose syntax is checked in the last stage. Several syntax violations are caused by faulty character segmentation and recognition resulting from connected or broken characters. The handwriting anomalies caused by these characters are mainly detected by their size and shape. During the recovery process, the possible word boundary patterns can be pre-defined and used to segment the hypothesized words. These words are identified by the word recognizer, and the results are joined with previously matched words to form the full amounts, which are checked against the syntax rules again.
For 154 amounts written by 10 writers, the rejection rate with the recovery process is 14.9 percent. The recognition rate for the accepted amounts is 100 percent.
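
The abstract describes matching top-3 ranked character candidates against a lexicon of legal amount words. The sketch below shows one way such a match could work. It is an assumption-laden illustration, not the thesis's algorithm: the function name `match_lexicon`, the rank-sum cost, the equal-length requirement, and the sample candidates are all hypothetical.

```python
def match_lexicon(ranked: list[list[str]], lexicon: list[str]) -> list[tuple[str, int]]:
    """Match per-position ranked character candidates against lexicon words.

    `ranked[i]` holds the recognizer's candidates for position i, best first
    (here the top 3). A word matches when every character appears among the
    candidates at its position. The cost is the sum of candidate ranks, so a
    word spelled entirely from first-choice characters has cost 0.
    """
    matches = []
    for word in lexicon:
        if len(word) != len(ranked):
            continue  # assumption: segmentation already fixed the character count
        cost = 0
        for ch, candidates in zip(word, ranked):
            if ch not in candidates:
                break
            cost += candidates.index(ch)
        else:
            matches.append((word, cost))
    return sorted(matches, key=lambda pair: pair[1])


# Hypothetical recognizer output for a three-character word.
ranked = [["ส", "ล", "ฮ"], ["า", "อ", "ว"], ["ม", "ง", "น"]]
lexicon = ["สอง", "สาม", "บาท"]  # "two", "three", "baht"
result = match_lexicon(ranked, lexicon)
```

Keeping every matching word, not only the cheapest one, leaves alternatives available for a later syntax check of the full amount, as the abstract describes.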

The Research Repository @ WVU (West Virginia University)

Adapting a relation extraction pipeline for the BioCreAtIvE II task

Author: Grover Claire
Haddow Barry
Klein Ewan
Matthews Michael
Nielsen Leif Arda
Tobin Richard
Wang Xinglong
Publication date: 01/01/2007

Edinburgh Research Explorer

Hybrid semantic-document models

Author: Darren Clowes
Publication date: 01/01/2013

This thesis presents the concept of hybrid semantic-document models to aid information management when using standards for complex technical domains such as military data communication. These standards are traditionally text-based documents for human interpretation, but prose sections can often be ambiguous and can lead to discrepancies and subsequent implementation problems. Many organisations produce semantic representations of the material to ensure common understanding and to exploit computer-aided development. In developing these semantic representations, no relationship is maintained to the original prose. Maintaining relationships between the original prose and the semantic model has key benefits, including assessing conformance at a semantic level and enabling original content authors to explicitly define their intentions, thus reducing ambiguity and facilitating computer-aided functionality. Through the use of a case study method based on the military standard MIL-STD-6016C, a framework of relationships is proposed. These relationships can integrate with common document modelling techniques and provide the necessary functionality to allow semantic content to be mapped into document views. These relationships are then generalised for applicability to a wider context. Additionally, this framework is coupled with a templating approach which, for repeating sections, can improve consistency and further enhance quality. A reflective approach to model-driven web rendering is presented and evaluated. This reflective approach uses self-inspection at runtime to read directly from the model, thus eliminating the need for any generative processes that result in data duplication across sources used for different purposes.

Loughborough University Institutional Repository

Natural language software registry (second edition)

Author: Hinkelman Elizabeth
Jung Christoph
Vonerden Markus
Publication venue: Sonstige Einrichtungen. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz
Publication date: 01/01/1993

Universaar
