Automated Detection of Usage Errors in non-native English Writing

Author: Fujishima Satoru
Ishizaki Shun
Publication date: 26/10/2011

We investigate the use of a novelty detection algorithm for identifying inappropriate word combinations in a raw English corpus. We employ an unsupervised detection algorithm based on one-class support vector machines (OC-SVMs) and extract sentences containing word sequences whose frequency of appearance is significantly low in native English writing. Combined with n-gram language models and document categorization techniques, the OC-SVM classifier assigns given sentences to two groups: sentences containing errors and sentences without errors. Accuracies are 79.30% with the bigram model, 86.63% with the trigram model, and 34.34% with the four-gram model.

EEPIS Repository
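
The n-gram frequency idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' OC-SVM classifier: it only flags sentences that contain bigrams rarely seen in a reference corpus. The function names, the `min_count` threshold, and the three-sentence reference corpus are assumptions made up for illustration.

```python
from collections import Counter


def bigrams(sentence):
    """Return the lowercase word bigrams of a sentence, split on whitespace."""
    words = sentence.lower().split()
    return list(zip(words, words[1:]))


def build_counts(reference_sentences):
    """Count every bigram seen in the (hypothetical) native reference corpus."""
    counts = Counter()
    for sentence in reference_sentences:
        counts.update(bigrams(sentence))
    return counts


def flag_sentence(sentence, counts, min_count=1):
    """Return the bigrams whose reference count is below min_count."""
    # Counter returns 0 for unseen keys, so unseen bigrams are always flagged.
    return [bg for bg in bigrams(sentence) if counts[bg] < min_count]


# Toy reference corpus (illustrative only, not data from the paper).
reference = [
    "she made a decision",
    "he made a mistake",
    "she did the homework",
]
counts = build_counts(reference)

rare_ok = flag_sentence("she made a mistake", counts)   # every bigram was seen
rare_bad = flag_sentence("she did a mistake", counts)   # ("did", "a") was not
```

A real system would need a large native corpus and a learned decision boundary such as the one-class SVM the abstract describes; a fixed count threshold is only the simplest stand-in.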

Decompositional Argument Mining: A General Purpose Approach for Argument Graph Construction

Author: Gemechu Debela
Reed Chris
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2019

Crossref

University of Dundee Online Publications

Post-OCR Paragraph Recognition by Graph Convolutional Networks

Author: Fujii Yasuhisa
Popat Ashok C.
Wang Renshen
Publication date: 20/07/2021

Paragraphs are an important class of document entities. We propose a new approach for paragraph identification by spatial graph convolutional neural networks (GCN) applied to OCR text boxes. Two steps, namely line splitting and line clustering, are performed to extract paragraphs from the lines in OCR results. Each step uses a beta-skeleton graph constructed from bounding boxes, where the graph edges provide efficient support for graph convolution operations. With only pure layout input features, the GCN model size is 3 to 4 orders of magnitude smaller than R-CNN based models, while achieving comparable or better accuracies on PubLayNet and other datasets. Furthermore, the GCN models show good generalization from synthetic training data to real-world images, and good adaptivity for variable document styles.

arXiv.org e-Print Archive

Handwritten Character Recognition of South Indian Scripts: A Review

Author: Jomy John
Kannan Balakrishnan
Pramod K. V.
Publication date: 01/06/2011

Handwritten character recognition remains a frontier area of research in pattern recognition and image processing, and there is a large demand for OCR on handwritten documents. Although sufficient studies have been performed on foreign scripts such as Chinese, Japanese and Arabic, only a few works can be traced on handwritten character recognition of Indian scripts, especially the South Indian scripts. This paper provides an overview of offline handwritten character recognition in South Indian scripts, namely Malayalam, Tamil, Kannada and Telugu.

Comment: Paper presented at the "National Conference on Indian Language Computing", Kochi, February 19-20, 2011. 6 pages, 5 figures.

arXiv.org e-Print Archive

SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes

Author: Li Lei
Qian Xian
Wang Mingxuan
Wang William Yang
Xu Wenda
Publication date: 07/07/2023

Is it possible to train a general metric for evaluating text generation quality without human-annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SESCORE2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. The primary advantage of SESCORE2 is its ease of extension to many other languages while providing reliable severity estimation. We evaluate SESCORE2 and previous methods on four text generation tasks across three languages. SESCORE2 outperforms the unsupervised metric PRISM on four text generation evaluation benchmarks, with a Kendall improvement of 0.078. Surprisingly, SESCORE2 even outperforms the supervised BLEURT and COMET on multiple text generation tasks. The code and data are available at https://github.com/xu1998hz/SEScore2.

Comment: Accepted at ACL 2023 Main Conference.

arXiv.org e-Print Archive
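
The SESCORE2 abstract above reports its gain as a Kendall correlation improvement without stating which Kendall variant is used. As a rough illustration of how a metric's agreement with human ratings can be measured, the sketch below computes plain Kendall tau-a. The choice of tau-a, the function name, and the toy scores are assumptions, not details from the paper.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have equal length")
    if len(metric_scores) < 2:
        raise ValueError("need at least two items to form a pair")
    concordant = discordant = 0
    pairs = list(combinations(range(len(metric_scores)), 2))
    for i, j in pairs:
        # Positive product: both scores order items i and j the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Toy scores (illustrative only, not results from the paper).
human = [4, 2, 3, 1]
metric_a = [0.9, 0.4, 0.7, 0.1]  # ranks all four items like the human scores
metric_b = [0.9, 0.7, 0.4, 0.1]  # swaps the two middle items

tau_a = kendall_tau_a(metric_a, human)
tau_b = kendall_tau_a(metric_b, human)
```

Here `metric_a` agrees with the human ordering on all six pairs, while `metric_b` disagrees on one, so its tau is lower. An "improvement" in the abstract's sense is a difference between two such correlation values.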

AI-assisted patent prior art searching - feasibility study

Author: Setchi Rossi
Spasic Irena
Publication venue: The Intellectual Property Office
Publication date: 30/04/2020

This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve the operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators’ Pioneer Fund (RPF). The RPF was set up to help address barriers to innovation in the UK economy.

Online Research @ Cardiff