4,890 research outputs found
Automated Detection of Usage Errors in non-native English Writing
In an investigation of the use of a novelty detection algorithm for identifying inappropriate word
combinations in a raw English corpus, we employ an
unsupervised detection algorithm based on the one-
class support vector machines (OC-SVMs) and extract
sentences containing word sequences whose frequency
of appearance is significantly low in native English
writing. Combined with n-gram language models and
document categorization techniques, the OC-SVM classifier assigns given sentences into two different
groups; the sentences containing errors and those
without errors. Accuracies are 79.30 % with bigram
model, 86.63 % with trigram model, and 34.34 % with four-gram model
Post-OCR Paragraph Recognition by Graph Convolutional Networks
Paragraphs are an important class of document entities. We propose a new
approach for paragraph identification by spatial graph convolutional neural
networks (GCN) applied on OCR text boxes. Two steps, namely line splitting and
line clustering, are performed to extract paragraphs from the lines in OCR
results. Each step uses a beta-skeleton graph constructed from bounding boxes,
where the graph edges provide efficient support for graph convolution
operations. With only pure layout input features, the GCN model size is 3~4
orders of magnitude smaller compared to R-CNN based models, while achieving
comparable or better accuracies on PubLayNet and other datasets. Furthermore,
the GCN models show good generalization from synthetic training data to
real-world images, and good adaptivity for variable document styles
Handwritten Character Recognition of South Indian Scripts: A Review
Handwritten character recognition is always a frontier area of research in
the field of pattern recognition and image processing and there is a large
demand for OCR on hand written documents. Even though, sufficient studies have
performed in foreign scripts like Chinese, Japanese and Arabic characters, only
a very few work can be traced for handwritten character recognition of Indian
scripts especially for the South Indian scripts. This paper provides an
overview of offline handwritten character recognition in South Indian Scripts,
namely Malayalam, Tamil, Kannada and Telungu.Comment: Paper presented on the "National Conference on Indian Language
Computing", Kochi, February 19-20, 2011. 6 pages, 5 figure
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes
Is it possible to train a general metric for evaluating text generation
quality without human annotated ratings? Existing learned metrics either
perform unsatisfactorily across text generation tasks or require human ratings
for training on specific tasks. In this paper, we propose SESCORE2, a
self-supervised approach for training a model-based metric for text generation
evaluation. The key concept is to synthesize realistic model mistakes by
perturbing sentences retrieved from a corpus. The primary advantage of the
SESCORE2 is its ease of extension to many other languages while providing
reliable severity estimation. We evaluate SESCORE2 and previous methods on four
text generation tasks across three languages. SESCORE2 outperforms unsupervised
metric PRISM on four text generation evaluation benchmarks, with a Kendall
improvement of 0.078. Surprisingly, SESCORE2 even outperforms the supervised
BLEURT and COMET on multiple text generation tasks. The code and data are
available at https://github.com/xu1998hz/SEScore2.Comment: Accepted at ACL2023 Main Conferenc
AI-assisted patent prior art searching - feasibility study
This study seeks to understand the feasibility, technical complexities and effectiveness of using artificial intelligence (AI) solutions to improve operational processes of registering IP rights. The Intellectual Property Office commissioned Cardiff University to undertake this research. The research was funded through the BEIS Regulators’ Pioneer Fund (RPF). The RPF fund was set up to help address barriers to innovation in the UK economy
- …