Visual Execution and Data Visualisation in Natural Language Processing
We describe GGI, a visual system that allows the user to execute an automatically generated data-flow graph containing code modules that perform natural language processing tasks. These code modules operate on text documents. GGI has a suite of text visualisation tools that gives the user useful views of the annotation data produced by the modules in the executable graph. GGI forms part of the GATE natural language engineering system.
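The data-flow idea above can be sketched in a few lines: modules that each read and enrich a shared annotated document, executed in graph order. The module names, document structure, and wiring below are illustrative, not GGI's or GATE's actual API.

```python
# A minimal sketch of a data-flow graph of NLP modules, in the spirit of
# GGI/GATE. Each module reads the shared document and adds annotations.

def tokenize(doc):
    # Annotate the document with its tokens (naive whitespace split).
    doc["annotations"]["tokens"] = doc["text"].split()
    return doc

def pos_tag(doc):
    # Toy tagger: capitalised tokens become NNP, everything else NN.
    doc["annotations"]["pos"] = [
        (t, "NNP" if t[0].isupper() else "NN")
        for t in doc["annotations"]["tokens"]
    ]
    return doc

def run_graph(doc, modules):
    # Execute the modules in topological order over the shared document.
    for module in modules:
        doc = module(doc)
    return doc

doc = {"text": "GATE processes text documents", "annotations": {}}
result = run_graph(doc, [tokenize, pos_tag])
print(result["annotations"]["pos"])
```

A visualisation layer like GGI's would then render the accumulated `annotations` rather than print them.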
Image and Interpretation: Using Artificial Intelligence to Read Ancient Roman Texts
The ink and stylus tablets discovered at the Roman fort of Vindolanda are a unique resource for scholars of ancient history. However, the stylus tablets have proved particularly difficult to read. This paper describes a system that assists expert papyrologists in the interpretation of the Vindolanda writing tablets. A model-based approach is taken that relies on models of the written form of characters, and statistical modelling of language, to produce plausible interpretations of the documents. Fusion of the contributions from the language, character, and image feature models is achieved by utilizing the GRAVA agent architecture, which uses Minimum Description Length as the basis for information fusion across semantic levels. A system is developed that reads in image data and outputs plausible interpretations of the Vindolanda tablets.
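The Minimum Description Length criterion behind that fusion can be illustrated very simply: each candidate reading costs the bits needed to encode it under each model, and the reading with the smallest total cost wins. The candidate strings and all probabilities below are invented for illustration and are not from the paper.

```python
import math

# Illustrative MDL-based selection: an interpretation's description length is
# the sum, over models, of -log2(P(reading | model)). Smaller is better.

def description_length(probs):
    # Total encoding cost in bits across the contributing models.
    return sum(-math.log2(p) for p in probs)

candidates = {
    # reading: (P under character/image model, P under language model)
    "claudia severa": (0.30, 0.20),
    "claudio severo": (0.35, 0.02),   # visually similar, linguistically unlikely
}

best = min(candidates, key=lambda r: description_length(candidates[r]))
print(best)
```

The point of the fusion is visible here: the second reading scores slightly better under the image model but its poor language-model probability makes its total description length larger.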
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view, as well as from a purely technological side.
This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
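Fuzzy matching, the first topic above, means retrieving the stored source segment most similar to a new sentence so its translation can be reused. SCATE's actual matching metrics are more sophisticated; the sketch below stands in with the standard-library `difflib.SequenceMatcher`, and the tiny translation memory is invented.

```python
from difflib import SequenceMatcher

# A minimal translation-memory fuzzy match: score every stored source segment
# against the query and return the best one above a similarity threshold.

memory = {
    "Press the start button.": "Druk op de startknop.",
    "Close the application.":  "Sluit de toepassing.",
}

def fuzzy_match(query, tm, threshold=0.6):
    best_src, best_score = None, 0.0
    for src in tm:
        score = SequenceMatcher(None, query.lower(), src.lower()).ratio()
        if score > best_score:
            best_src, best_score = src, score
    if best_score < threshold:
        return None
    return best_src, tm[best_src], best_score

match = fuzzy_match("Press the stop button.", memory)
print(match)
```

A real system would also show the translator which parts of the matched segment differ from the query, so only those spans need post-editing.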
Development of parsing tools for CASL using generic language technology
An environment for the Common Algebraic Specification Language CASL consists of independent tools. A number of CASL tools have been built using the algebraic formalism ASF+SDF and the ASF+SDF Meta-Environment. CASL supports user-defined syntax, which is non-trivial to parse: ASF+SDF offers a powerful parsing technique (Generalized LR). Its interactive environment facilitates rapid prototyping, complemented by early detection and correction of errors. A number of core technologies developed for the ASF+SDF Meta-Environment can be reused in the context of CASL. Furthermore, an instantiation of a format developed for the representation of ASF+SDF specifications and terms provides a CASL-specific exchange format.
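Why user-defined syntax calls for generalized parsing can be shown with a toy example: without declared precedence or associativity, even `a + b + c` has more than one parse, and a generalized parser (such as GLR) returns the whole parse forest rather than failing or picking one arbitrarily. The enumerator below is a naive illustration, not a GLR implementation.

```python
# Enumerate every binary bracketing of an operand/operator token list, as a
# stand-in for the parse forest a generalized parser would produce for
# ambiguous user-defined syntax.

def parses(tokens):
    if len(tokens) == 1:
        return [tokens[0]]
    results = []
    for i in range(1, len(tokens), 2):       # split at each operator position
        for left in parses(tokens[:i]):
            for right in parses(tokens[i + 1:]):
                results.append((left, tokens[i], right))
    return results

forest = parses(["a", "+", "b", "+", "c"])
print(forest)   # both (a+b)+c and a+(b+c)
```

Declaring `+` as, say, left-associative would let a later disambiguation pass prune this forest to a single tree.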
A Formal Framework for Linguistic Annotation
`Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
`named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats.
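The annotation-graph core surveyed above can be sketched concretely: nodes are anchors (optionally time-stamped), and each annotation is a labelled arc between two nodes, so annotations at different levels share the same node set. The field names and query function below are illustrative, not the paper's exact formalisation.

```python
from collections import namedtuple

# A small annotation graph: nodes anchor positions in the signal, arcs carry
# typed labels. A phrase-level arc can span several word-level arcs.

Arc = namedtuple("Arc", "src dst type label")

nodes = {0: 0.00, 1: 0.35, 2: 0.81}   # node id -> time offset in seconds
arcs = [
    Arc(0, 1, "word", "hello"),
    Arc(1, 2, "word", "world"),
    Arc(0, 2, "phrase", "greeting"),  # spans both word arcs
]

def annotations_between(a, b):
    # All arcs fully contained in the node interval [a, b].
    return [arc for arc in arcs if a <= arc.src and arc.dst <= b]

print([arc.label for arc in annotations_between(0, 2)])
```

Because every annotation is just a labelled arc over shared nodes, searching and maintaining annotations reduces to graph queries, independent of whatever file format the graph was serialised from.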
Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale
Over the past few decades, the amount of scientific articles and technical
literature has increased exponentially in size. Consequently, there is a great
need for systems that can ingest these documents at scale and make the
contained knowledge discoverable. Unfortunately, both the format of these
documents (e.g. the PDF format or bitmap images) and the presentation of
the data (e.g. complex tables) make the extraction of qualitative and
quantitative data extremely challenging. In this paper, we present a modular,
cloud-based platform to ingest documents at scale. This platform, called the
Corpus Conversion Service (CCS), implements a pipeline which allows users to
parse and annotate documents (i.e. collect ground-truth), train
machine-learning classification algorithms and ultimately convert any type of
PDF or bitmap-documents to a structured content representation format. We will
show that each of the modules is scalable due to an asynchronous microservice
architecture and can therefore handle massive amounts of documents.
Furthermore, we will show that our capability to gather ground-truth is
accelerated by machine-learning algorithms by at least one order of magnitude.
This allows us to both gather large amounts of ground-truth in very little time
and obtain very good precision/recall metrics in the range of 99% with regard
to content conversion to structured output. The CCS platform is currently
deployed on IBM internal infrastructure and serving more than 250 active users
for knowledge-engineering project engagements. Comment: Accepted paper at the KDD 2018 conference.
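The asynchronous microservice architecture that makes each module independently scalable can be sketched as stages connected by queues: each stage pulls work from its own queue, so any stage can be replicated without touching the others. The stage functions, payloads, and sentinel-based shutdown below are illustrative, not the CCS implementation.

```python
import asyncio

# A toy asynchronous document-conversion pipeline: parse -> convert, with each
# stage consuming from its own queue. A None sentinel shuts the pipeline down.

async def stage(inbox, outbox, work):
    while True:
        doc = await inbox.get()
        if doc is None:                 # shutdown sentinel: pass it along
            await outbox.put(None)
            return
        await outbox.put(work(doc))

async def main(docs):
    q_parse, q_convert, done = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(stage(q_parse, q_convert, lambda d: d + ":parsed")),
        asyncio.create_task(stage(q_convert, done, lambda d: d + ":json")),
    ]
    for d in docs:
        await q_parse.put(d)
    await q_parse.put(None)
    await asyncio.gather(*tasks)
    results = []
    while not done.empty():
        item = done.get_nowait()
        if item is not None:            # drop the sentinel
            results.append(item)
    return results

results = asyncio.run(main(["doc1.pdf", "doc2.pdf"]))
print(results)
```

In a real deployment the queues would be a message broker and each stage a separately scaled service, which is what lets a slow stage (e.g. ML-based annotation) be replicated independently of the rest.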