1,642 research outputs found
New Methods, Current Trends and Software Infrastructure for NLP
The increasing use of `new methods' in NLP, which the NeMLaP conference
series exemplifies, occurs in the context of a wider shift in the nature and
concerns of the discipline. This paper begins with a short review of this
context and significant trends in the field. The review motivates and leads to
a set of requirements for support software of general utility for NLP research
and development workers. A freely-available system designed to meet these
requirements is described (called GATE - a General Architecture for Text
Engineering). Information Extraction (IE), in the sense defined by the Message
Understanding Conferences (ARPA \cite{Arp95}), is an NLP application in which
many of the new methods have found a home (Hobbs \cite{Hob93}; Jacobs ed.
\cite{Jac92}). An IE system based on GATE is also available for research
purposes, and this is described. Lastly we review related work.Comment: 12 pages, LaTeX, uses nemlap.sty (included
Design issues in the production of hyperâbooks and visualâbooks
This paper describes an ongoing research project in the area of electronic books. After a brief overview of the state of the art in this field, two new forms of electronic book are presented: hyperâbooks and visualâbooks. A flexible environment allows them to be produced in a semiâautomatic way starting from different sources: electronic texts (as input for hyperâbooks) and paper books (as input for visualâbooks). The translation process is driven by the philosophy of preserving the book metaphor in order to guarantee that electronic information is presented in a familiar way. Another important feature of our research is that hyperâbooks and visualâbooks are conceived not as isolated objects but as entities within an electronic library, which inherits most of the features of a paperâbased library but introduces a number of new properties resulting from its nonâphysical nature
A multi-layered Bayesian network model for structured document retrieval
New standards in document representation, like for example SGML, XML, and MPEG-7, compel Information Retrieval to design and implement models and tools to index, retrieve and present documents according to the given document structure. The paper presents the design of an Information Retrieval system for multimedia structured documents, like for example journal articles, e-books, and MPEG-7 videos. The system is based on Bayesian Networks, since this class of mathematical models enable to represent and quantify the relations between the structural components of the document. Some preliminary results on the system implementation are also presented
A multi-layered Bayesian network model for structured document retrieval
New standards in document representation, like for example SGML, XML, and MPEG-7, compel Information Retrieval to design and implement models and tools to index, retrieve and present documents according to the given document structure. The paper presents the design of an Information Retrieval system for multimedia structured documents, like for example journal articles, e-books, and MPEG-7 videos. The system is based on Bayesian Networks, since this class of mathematical models enable to represent and quantify the relations between the structural components of the document. Some preliminary results on the system implementation are also presented
Thick 2D Relations for Document Understanding
We use a propositional language of qualitative rectangle relations to detect the reading order from document images. To this end, we define the notion of a document encoding rule and we analyze possible formalisms to express document encoding rules such as LATEX and SGML. Document encoding rules expressed in the propositional language of rectangles are used to build a reading order detector for document images. In order to achieve robustness and avoid brittleness when applying the system to real life document images, the notion of a thick boundary interpretation for a qualitative relation is introduced. The framework is tested on a collection of heterogeneous document images showing recall rates up to 89%
Software Infrastructure for Natural Language Processing
We classify and review current approaches to software infrastructure for
research, development and delivery of NLP systems. The task is motivated by a
discussion of current trends in the field of NLP and Language Engineering. We
describe a system called GATE (a General Architecture for Text Engineering)
that provides a software infrastructure on top of which heterogeneous NLP
processing modules may be evaluated and refined individually, or may be
combined into larger application systems. GATE aims to support both researchers
and developers working on component technologies (e.g. parsing, tagging,
morphological analysis) and those working on developing end-user applications
(e.g. information extraction, text summarisation, document generation, machine
translation, and second language learning). GATE promotes reuse of component
technology, permits specialisation and collaboration in large-scale projects,
and allows for the comparison and evaluation of alternative technologies. The
first release of GATE is now available - see
http://www.dcs.shef.ac.uk/research/groups/nlp/gate/Comment: LaTeX, uses aclap.sty, 8 page
SWI-Prolog and the Web
Where Prolog is commonly seen as a component in a Web application that is
either embedded or communicates using a proprietary protocol, we propose an
architecture where Prolog communicates to other components in a Web application
using the standard HTTP protocol. By avoiding embedding in external Web servers
development and deployment become much easier. To support this architecture, in
addition to the transfer protocol, we must also support parsing, representing
and generating the key Web document types such as HTML, XML and RDF.
This paper motivates the design decisions in the libraries and extensions to
Prolog for handling Web documents and protocols. The design has been guided by
the requirement to handle large documents efficiently. The described libraries
support a wide range of Web applications ranging from HTML and XML documents to
Semantic Web RDF processing.
To appear in Theory and Practice of Logic Programming (TPLP)Comment: 31 pages, 24 figures and 2 tables. To appear in Theory and Practice
of Logic Programming (TPLP
Ranking structured documents using utility theory in the Bayesian network retrieval model
In this paper a new method based on Utility and Decision theory is presented to deal with structured documents. The aim of the application of these methodologies is to refine a first ranking of structural units, generated by means of an Information Retrieval Model based on Bayesian Networks. Units are newly arranged in the new ranking by combining their posterior probabilities, obtained in the first stage, with the expected utility of retrieving them. The experimental work has been developed using the Shakespeare structured collection and the results show an improvement of the effectiveness of this new approach
GATE -- an Environment to Support Research and Development in Natural Language Engineering
We describe a software environment to support research and development in natural language (NL) engineering. This environment -- GATE (General Architecture for Text Engineering) -- aims to advance research in the area of machine processing of natural languages by providing a software infrastructure on top of which heterogeneous NL component modules may be evaluated and refined individually or may be combined into larger application systems. Thus, GATE aims to support both researchers and developers working on component technologies (e.g. parsing, tagging, morphological analysis) and those working on developing end-user applications (e.g. information extraction, text summarisation, document generation, machine translation, and second language learning). GATE will promote reuse of component technology, permit specialisation and collaboration in large-scale projects, and allow for the comparison and evaluation of alternative technologies. The first release of GATE is now available
- âŚ