62 research outputs found
An Information Extraction Approach to Reorganizing and Summarizing Specifications
Materials and Process Specifications are complex semi-structured documents containing numeric data, text, and images. This article describes a coarse-grain extraction technique for automatically reorganizing and summarizing specification content. Specifically, a strategy for semantic markup, which captures content within a semantic ontology suited to semi-automatic extraction, has been developed and experimented with. The working prototypes were built in the context of Cohesia's existing software infrastructure, and use techniques from Information Extraction and XML technology
A Robust Architecture For Human Language Technology Systems
Early human language technology systems were designed in a monolithic fashion. As these systems became more complex, this design became untenable. In its place, the concept of distributed processing evolved, wherein the monolithic structure was decomposed into a number of functional components that could interact through a common protocol. This distributed framework was readily accepted by the research community and has been the cornerstone for the advancement of cutting-edge human language technology prototype systems. The Defense Advanced Research Projects Agency (DARPA) Communicator program has been highly successful in implementing this approach. The program has fueled the design and development of impressive human language technology applications. Its distributed framework has offered numerous benefits to the research community, including reduced prototype development time, sharing of components across sites, and provision of a standard evaluation platform. It has also enabled development of client-server applications with complex inter-process communication between modules. However, this latter feature, though beneficial, introduces complexities which reduce overall system robustness to failure. In addition, the ability to handle multiple users and multiple applications from a common interface is not innately supported. This thesis describes enhancements to the original Communicator architecture that address robustness issues and provide a multi-user, multi-application environment by enabling automated server startup, error detection, and correction. Extensive experimentation and analysis were performed to measure improvements in robustness due to the enhancements to the DARPA architecture. A 7.2% improvement in robustness was achieved on the address querying task, which is the most complex task in the human language technology system
COSPO/CENDI Industry Day Conference
The conference's objective was to provide a forum where government information managers and industry information technology experts could have an open exchange and discuss their respective needs and compare them to the available, or soon to be available, solutions. Technical summaries and points of contact are provided for the following sessions: secure products, protocols, and encryption; information providers; electronic document management and publishing; information indexing, discovery, and retrieval (IIDR); automated language translators; IIDR - natural language capabilities; IIDR - advanced technologies; IIDR - distributed heterogeneous and large database support; and communications - speed, bandwidth, and wireless
An improved method for text summarization using lexical chains
This work is directed toward the creation of a system for automatically summarizing documents by extracting selected sentences. Several heuristics including position, cue words, and title words are used in conjunction with lexical chain information to create a salience function that is used to rank sentences for extraction. Compiler technology, including the Flex and Bison tools, is used to create the AutoExtract summarizer that extracts and combines this information from the raw text. The WordNet database is used for the creation of the lexical chains. The AutoExtract summarizer performed better than the Microsoft Word97 AutoSummarize tool and the Sinope commercial summarizer in tests against ideal extracts and in tests judged by humans
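A salience function of the kind described can be sketched as follows. This is a minimal illustration of combining position, cue-word, and title-word heuristics with a lexical-chain score; the weights, cue-word list, and function names are invented for illustration and are not taken from AutoExtract.

```python
import re

# Illustrative cue words; the actual AutoExtract word list is not reproduced here.
CUE_WORDS = {"significant", "conclusion", "results", "important"}

def salience(sentence, index, total, title_words, chain_score=0.0):
    """Combine simple heuristics into a single ranking score for one sentence."""
    words = set(re.findall(r"\w+", sentence.lower()))
    position = 1.0 - index / max(total, 1)      # earlier sentences score higher
    cue = 0.5 * len(words & CUE_WORDS)          # bonus for cue words
    title = 0.75 * len(words & title_words)     # bonus for overlap with the title
    return position + cue + title + chain_score # chain_score comes from lexical chains

def extract(sentences, title, k=3):
    """Rank sentences by salience and return the top k in document order."""
    title_words = set(re.findall(r"\w+", title.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: salience(sentences[i], i, len(sentences), title_words),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]
```

In a full system the `chain_score` argument would carry the strength of the lexical chains (built over WordNet relations) that pass through the sentence.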
The Zircon, March 25, 2022 [Spoof Issue]
https://digitalcollections.dordt.edu/dordt_diamond/1855/thumbnail.jp
Knowledge-based methods for automatic extraction of domain-specific ontologies
Semantic web technology aims at developing methodologies for representing large amounts of knowledge in web-accessible form. The semantics of knowledge should be easy for computer programs to interpret and understand, so that sharing and utilizing knowledge across the Web becomes possible. Domain-specific ontologies form the basis for knowledge representation in the semantic web. Research on automated development of ontologies from texts has become increasingly important because manual construction of ontologies is labor intensive and costly, and, at the same time, large amounts of text for individual domains are already available in electronic form. However, automatic extraction of domain-specific ontologies is challenging due to the unstructured nature of texts and inherent semantic ambiguities in natural language. Moreover, the large size of the texts to be processed renders full-fledged natural language processing methods infeasible. In this dissertation, we develop a set of knowledge-based techniques for automatic extraction of ontological components (concepts, taxonomic and non-taxonomic relations) from domain texts. The proposed methods combine information retrieval metrics, lexical knowledge bases (like WordNet), machine learning techniques, heuristics, and statistical approaches to meet the challenge of the task. These methods are domain-independent and fully automatic. For extraction of concepts, the proposed WNSCA+{PE, POP} method utilizes the lexical knowledge base WordNet to improve precision and recall over the traditional information retrieval metrics. A WordNet-based approach, the compound term heuristic, and a supervised learning approach are developed for taxonomy extraction. We also developed a weighted word-sense disambiguation method for use with the WordNet-based approach. An unsupervised approach using log-likelihood ratios is proposed for extracting non-taxonomic relations.
Furthermore, a supervised approach is investigated to learn the semantic constraints for identifying relations from prepositional phrases. The proposed methods are validated by experiments with the Electronic Voting and the Tender Offers, Mergers, and Acquisitions domain corpora. Experimental results and comparisons with some existing approaches clearly indicate the superiority of our methods. In summary, a good combination of information retrieval, lexical knowledge bases, statistics, and machine learning in this study has led to techniques that are efficient and effective for extracting ontological components automatically
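The log-likelihood ratio mentioned above for non-taxonomic relation extraction can be computed from a 2x2 co-occurrence table. The sketch below uses Dunning's standard G² formulation; the counts and threshold interpretation are illustrative, not the dissertation's exact procedure.

```python
import math

def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table:
    k11 = both terms co-occur, k12/k21 = one occurs without the other,
    k22 = neither occurs. Higher scores suggest a stronger association."""
    def entropy(*counts):
        # Unnormalized entropy term: N log N - sum(c log c) over nonzero counts.
        total = sum(counts)
        return total * math.log(total) - sum(c * math.log(c) for c in counts if c > 0)
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return 2.0 * (row + col - mat)
```

Term pairs (e.g. a verb and its argument) whose G² score exceeds a chosen threshold would be retained as candidate non-taxonomic relations; independent terms score near zero.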
Toponym Resolution in Text
Institute for Communicating and Collaborative Systems
Background. In the area of Geographic Information Systems (GIS), a shared discipline between
informatics and geography, the term geo-parsing is used to describe the process of identifying
names in text, which in computational linguistics is known as named entity recognition
and classification (NERC). The term geo-coding is used for the task of mapping from implicitly
geo-referenced datasets (such as structured address records) to explicitly geo-referenced
representations (e.g., using latitude and longitude). However, present-day GIS systems provide
no automatic geo-coding functionality for unstructured text.
In Information Extraction (IE), processing of named entities in text has traditionally been seen
as a two-step process comprising a flat text span recognition sub-task and an atomic classification
sub-task; relating the text span to a model of the world has been ignored by evaluations
such as MUC or ACE (Chinchor (1998); U.S. NIST (2003)).
However, spatial and temporal expressions refer to events in space-time, and the grounding of
events is a precondition for accurate reasoning. Thus, automatic grounding can improve many
applications such as automatic map drawing (e.g. for choosing a focus) and question answering
(e.g., for questions like How far is London from Edinburgh?, given a story in which both occur
and can be resolved). Whereas temporal grounding has received considerable attention in the
recent past (Mani and Wilson (2000); Setzer (2001)), robust spatial grounding has long been
neglected.
Concentrating on geographic names for populated places, I define the task of automatic
Toponym Resolution (TR) as computing the mapping from occurrences of names for places as
found in a text to a representation of the extensional semantics of the location referred to (its
referent), such as a geographic latitude/longitude footprint.
The task of mapping from names to locations is hard due to insufficient and noisy databases,
and a large degree of ambiguity: common words need to be distinguished from proper names
(geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London
can refer to the capital of the UK or to London, Ontario, Canada, or to about forty other
Londons on earth). In addition, names of places and the boundaries referred to change over
time, and databases are incomplete.
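The geo/geo ambiguity described above can be made concrete with a toy gazetteer lookup. The entries below and the "largest population" tie-breaker are illustrative assumptions only; this is one of the simple heuristic biases the thesis catalogues, not the resolution algorithm it develops.

```python
# Toy gazetteer: one name maps to several candidate referents.
# Coordinates and populations are approximate, for illustration only.
GAZETTEER = {
    "london": [
        {"lat": 51.51, "lon": -0.13, "country": "UK", "population": 8_900_000},
        {"lat": 42.98, "lon": -81.25, "country": "CA", "population": 404_000},
    ],
}

def resolve(toponym):
    """Return the candidate referent with the largest population, or None
    if the name is not in the gazetteer (geo/non-geo ambiguity is ignored)."""
    candidates = GAZETTEER.get(toponym.lower(), [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c["population"])
```

A lookup for "London" under this bias would ground the name in the UK capital rather than London, Ontario; a real resolver must weigh such biases against contextual evidence.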
Objective. I investigate how referentially ambiguous spatial named entities can be grounded,
or resolved, with respect to an extensional coordinate model robustly on open-domain news
text.
I begin by surveying the few algorithms proposed in the literature and, by comparing semi-formal,
reconstructed descriptions of them, I factor out a shared repertoire of linguistic heuristics
(e.g. rules, patterns) and extra-linguistic knowledge sources (e.g. population sizes). I then
investigate how to combine these sources of evidence to obtain a superior method. I also investigate
the noise effect introduced by the named entity tagging step that toponym resolution
relies on in a sequential system pipeline architecture.
Scope. In this thesis, I investigate a present-day snapshot of terrestrial geography as represented
in the gazetteer constructed for this work and, accordingly, a collection of present-day news text. I limit
the investigation to populated places; geo-coding of artifact names (e.g. airports or bridges)
and of compositional geographic descriptions (e.g. 40 miles SW of London, near Berlin)
is not attempted. Historic change is a major factor affecting gazetteer construction and ultimately
toponym resolution. However, this is beyond the scope of this thesis.
Method. While a small number of previous attempts have been made to solve the toponym
resolution problem, these were either not evaluated, or evaluation was done by manual inspection
of system output instead of curating a reusable reference corpus.
Since the relevant literature is scattered across several disciplines (GIS, digital libraries,
information retrieval, natural language processing) and descriptions of algorithms are mostly
given in informal prose, I attempt to systematically describe them and aim at a reconstruction
in a uniform, semi-formal pseudo-code notation for easier re-implementation. A systematic
comparison leads to an inventory of heuristics and other sources of evidence.
In order to carry out a comparative evaluation procedure, an evaluation resource is required.
Unfortunately, to date no gold standard has been curated in the research community. To this
end, a reference gazetteer and an associated novel reference corpus with human-labeled referent
annotation are created.
These are subsequently used to benchmark a selection of the reconstructed algorithms and
a novel re-combination of the heuristics catalogued in the inventory.
I then compare the performance of the same TR algorithms under three different conditions,
namely applying them to the (i) output of human named entity annotation, (ii) automatic annotation
using an existing Maximum Entropy sequence tagging model, and (iii) a naïve toponym lookup
procedure in a gazetteer.
Evaluation. The algorithms implemented in this thesis are evaluated in an intrinsic or
component evaluation. To this end, we define a task-specific matching criterion to be used with
traditional Precision (P) and Recall (R) evaluation metrics. This matching criterion is lenient
with respect to numerical gazetteer imprecision in situations where one toponym instance is
marked up with different gazetteer entries in the gold standard and the test set, respectively, but
where these refer to the same candidate referent, caused by multiple near-duplicate entries in
the reference gazetteer.
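The lenient matching criterion described above can be sketched as a coordinate-distance test feeding standard Precision and Recall. The tolerance value and function names are illustrative assumptions, not the thesis's exact criterion.

```python
import math

def close_enough(ref_a, ref_b, tolerance_km=5.0):
    """Lenient match: treat two (lat, lon) referents as the same candidate if
    their coordinates fall within a small tolerance, absorbing near-duplicate
    gazetteer entries. The 5 km tolerance is an illustrative assumption."""
    # Equirectangular approximation; adequate at small distances.
    lat_a, lon_a, lat_b, lon_b = map(math.radians, (ref_a[0], ref_a[1], ref_b[0], ref_b[1]))
    x = (lon_b - lon_a) * math.cos((lat_a + lat_b) / 2)
    y = lat_b - lat_a
    return 6371.0 * math.hypot(x, y) <= tolerance_km

def precision_recall(gold, predicted):
    """P/R over aligned toponym instances; a predicted referent counts as
    correct if it is close enough to the gold referent (None = no prediction)."""
    matched = sum(
        1 for g, p in zip(gold, predicted) if p is not None and close_enough(g, p)
    )
    retrieved = sum(1 for p in predicted if p is not None)
    precision = matched / retrieved if retrieved else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall
```

Under this criterion, two gazetteer entries for the same town with slightly different coordinates are scored as a match rather than an error.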
Main Contributions. The major contributions of this thesis are as follows:
• A new reference corpus in which instances of location named entities have been manually
annotated with spatial grounding information for populated places, and an associated
reference gazetteer, from which the assigned candidate referents are chosen. This reference
gazetteer provides numerical latitude/longitude coordinates (such as 51°32′0″ North,
0°5′0″ West) as well as hierarchical path descriptions (such as London > UK) with respect
to a wide-coverage geographic taxonomy of the world, constructed by combining several large,
but noisy gazetteers. This corpus contains news stories and comprises two sub-corpora,
a subset of the REUTERS RCV1 news corpus used for the CoNLL shared task (Tjong
Kim Sang and De Meulder (2003)), and a subset of the Fourth Message Understanding
Conference (MUC-4; Chinchor (1995)) corpus, both available pre-annotated with gold-standard named entity annotation.
This corpus will be made available as a reference evaluation resource;
• a new method and implemented system to resolve toponyms that is capable of robustly
processing unseen text (open-domain online newswire text) and grounding toponym instances
in an extensional model using longitude and latitude coordinates and hierarchical
path descriptions, using internal (textual) and external (gazetteer) evidence;
• an empirical analysis of the relative utility of various heuristic biases and other sources
of evidence with respect to the toponym resolution task when analysing free news genre
text;
• a comparison between a replicated method as described in the literature, which functions
as a baseline, and a novel algorithm based on minimality heuristics; and
• several exemplary prototypical applications to show how the resulting toponym resolution
methods can be used to create visual surrogates for news stories, a geographic exploration
tool for news browsing, geographically-aware document retrieval and to answer
spatial questions (How far...?) in an open-domain question answering system. These
applications are demonstrative only, as a thorough quantitative, task-based
(extrinsic) evaluation of the utility of automatic toponym resolution is beyond the scope of this thesis and left for future work