Pattern Matching and Discourse Processing in Information Extraction from Japanese Text
Information extraction is the task of automatically picking up information of
interest from an unconstrained text. Information of interest is usually
extracted in two steps. First, sentence level processing locates relevant
pieces of information scattered throughout the text; second, discourse
processing merges coreferential information to generate the output. In the
first step, pieces of information are locally identified without recognizing
any relationships among them. A key word search or simple pattern search can
achieve this purpose. The second step requires deeper knowledge in order to
understand relationships among separately identified pieces of information.
Previous information extraction systems focused on the first step, partly
because they were not required to link up each piece of information with other
pieces. To link the extracted pieces of information and map them onto a
structured output format, complex discourse processing is essential. This paper
reports on a Japanese information extraction system that merges information
using a pattern matcher and discourse processor. Evaluation results show a high
level of system performance which approaches human performance.
Comment: See http://www.jair.org/ for any accompanying file
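The two-step process described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's system: the regular expressions, the `Fragment` type, the English example sentences, and the "attach to the most recently mentioned company" merging rule are all assumptions made for illustration. Step 1 finds pieces of information locally with simple patterns; step 2 links them into structured records.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns, chosen only for this illustration.
PATTERNS = {
    "company": re.compile(r"\b([A-Z][A-Za-z]+ (?:Corp|Inc|Ltd))\b"),
    "amount": re.compile(r"\b(\d+(?:\.\d+)? (?:million|billion) yen)\b"),
}


@dataclass
class Fragment:
    sentence_index: int
    start: int  # character offset of the match within its sentence
    slot: str
    value: str


def sentence_level(sentences):
    """Step 1: locate pieces of information locally, without linking them."""
    fragments = []
    for i, sentence in enumerate(sentences):
        for slot, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                fragments.append(Fragment(i, match.start(), slot, match.group(1)))
    # Restore text order so the merging step sees mentions as a reader would.
    fragments.sort(key=lambda f: (f.sentence_index, f.start))
    return fragments


def discourse_level(fragments):
    """Step 2 (toy heuristic): attach each non-company fragment to the
    most recently mentioned company and build one record per company."""
    records = {}
    current = None
    for frag in fragments:
        if frag.slot == "company":
            current = frag.value
            records.setdefault(current, {})
        elif current is not None:
            records[current].setdefault(frag.slot, []).append(frag.value)
    return records


sentences = [
    "Acme Corp announced a joint venture.",
    "The investment totals 300 million yen.",
    "Beta Inc will contribute 20 million yen.",
]
fragments = sentence_level(sentences)
records = discourse_level(fragments)
print(records)
```

The second sentence never names a company, so only the merging step can tie its amount to Acme Corp. A real discourse processor would need far richer knowledge than this recency rule.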
COSPO/CENDI Industry Day Conference
The conference's objective was to provide a forum where government information managers and industry information technology experts could have an open exchange, discuss their respective needs, and compare them to the available, or soon to be available, solutions. Technical summaries and points of contact are provided for the following sessions: secure products, protocols, and encryption; information providers; electronic document management and publishing; information indexing, discovery, and retrieval (IIDR); automated language translators; IIDR - natural language capabilities; IIDR - advanced technologies; IIDR - distributed heterogeneous and large database support; and communications - speed, bandwidth, and wireless.
Named entity extraction for speech
Named entity extraction is a field that has generated much interest over recent years
with the explosion of the World Wide Web and the necessity for accurate information
retrieval. Named entity extraction, the task of finding specific entities within documents,
has proven of great benefit for numerous information extraction and information retrieval
tasks.
As well as multiple language evaluations, named entity extraction has been investigated
on a variety of media forms with varying success. In general, these media forms
have all been based upon standard text and assumed that any variation from standard
text constitutes noise.
We investigate how it is possible to find named entities in speech data. Where
others have focussed on applying named entity extraction techniques to transcriptions
of speech, we investigate a method for finding the named entities direct from the word
lattices associated with the speech signal. The results show that it is possible to improve
named entity recognition at the expense of word error rate (WER) in contrast to the
general view that F-score is directly proportional to WER.
We use a Hidden Markov Model (HMM) style approach to the task of named entity
extraction and show how it is possible to utilise an HMM to find named entities
within speech lattices. We further investigate how it is possible to improve results by
considering an alternative derivation of the joint probability of words and entities than
is traditionally used. This new derivation is particularly appropriate to speech lattices
as no presumptions are made about the sequence of words.
The HMM style approach that we use requires using a number of language models
in parallel. We have developed a system for discriminately retraining these language
models based upon the results of the output, and we show how it is possible to improve
named entity recognition by iterations over both training data and development data.
We also consider how part-of-speech (POS) can be used within word lattices. We
devise a method of labelling a word lattice with POS tags and adapt the model to make
use of these POS tags when producing the best path through the lattice. The resulting
path provides the most likely sequence of words, entities and POS tags and we show
how this new path is better than the previous path, which ignored the POS tags.
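The idea of searching a word lattice jointly for words and entity labels can be sketched as a Viterbi search over (lattice node, tag) states. This is a toy illustration only, not the model described in the abstract: the lattice, the tag set, the probability tables, the `floor` smoothing value, and the function name `best_path` are all hypothetical, and a single transition/emission table stands in for the several language models the abstract mentions.

```python
import math
from collections import defaultdict


def best_path(edges, start, end, tags, trans, emit, floor=1e-6):
    """Viterbi search over (lattice node, entity tag) states.

    edges: list of (from_node, to_node, word, acoustic_log_score); nodes are
           integers in topological order (from_node < to_node).
    trans: dict (prev_tag, tag) -> probability; "<s>" is the start tag.
    emit:  dict (tag, word) -> probability.
    Unseen events get the small probability `floor` (an assumption).
    """
    outgoing = defaultdict(list)
    for u, v, word, acoustic in edges:
        outgoing[u].append((v, word, acoustic))

    # best[(node, tag)] = (log score, previous state, word on arriving edge)
    best = {(start, "<s>"): (0.0, None, None)}
    for u in sorted(outgoing):
        states_at_u = [(s, val) for s, val in best.items() if s[0] == u]
        for (node, prev_tag), (score_so_far, _, _) in states_at_u:
            for v, word, acoustic in outgoing[u]:
                for tag in tags:
                    p = trans.get((prev_tag, tag), floor) * emit.get((tag, word), floor)
                    cand = score_so_far + acoustic + math.log(p)
                    if (v, tag) not in best or cand > best[(v, tag)][0]:
                        best[(v, tag)] = (cand, (node, prev_tag), word)

    finals = [(val[0], s) for s, val in best.items() if s[0] == end]
    if not finals:
        return None
    score, state = max(finals)

    # Follow backpointers from the end node to recover (word, tag) pairs.
    path = []
    while best[state][1] is not None:
        _, prev_state, word = best[state]
        path.append((word, state[1]))
        state = prev_state
    path.reverse()
    return score, path


# Toy lattice: "knew" is acoustically preferred over "new" between nodes 1 and 2.
edges = [
    (0, 1, "visit", -0.5),
    (1, 2, "new", -1.0),
    (1, 2, "knew", -0.8),
    (2, 3, "york", -0.7),
]
tags = ["O", "LOC"]
trans = {("<s>", "O"): 0.9, ("<s>", "LOC"): 0.1, ("O", "O"): 0.7,
         ("O", "LOC"): 0.3, ("LOC", "LOC"): 0.6, ("LOC", "O"): 0.4}
emit = {("O", "visit"): 0.5, ("O", "knew"): 0.2, ("O", "new"): 0.2,
        ("LOC", "new"): 0.4, ("LOC", "york"): 0.5, ("O", "york"): 0.01}

score, path = best_path(edges, 0, 3, tags, trans, emit)
print(path)
```

In this toy lattice the joint search picks "new" over the acoustically preferred "knew", because "new york" scores well as a location. That mirrors, in miniature, the abstract's point that the path chosen for entities need not be the path with the best word-level score. POS tags could be added in the same way by extending each state with a third label.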