Extracting Noun Phrases from Large-Scale Texts: A Hybrid Approach and Its Automatic Evaluation
Acquiring noun phrases from running text is useful for many applications,
such as word grouping and terminology indexing. The reported literature adopts
either a purely probabilistic approach or a purely rule-based noun-phrase grammar to tackle
this problem. In this paper, we apply a probabilistic chunker to decide the
implicit boundaries of constituents and utilize the linguistic knowledge to
extract the noun phrases by a finite state mechanism. The test texts are from the
SUSANNE Corpus, and the results are evaluated automatically by comparing them with the parse field of
the SUSANNE Corpus. The results of this preliminary experiment are
encouraging.
Comment: 8 pages, Postscript file, Unix compressed, uuencode
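The finite-state extraction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' grammar: the pattern (optional determiner, any adjectives, one or more nouns), the Penn-style tags `DT`, `JJ`, `NN`, and the function name `extract_noun_phrases` are all assumptions made for illustration, and the input is assumed to be already tagged.

```python
from typing import List, Tuple


def extract_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Match the assumed pattern DT? JJ* NN+ over (word, tag) pairs."""
    phrases: List[str] = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # Optional determiner.
        if j < n and tagged[j][1] == "DT":
            j += 1
        # Zero or more adjectives.
        while j < n and tagged[j][1].startswith("JJ"):
            j += 1
        # One or more nouns (NN, NNS, NNP, ...).
        k = j
        while k < n and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:
            # At least one noun was consumed: accept the span i..k.
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            # No noun found from this start position: move on by one token.
            i += 1
    return phrases


sentence = [
    ("the", "DT"), ("probabilistic", "JJ"), ("chunker", "NN"),
    ("finds", "VBZ"), ("noun", "NN"), ("phrases", "NNS"),
]
print(extract_noun_phrases(sentence))
```

For the sample sentence this yields the two spans "the probabilistic chunker" and "noun phrases". A real system would run such a matcher inside the chunk boundaries proposed by the probabilistic chunker rather than over the whole sentence.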
Determining the Unithood of Word Sequences using Mutual Information and Independence Measure
Most work related to unithood has been conducted as part of a larger effort to
determine termhood. Consequently, the number of independent studies
that examine the notion of unithood and produce dedicated techniques for
measuring it is extremely small. We propose a new approach, independent
of any influences of termhood, that provides dedicated measures to gather
linguistic evidence from parsed text and statistical evidence from the Google
search engine for the measurement of unithood. Our evaluations revealed a
precision and recall of 98.68% and 91.82%, respectively, with an accuracy of
95.42% in measuring the unithood of 1005 test cases.
Comment: More information is available at
http://explorer.csse.uwa.edu.au/reference
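As a rough illustration of the statistical side of such a measure, the sketch below computes standard pointwise mutual information from co-occurrence counts. The abstract does not give the paper's exact formulas, so this is the textbook definition rather than the authors' measure; the function name `pmi` and the idea of passing raw counts (for example, search-engine page counts) plus a corpus size `total` are assumptions.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Pointwise mutual information, log2(p(x, y) / (p(x) * p(y))).

    All counts are assumed to come from the same collection of `total` units.
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("all counts must be positive")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Two words that co-occur ten times more often than chance would predict.
print(pmi(count_xy=10, count_x=100, count_y=100, total=10_000))
```

A score near zero means the two words co-occur about as often as independence predicts; larger positive scores suggest the sequence behaves as a unit.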
A Study of Metrics of Distance and Correlation Between Ranked Lists for Compositionality Detection
Compositionality in language refers to how much the meaning of some phrase
can be decomposed into the meaning of its constituents and the way these
constituents are combined. Based on the premise that substitution by synonyms
is meaning-preserving, compositionality can be approximated as the semantic
similarity between a phrase and a version of that phrase where words have been
replaced by their synonyms. Different ways of representing such phrases exist
(e.g., vectors [1] or language models [2]), and the choice of representation
affects the measurement of semantic similarity.
We propose a new compositionality detection method that represents phrases as
ranked lists of term weights. Our method approximates the semantic similarity
between two ranked list representations using a range of well-known distance
and correlation metrics. In contrast to most state-of-the-art approaches in
compositionality detection, our method is completely unsupervised. Experiments
with a publicly available dataset of 1048 human-annotated phrases show that,
compared to strong supervised baselines, our approach provides superior
measurement of compositionality using any of the distance and correlation
metrics considered.
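To make the comparison of ranked lists concrete, here is a minimal sketch of two well-known rank correlation coefficients, Spearman's rho and Kendall's tau. The abstract does not name the metrics it uses, so choosing these two is an assumption, as are the function names. The sketch also assumes both lists rank exactly the same terms with no ties, which will not hold in general for a phrase and its synonym-substituted variant.

```python
from itertools import combinations
from typing import Dict, List


def _positions(ranking: List[str]) -> Dict[str, int]:
    """Map each term to its 0-based rank position."""
    return {term: rank for rank, term in enumerate(ranking)}


def _check(a: List[str], b: List[str]) -> None:
    if len(a) < 2 or len(set(a)) != len(a) or len(b) != len(a) or set(a) != set(b):
        raise ValueError("lists must rank the same distinct terms (at least two)")


def spearman_rho(a: List[str], b: List[str]) -> float:
    """Spearman's rho for two tie-free rankings of the same terms."""
    _check(a, b)
    pos_a, pos_b = _positions(a), _positions(b)
    n = len(a)
    d2 = sum((pos_a[t] - pos_b[t]) ** 2 for t in a)
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(a: List[str], b: List[str]) -> float:
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    _check(a, b)
    pos_b = _positions(b)
    score = 0
    for x, y in combinations(a, 2):  # x precedes y in ranking a
        score += 1 if pos_b[x] < pos_b[y] else -1
    n = len(a)
    return score / (n * (n - 1) / 2)


phrase = ["red", "tape", "bureaucracy", "paper"]
variant = ["tape", "red", "bureaucracy", "paper"]
print(spearman_rho(phrase, variant), kendall_tau(phrase, variant))
```

Under this reading, a high correlation between the two rankings would indicate that synonym substitution preserved the meaning, that is, a more compositional phrase.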
Building a Generation Knowledge Source using Internet-Accessible Newswire
In this paper, we describe a method for automatic creation of a knowledge
source for text generation using information extraction over the Internet. We
present a prototype system called PROFILE which uses a client-server
architecture to extract noun-phrase descriptions of entities such as people,
places, and organizations. The system serves two purposes: as an information
extraction tool, it allows users to search for textual descriptions of
entities; as a utility to generate functional descriptions (FD), it is used in
a functional-unification based generation system. We present an evaluation of
the approach and its applications to natural language generation and
summarization.
Comment: 8 pages, uses eps
Automatic domain ontology extraction for context-sensitive opinion mining
Automated analysis of the sentiments presented in online consumer feedback can facilitate both organizations’ business strategy development and individual consumers’ comparison shopping. Nevertheless, existing opinion mining methods either adopt a context-free sentiment classification approach or rely on a large number of manually annotated training examples to perform context-sensitive sentiment classification. Guided by the design science research methodology, we illustrate the design, development, and evaluation of a novel fuzzy domain ontology based context-sensitive opinion mining system. Our novel ontology extraction mechanism, underpinned by a variant of Kullback-Leibler divergence, can automatically acquire contextual sentiment knowledge across various product domains to improve the sentiment analysis process. Evaluated on a benchmark dataset and real consumer reviews collected from Amazon.com, our system shows remarkable performance improvement over the context-free baseline.
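The abstract mentions a variant of Kullback-Leibler divergence without specifying it. For orientation only, the sketch below computes the standard KL divergence between two discrete term distributions; the function name `kl_divergence`, the dictionary representation, and the `eps` floor for terms missing from the second distribution are assumptions, not the paper's variant.

```python
import math
from typing import Dict


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-12) -> float:
    """Standard KL divergence D(p || q) in bits over a shared vocabulary.

    Terms absent from q are floored at `eps` (an assumed smoothing choice)
    so the logarithm stays finite.
    """
    total = 0.0
    for term, p_t in p.items():
        if p_t > 0:
            q_t = max(q.get(term, 0.0), eps)
            total += p_t * math.log2(p_t / q_t)
    return total


positive_reviews = {"quiet": 0.5, "cheap": 0.5}
all_reviews = {"quiet": 0.25, "cheap": 0.75}
print(kl_divergence(positive_reviews, all_reviews))
```

A term distribution that diverges strongly from the background distribution points to terms that are characteristic of one sentiment context, which is the kind of signal an ontology extraction step could use.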
Concept-based Interactive Query Expansion Support Tool (CIQUEST)
This report describes a three-year project (2000-03) undertaken in the Information Studies
Department at The University of Sheffield and funded by Resource, The Council for
Museums, Archives and Libraries. The overall aim of the research was to provide user
support for query formulation and reformulation in searching large-scale textual resources
including those of the World Wide Web. More specifically the objectives were: to investigate
and evaluate methods for the automatic generation and organisation of concepts derived from
retrieved document sets, based on statistical methods for term weighting; and to conduct
user-based evaluations on the understanding, presentation and retrieval effectiveness of
concept structures in selecting candidate terms for interactive query expansion.
The TREC test collection formed the basis for the seven evaluative experiments conducted in
the course of the project. These formed four distinct phases in the project plan. In the first
phase, a series of experiments was conducted to investigate further techniques for concept
derivation and hierarchical organisation and structure. The second phase was concerned with
user-based validation of the concept structures. The results of phases 1 and 2 informed the
design of the test system, and the user interface was developed in phase 3. The final phase
entailed a user-based summative evaluation of the CiQuest system.
The main findings demonstrate that concept hierarchies can effectively be generated from
sets of retrieved documents and displayed to searchers in a meaningful way. The approach
provides the searcher with an overview of the contents of the retrieved documents, which in
turn facilitates the viewing of documents and selection of the most relevant ones. Concept
hierarchies are a good source of terms for query expansion and can improve precision. The
extraction of descriptive phrases as an alternative source of terms was also effective. With
respect to presentation, cascading menus were easy to browse for selecting terms and for
viewing documents. In conclusion, the project dissemination programme and future work are
outlined.
Duration modeling with semi-Markov Conditional Random Fields for keyphrase extraction
Existing methods for keyphrase extraction need preprocessing to generate
candidate phrases or post-processing to transform keywords into keyphrases. In
this paper, we propose a novel approach called duration modeling with
semi-Markov Conditional Random Fields (DM-SMCRFs) for keyphrase extraction.
First, based on the properties of semi-Markov chains, DM-SMCRFs can encode
segment-level features and sequentially classify each phrase in a sentence as a
keyphrase or non-keyphrase. Second, by assuming independence between state
transition and state duration, DM-SMCRFs model the distribution of the duration
(length) of keyphrases to further exploit state duration information, which can
help identify the length of a keyphrase. Based on the convexity of the parametric
duration feature derived from the duration distribution, a constrained Viterbi
algorithm is derived to improve the performance of decoding in DM-SMCRFs. We
thoroughly evaluate the performance of DM-SMCRFs on datasets from various
domains. The experimental results demonstrate the effectiveness of the proposed
model.
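Segment-level decoding of the kind this abstract describes can be sketched as a semi-Markov Viterbi search over labeled spans. This is a simplified illustration, not the DM-SMCRF decoder: it has no label-transition scores, the `score` callback stands in for a trained model, and the hard cap `max_len` on segment length is an assumed stand-in for the duration-based constraint the paper derives.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, label)


def semi_markov_viterbi(
    n: int,
    labels: Sequence[str],
    score: Callable[[int, int, str], float],
    max_len: int,
) -> List[Segment]:
    """Best segmentation of positions 0..n-1 into labeled spans.

    best[end] holds the top score of any segmentation of the first `end`
    tokens; each segment is at most `max_len` tokens long. `score` is
    assumed to return finite values.
    """
    best = [float("-inf")] * (n + 1)
    back: List[Optional[Segment]] = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            for label in labels:
                cand = best[start] + score(start, end, label)
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, end, label)
    # Follow back-pointers from the end of the sentence.
    segments: List[Segment] = []
    pos = n
    while pos > 0:
        seg = back[pos]
        assert seg is not None
        segments.append(seg)
        pos = seg[0]
    segments.reverse()
    return segments


tokens = ["we", "propose", "conditional", "random", "fields", "today"]


def toy_score(start: int, end: int, label: str) -> float:
    """Hand-written scores: reward the span 'conditional random fields'."""
    if label == "KP":
        return 3.0 if (start, end) == (2, 5) else -1.0
    return 0.0 if end - start == 1 else -0.5


print(semi_markov_viterbi(len(tokens), ["O", "KP"], toy_score, max_len=4))
```

Because whole spans are scored at once, the keyphrase is labeled as a single segment and no post-processing step is needed to merge keywords into a phrase.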