MBT: A Memory-Based Part of Speech Tagger-Generator
We introduce a memory-based approach to part of speech tagging. Memory-based
learning is a form of supervised learning based on similarity-based reasoning.
The part of speech tag of a word in a particular context is extrapolated from
the most similar cases held in memory. Supervised learning approaches are
useful when a tagged corpus is available as an example of the desired output of
the tagger. Based on such a corpus, the tagger-generator automatically builds a
tagger which is able to tag new text the same way, diminishing development time
for the construction of a tagger considerably. Memory-based tagging shares this
advantage with other statistical or machine learning approaches. Additional
advantages specific to a memory-based approach include (i) the relatively small
tagged corpus size sufficient for training, (ii) incremental learning, (iii)
explanation capabilities, (iv) flexible integration of information in case
representations, (v) its non-parametric nature, (vi) reasonably good results on
unknown words without morphological analysis, and (vii) fast learning and
tagging. In this paper we show that a large-scale application of the
memory-based approach is feasible: we obtain a tagging accuracy that is on a
par with that of known statistical approaches, and with attractive space and
time complexity properties when using IGTree, a tree-based formalism for
indexing and searching huge case bases. The use of IGTree has the additional
advantage that the optimal context size for disambiguation is dynamically computed.
Comment: 14 pages, 2 Postscript figures
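The core idea of the abstract above, extrapolating a word's tag from the most similar stored cases, can be illustrated with a minimal sketch. This is not MBT or IGTree: it is a plain nearest-neighbour tagger with an unweighted overlap metric, and the feature set (previous word, focus word, next word), the function names, and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter


def make_cases(tagged_sentences):
    """Turn tagged sentences into (features, tag) cases kept in memory.

    Features are (previous word, focus word, next word); this choice is an
    illustrative assumption, not the case representation used by MBT.
    """
    cases = []
    for sent in tagged_sentences:
        words = [w for w, _ in sent]
        for i, (word, tag) in enumerate(sent):
            prev_w = words[i - 1] if i > 0 else "<s>"
            next_w = words[i + 1] if i + 1 < len(words) else "</s>"
            cases.append(((prev_w, word, next_w), tag))
    return cases


def overlap_distance(a, b):
    """Count the feature positions at which two cases differ."""
    return sum(x != y for x, y in zip(a, b))


def tag_word(features, memory):
    """Return the majority tag among the nearest stored cases."""
    best = min(overlap_distance(features, f) for f, _ in memory)
    votes = Counter(t for f, t in memory if overlap_distance(features, f) == best)
    return votes.most_common(1)[0][0]


def tag_sentence(words, memory):
    """Tag each word of an untagged sentence by nearest-neighbour lookup."""
    tags = []
    for i, word in enumerate(words):
        prev_w = words[i - 1] if i > 0 else "<s>"
        next_w = words[i + 1] if i + 1 < len(words) else "</s>"
        tags.append(tag_word((prev_w, word, next_w), memory))
    return tags


# Toy tagged corpus (hypothetical data, for illustration only).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("walk", "NOUN"), ("ends", "VERB")],
    [("they", "PRON"), ("walk", "VERB"), ("home", "ADV")],
]
memory = make_cases(corpus)
print(tag_sentence(["the", "walk"], memory))   # ['DET', 'NOUN']
print(tag_sentence(["they", "walk"], memory))  # ['PRON', 'VERB']
```

In this sketch, learning is just storing cases, so new tagged sentences can be added incrementally by appending to the memory, and the ambiguous word "walk" is disambiguated by its context. A linear scan over all cases would not scale to a large corpus; the abstract names IGTree as the indexing formalism that addresses this.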
Textpresso for Neuroscience: Searching the Full Text of Thousands of Neuroscience Research Papers
Textpresso is a text-mining system for scientific literature. Its two major features are access to the full text of research papers and the development and use of categories of biological concepts as well as categories that describe or relate objects. A search engine enables the user to search for one or a combination of these categories and/or keywords within an entire literature. Here we describe Textpresso for
Neuroscience, part of the core Neuroscience Information Framework
(NIF). The Textpresso site currently consists of 67,500 full text
papers and 131,300 abstracts. We show that using categories in
literature can make a pure keyword query more refined and meaningful.
We also show how semantic queries can be formulated with categories
only. We explain the build and content of the database and describe the
main features of the web pages and the advanced search options. We also
give detailed illustrations of the web service developed to provide
programmatic access to Textpresso. This web service is used by the NIF
interface to access Textpresso. The standalone website of Textpresso
for Neuroscience can be accessed at
http://www.textpresso.org/neuroscience
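The combination of categories and keywords described in the abstract above can be pictured with a small sketch. It does not use Textpresso's web service or its real category names: the lexicon, the category labels, the function names, and the choice to match per sentence are all hypothetical.

```python
import re

# Hypothetical mini-lexicon mapping category names to member terms.
CATEGORIES = {
    "brain_region": {"hippocampus", "amygdala"},
    "regulation": {"regulates", "inhibits"},
}


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(sentence, categories=(), keywords=()):
    """True if the sentence has a term from every category and every keyword."""
    tokens = tokenize(sentence)
    has_categories = all(tokens & CATEGORIES[c] for c in categories)
    has_keywords = all(k.lower() in tokens for k in keywords)
    return has_categories and has_keywords


def search(sentences, categories=(), keywords=()):
    """Return the sentences that satisfy the combined query."""
    return [s for s in sentences if matches(s, categories, keywords)]


# Toy sentences (hypothetical data, for illustration only).
sentences = [
    "GABA inhibits neurons in the hippocampus.",
    "The amygdala is discussed here.",
    "Dopamine regulates reward.",
]
print(search(sentences, categories=["brain_region", "regulation"]))
print(search(sentences, categories=["brain_region"], keywords=["GABA"]))
```

A query that uses only categories, such as the first call, returns sentences relating a brain region to a regulatory term without naming either one. Adding a category to a keyword query, as in the second call, narrows the keyword's matches.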