22,825 research outputs found
Part of Speech Based Term Weighting for Information Retrieval
Automatic language processing tools typically assign to terms so-called
weights corresponding to the contribution of terms to information content.
Traditionally, term weights are computed from lexical statistics, e.g., term
frequencies. We propose a new type of term weight that is computed from part of
speech (POS) n-gram statistics. The proposed POS-based term weight represents
how informative a term is in general, based on the POS contexts in which it
generally occurs in language. We suggest five different computations of
POS-based term weights by extending existing statistical approximations of term
information measures. We apply these POS-based term weights to information
retrieval, by integrating them into the model that matches documents to
queries. Experiments with two TREC collections and 300 queries, using TF-IDF &
BM25 as baselines, show that integrating our POS-based term weights to
retrieval always leads to gains (up to +33.7% from the baseline). Additional
experiments with a different retrieval model as baseline (Language Model with
Dirichlet priors smoothing) and our best performing POS-based term weight, show
retrieval gains always and consistently across the whole smoothing range of the
baseline
Natural Language Processing for Information Retrieval and Knowledge Discovery
Natural Language Processing (NLP) is a powerful technology for the vital tasks of information retrieval (IR) and knowledge discovery (KD) which, in turn, feed the visualization systems of the present and future and enable knowledge workers to focus more of their time on the vital tasks of analysis and prediction.published or submitted for publicatio
Language-based multimedia information retrieval
This paper describes various methods and approaches for language-based multimedia information retrieval, which have been developed in the projects POP-EYE and OLIVE and which will be developed further in the MUMIS project. All of these project aim at supporting automated indexing of video material by use of human language technologies. Thus, in contrast to image or sound-based retrieval methods, where both the query language and the indexing methods build on non-linguistic data, these methods attempt to exploit advanced text retrieval technologies for the retrieval of non-textual material. While POP-EYE was building on subtitles or captions as the prime language key for disclosing video fragments, OLIVE is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which then serve as the basis for text-based retrieval functionality
Classifying Amharic News Text Using Self-Organizing Maps
The paper addresses using artificial neural networks for classification of Amharic news items. Amharic is the language for countrywide communication in Ethiopia and has its own writing system containing extensive systematic redundancy. It is quite dialectally diversified and probably representative of the languages of a continent that so far has received little attention within the language processing field.
The experiments investigated document clustering around user queries using Self-Organizing Maps, an unsupervised learning neural network strategy. The best ANN model showed a precision of 60.0% when trying to cluster unseen data, and a 69.5% precision when trying to classify it
- …